Crack the PyTorch interview. Practice implementing operators and architectures from scratch.
Like LeetCode, but for tensors. Instant feedback. Reference solutions.
| # | Problem | Difficulty | Key Concepts |
|---|---|---|---|
| 01 | ReLU | Easy | Activation functions, element-wise ops |
| 02 | Softmax | Easy | Numerical stability, exp/log tricks |
| 16 | Cross-Entropy Loss | Easy | Log-softmax, logsumexp trick |
| 17 | Dropout | Easy | Train/eval mode, inverted scaling |
| 18 | Embedding | Easy | Lookup table, weight[indices] |
| 19 | GELU | Easy | Gaussian error linear unit, torch.erf |
| 20 | Kaiming Init | Easy | std = sqrt(2/fan_in), variance scaling |
| 21 | Gradient Clipping | Easy | Norm-based clipping, direction preservation |
| 31 | Gradient Accumulation | Easy | Micro-batching, loss scaling |
| 40 | Linear Regression | Medium | Normal equation, GD from scratch, nn.Linear |
| 03 | Linear Layer | Medium | y = xW^T + b, Kaiming init, nn.Parameter |
| 04 | LayerNorm | Medium | Per-sample normalization over features, affine transform |
| 07 | BatchNorm | Medium | Batch vs layer statistics, train/eval behavior |
| 08 | RMSNorm | Medium | LLaMA-style norm, simpler than LayerNorm |
| 15 | SwiGLU MLP | Medium | Gated FFN, SiLU(gate) * up, LLaMA/Mistral-style |
| 22 | Conv2d | Medium | Convolution, unfold, stride/padding |
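Several of the easy problems above hinge on one trick: problem 02's "exp/log tricks" (and the logsumexp in 16) come down to subtracting the row max before exponentiating. A minimal framework-free sketch in plain Python (no torch; `softmax` here is an illustrative standalone function, not the library API):

```python
import math

def softmax(xs):
    # softmax(x) == softmax(x - c) for any constant c, so subtracting
    # the max keeps every exponent <= 0 and avoids overflow in exp().
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
```

The same max-subtraction inside a log gives the logsumexp trick behind stable cross-entropy.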
| # | Problem | Difficulty | Key Concepts |
|---|---|---|---|
| 23 | Cross-Attention | Medium | Encoder-decoder, Q from decoder, K/V from encoder |
| 05 | Scaled Dot-Product Attention | Hard | softmax(QK^T/sqrt(d_k))V, the foundation |
| 06 | Multi-Head Attention | Hard | Parallel heads, split/concat, projection matrices |
| 09 | Causal Self-Attention | Hard | Autoregressive masking with -inf, GPT-style |
| 10 | Grouped Query Attention | Hard | GQA (LLaMA 2), KV sharing across heads |
| 11 | Sliding Window Attention | Hard | Mistral-style local attention, O(n*w) complexity |
| 12 | Linear Attention | Hard | Kernel trick, O(n*d^2) |
| 14 | KV Cache Attention | Hard | Incremental decoding, cache K/V, prefill vs decode |
| 24 | RoPE | Hard | Rotary position embedding, relative position via rotation |
| 25 | Flash Attention | Hard | Tiled attention, online softmax, memory-efficient |
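Every variant in this table builds on problem 05's formula, softmax(QK^T / sqrt(d_k)) V; the causal variant (09) just sets future positions to -inf before the softmax. A toy plain-Python sketch on nested lists (no torch; `sdpa` and `matmul` are illustrative names, not library APIs):

```python
import math

def matmul(A, B):
    # Plain nested-list matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sdpa(Q, K, V, causal=False):
    d_k = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                 # (n_q, n_k) raw dot products
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for i, row in enumerate(scores):
        row = [s * scale for s in row]
        if causal:
            # Query i may only attend to keys j <= i; exp(-inf) == 0.
            row = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(row)                        # max-subtraction for stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return matmul(weights, V)               # convex combination of value rows
```

Cross-attention (23) is the same routine with Q from one sequence and K/V from another; GQA, sliding windows, and the KV cache all change only where Q, K, V, or the mask come from.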
| # | Problem | Difficulty | Key Concepts |
|---|---|---|---|
| 26 | LoRA | Medium | Low-rank adaptation, frozen base + BA update |
| 27 | ViT Patch Embedding | Medium | Image to patches to linear projection |
| 13 | GPT-2 Block | Hard | Pre-norm, causal MHA + MLP (4x, GELU), residual |
| 28 | Mixture of Experts | Hard | Mixtral-style, top-k routing, expert MLPs |
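For problem 26, the "frozen base + BA update" means the forward pass becomes h = Wx + (alpha/r) * B(Ax), where only the low-rank factors A (r x d_in) and B (d_out x r) receive gradients. A plain-Python sketch of just the forward math (shapes and the alpha/r scaling follow the common LoRA convention; `lora_forward` is an illustrative name):

```python
def matvec(M, v):
    # Matrix-vector product on nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=4.0, r=2):
    # W (d_out x d_in) stays frozen; A (r x d_in) and B (d_out x r) train.
    # B is initialized to zeros, so the adapter starts as a no-op.
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))   # low-rank update B(Ax), never materialize BA
    s = alpha / r
    return [b + s * d for b, d in zip(base, delta)]
```

Computing B(Ax) instead of (BA)x is the point: two thin matvecs cost O(r(d_in + d_out)) rather than forming the full d_out x d_in update.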
| # | Problem | Difficulty | Key Concepts |
|---|---|---|---|
| 29 | Adam Optimizer | Medium | Momentum + RMSProp, bias correction |
| 30 | Cosine LR Scheduler | Medium | Linear warmup + cosine annealing |
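Problem 30 combines two pieces: the learning rate ramps linearly up to its peak over the warmup steps, then follows a half-cosine down to a floor. A small self-contained sketch (plain Python; the function name and signature are illustrative):

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    # Warmup: ramp linearly from ~0 up to max_lr over warmup_steps.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Anneal: half-cosine from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)             # hold at min_lr past total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```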
| # | Problem | Difficulty | Key Concepts |
|---|---|---|---|
| 32 | Top-k / Top-p Sampling | Medium | Nucleus sampling, temperature scaling |
| 33 | Beam Search | Medium | Hypothesis expansion, pruning, EOS handling |
| 34 | Speculative Decoding | Hard | Accept/reject, draft model acceleration |
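For problem 32, nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability reaches p, then renormalizes over that set before drawing a sample. A framework-free sketch of the filtering step (names are illustrative):

```python
def top_p_filter(probs, p=0.9):
    # Sort token ids by probability, keep the smallest prefix whose
    # cumulative mass reaches p, then renormalize the survivors.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}  # token id -> renormalized prob
```

Temperature scaling happens upstream: divide the logits by T before the softmax that produces `probs`, so T < 1 sharpens and T > 1 flattens the distribution.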
| # | Problem | Difficulty | Key Concepts |
|---|---|---|---|
| 35 | BPE Tokenizer | Hard | Byte-pair encoding, merge rules, subword splits |
| 36 | INT8 Quantization | Hard | Per-channel quantize, scale/zero-point |
| 37 | DPO Loss | Hard | Direct preference optimization, alignment training |
| 38 | GRPO Loss | Hard | Group relative policy optimization, RLAIF |
| 39 | PPO Loss | Hard | PPO clipped surrogate loss, policy gradient |
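Problem 36's scale/zero-point pair defines an affine map q = round(x / scale) + zero_point, clamped to [-128, 127]; per-channel quantization just computes one such pair per output channel. A per-tensor plain-Python sketch, assuming the asymmetric int8 convention (function names are illustrative):

```python
def quantize_int8(xs):
    # Affine (asymmetric) quantization: q = round(x / scale) + zero_point.
    lo, hi = min(min(xs), 0.0), max(max(xs), 0.0)  # ensure 0.0 is exactly representable
    scale = (hi - lo) / 255.0 or 1.0               # guard against an all-zero input
    zero_point = round(-128 - lo / scale)          # real value lo maps to -128
    q = [max(-128, min(127, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Inverse map: x_hat = (q - zero_point) * scale; error is at most ~scale/2.
    return [(v - zero_point) * scale for v in q]
```

Including 0.0 in the [lo, hi] range guarantees zero round-trips exactly, which matters for padding and ReLU outputs.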