
TorchCode

Crack the PyTorch interview. Practice implementing operators and architectures from scratch.

Like LeetCode, but for tensors. Instant feedback. Reference solutions.

40 problems · 6 categories · no GPU required

Fundamentals

| #  | Problem               | Difficulty | Key Concepts                                        |
|----|-----------------------|------------|-----------------------------------------------------|
| 01 | ReLU                  | Easy       | Activation functions, element-wise ops              |
| 02 | Softmax               | Easy       | Numerical stability, exp/log tricks (sketch below)  |
| 16 | Cross-Entropy Loss    | Easy       | Log-softmax, logsumexp trick                        |
| 17 | Dropout               | Easy       | Train/eval mode, inverted scaling                   |
| 18 | Embedding             | Easy       | Lookup table, weight[indices]                       |
| 19 | GELU                  | Easy       | Gaussian error linear unit, torch.erf               |
| 20 | Kaiming Init          | Easy       | std = sqrt(2 / fan_in), variance scaling            |
| 21 | Gradient Clipping     | Easy       | Norm-based clipping, direction preservation         |
| 31 | Gradient Accumulation | Easy       | Micro-batching, loss scaling                        |
| 40 | Linear Regression     | Medium     | Normal equation, GD from scratch, nn.Linear         |
| 03 | Linear Layer          | Medium     | y = xW^T + b, Kaiming init, nn.Parameter            |
| 04 | LayerNorm             | Medium     | Per-sample normalization, affine transform          |
| 07 | BatchNorm             | Medium     | Batch vs. layer statistics, running stats, train/eval behavior |
| 08 | RMSNorm               | Medium     | LLaMA-style norm, simpler than LayerNorm            |
| 15 | SwiGLU MLP            | Medium     | Gated FFN, SiLU(gate) * up, LLaMA/Mistral-style     |
| 22 | Conv2d                | Medium     | Convolution, unfold, stride/padding                 |
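
For a flavor of the Easy tier, here is a minimal sketch of the max-subtraction trick behind problem 02 (Softmax). It is an illustrative implementation, not the site's reference solution:

```python
import torch

def softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtract the per-row max so exp() never overflows;
    # softmax is invariant to shifting its inputs by a constant.
    shifted = x - x.max(dim=dim, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=dim, keepdim=True)

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])  # naive exp() would overflow
print(softmax(logits))                # tensor([[0.0900, 0.2447, 0.6652]])
print(torch.softmax(logits, dim=-1))  # matches the built-in
```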

Attention

| #  | Problem                      | Difficulty | Key Concepts                                          |
|----|------------------------------|------------|-------------------------------------------------------|
| 23 | Cross-Attention              | Medium     | Encoder-decoder, Q from decoder, K/V from encoder     |
| 05 | Scaled Dot-Product Attention | Hard       | softmax(QK^T / sqrt(d_k)) V, the foundation (sketch below) |
| 06 | Multi-Head Attention         | Hard       | Parallel heads, split/concat, projection matrices     |
| 09 | Causal Self-Attention        | Hard       | Autoregressive masking with -inf, GPT-style           |
| 10 | Grouped Query Attention      | Hard       | GQA (LLaMA 2), KV sharing across heads                |
| 11 | Sliding Window Attention     | Hard       | Mistral-style local attention, O(n*w) complexity      |
| 12 | Linear Attention             | Hard       | Kernel trick, O(n*d^2) complexity                     |
| 14 | KV Cache Attention           | Hard       | Incremental decoding, cache K/V, prefill vs. decode   |
| 24 | RoPE                         | Hard       | Rotary position embedding, relative position via rotation |
| 25 | Flash Attention              | Hard       | Tiled attention, online softmax, memory-efficient     |
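
Problems 05 and 09 share the same core. Below is a minimal sketch of softmax(QK^T / sqrt(d_k)) V with an optional causal mask, assuming (batch, heads, seq, d_k) inputs; an illustrative sketch, not the reference solution:

```python
import math
import torch

def sdpa(q, k, v, causal: bool = False):
    # q, k, v: (batch, heads, seq, d_k). Dividing by sqrt(d_k) keeps the
    # score variance roughly 1 so softmax does not saturate.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        # Upper-triangular -inf mask: position i may not attend to j > i.
        n = scores.size(-1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=scores.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 2, 4, 8)  # (batch, heads, seq, d_k)
print(sdpa(q, k, v, causal=True).shape)  # torch.Size([1, 2, 4, 8])
```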

Architecture

| #  | Problem             | Difficulty | Key Concepts                                       |
|----|---------------------|------------|----------------------------------------------------|
| 26 | LoRA                | Medium     | Low-rank adaptation, frozen base + BA update (sketch below) |
| 27 | ViT Patch Embedding | Medium     | Image to patches to linear projection              |
| 13 | GPT-2 Block         | Hard       | Pre-norm, causal MHA + MLP (4x, GELU), residual    |
| 28 | Mixture of Experts  | Hard       | Mixtral-style, top-k routing, expert MLPs          |
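
As a taste of problem 26, a minimal LoRA sketch: the pretrained weight stays frozen and a low-rank BA update is trained on top. The r/alpha defaults here are illustrative assumptions, not the problem's spec:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```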

Training

| #  | Problem             | Difficulty | Key Concepts                                   |
|----|---------------------|------------|------------------------------------------------|
| 29 | Adam Optimizer      | Medium     | Momentum + RMSProp, bias correction            |
| 30 | Cosine LR Scheduler | Medium     | Linear warmup + cosine annealing (sketch below) |
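
A compact sketch of problem 30's schedule: a linear ramp for the first `warmup` steps, then cosine decay to `min_lr`. The step counts and learning rates in the demo are arbitrary:

```python
import math

def lr_at(step: int, max_lr: float, warmup: int, total: int, min_lr: float = 0.0) -> float:
    if step < warmup:
        # Linear ramp from ~0 up to max_lr over the warmup steps.
        return max_lr * (step + 1) / warmup
    # Cosine anneal from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 99, 100, 550, 999):
    print(s, round(lr_at(s, max_lr=3e-4, warmup=100, total=1000), 6))
```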

Inference

| #  | Problem                | Difficulty | Key Concepts                                    |
|----|------------------------|------------|-------------------------------------------------|
| 32 | Top-k / Top-p Sampling | Medium     | Nucleus sampling, temperature scaling (sketch below) |
| 33 | Beam Search            | Medium     | Hypothesis expansion, pruning, EOS handling     |
| 34 | Speculative Decoding   | Hard       | Accept/reject, draft model acceleration         |
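
Problem 32 in miniature: temperature scaling, then top-k filtering, then trimming to the nucleus (top-p). A sketch under the assumption of a 1-D logits vector; batched variants need more bookkeeping:

```python
import torch

def sample(logits: torch.Tensor, temperature: float = 1.0,
           top_k: int = 50, top_p: float = 0.9) -> int:
    logits = logits / temperature
    if top_k is not None:
        # Drop everything below the k-th largest logit.
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cum = sorted_probs.cumsum(-1)
    # Zero out tokens whose cumulative mass (before including them) exceeds top_p.
    sorted_probs[cum - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, 1)].item()

print(sample(torch.randn(100)))
```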

Advanced

| #  | Problem           | Difficulty | Key Concepts                                         |
|----|-------------------|------------|------------------------------------------------------|
| 35 | BPE Tokenizer     | Hard       | Byte-pair encoding, merge rules, subword splits      |
| 36 | INT8 Quantization | Hard       | Per-channel quantization, scale/zero-point           |
| 37 | DPO Loss          | Hard       | Direct preference optimization, alignment training (sketch below) |
| 38 | GRPO Loss         | Hard       | Group relative policy optimization, RLAIF            |
| 39 | PPO Loss          | Hard       | Clipped surrogate objective, policy gradient         |
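
To ground the alignment losses, a minimal sketch of problem 37's DPO objective, assuming summed per-response log-probs have already been computed for both the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta: float = 0.1):
    # Each input: summed log-probs of a full response, shape (batch,).
    # The implicit reward is the policy/reference log-ratio; the loss
    # pushes the chosen-minus-rejected reward margin positive.
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

lp = [torch.randn(4) for _ in range(4)]  # dummy log-probs for a batch of 4
print(dpo_loss(*lp))
```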