Autoregressive masking with -inf, GPT-style
Implement causal (masked) self-attention, the attention used in GPT-style decoders.
Same as softmax attention, but each position can only attend to itself and earlier positions (no peeking at future tokens).
• Do NOT use F.scaled_dot_product_attention
• Position i can only attend to positions j ≤ i
• You may use torch.softmax, torch.bmm, torch.triu
Implement the function below. Use only basic PyTorch operations.
Use this code to debug before submitting.
Try solving it yourself first! Click below to reveal the solution.
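The full solution isn't shown here, but one way to satisfy the constraints above is sketched below. It scales the dot-product scores, builds a strictly upper-triangular boolean mask with torch.triu, fills future positions with -inf so softmax assigns them zero weight, and combines values with torch.bmm. The function name `causal_attention` and its (batch, seq_len, d) signature are assumptions, since the page's stub isn't reproduced here.

```python
import torch

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d) tensors.
    d = q.size(-1)
    # Scaled dot-product scores: (batch, seq_len, seq_len).
    scores = torch.bmm(q, k.transpose(1, 2)) / d ** 0.5
    # True above the diagonal (j > i) marks future positions.
    seq_len = q.size(1)
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    # -inf scores become exactly 0 after softmax, so position i
    # only attends to positions j <= i.
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.bmm(weights, v)
```

A quick way to sanity-check a solution like this is to perturb the last token of k and v and verify that the outputs at all earlier positions are unchanged, which is exactly the causality property the mask enforces.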
For interactive practice with auto-grading, run TorchCode locally: pip install torch-judge, then use check("causal_attention")