
Scaled Dot-Product Attention

softmax(QK^T/sqrt(d_k))V, the foundation

Difficulty: Hard · Topic: Attention

Problem Description

Implement the core attention mechanism used in Transformers.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
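Why the $\sqrt{d_k}$ in the denominator? For random $d_k$-dimensional queries and keys with unit-variance entries, the dot product $q \cdot k$ has variance $d_k$, so unscaled scores grow with the key dimension and push softmax toward near-one-hot outputs (killing gradients). A quick empirical sketch of this (not part of the required solution):

```python
import torch

torch.manual_seed(0)
d_k = 256
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)

raw = (q * k).sum(-1)        # unscaled dot products
scaled = raw / d_k ** 0.5    # scaled as in the attention formula

print(raw.std().item())      # roughly sqrt(d_k) = 16
print(scaled.std().item())   # roughly 1
```

Dividing by $\sqrt{d_k}$ restores unit variance, keeping softmax in a regime where gradients flow.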

Signature

```python
def scaled_dot_product_attention(
    Q: torch.Tensor,  # (batch, seq_q, d_k)
    K: torch.Tensor,  # (batch, seq_k, d_k)
    V: torch.Tensor,  # (batch, seq_k, d_v)
) -> torch.Tensor:    # (batch, seq_q, d_v)
    ...
```

Rules

• Do NOT use F.scaled_dot_product_attention

• You may use torch.softmax and torch.bmm

• Must support autograd

• Must handle cross-attention (seq_q ≠ seq_k)

Template

Implement the function below. Use only basic PyTorch operations.

```python
# ✏️ YOUR IMPLEMENTATION HERE
def scaled_dot_product_attention(Q, K, V):
    pass  # Replace this
```

Test Your Implementation

Use this code to debug before submitting.

```python
# 🧪 Debug
import torch

torch.manual_seed(42)
Q = torch.randn(2, 4, 8)
K = torch.randn(2, 4, 8)
V = torch.randn(2, 4, 8)
out = scaled_dot_product_attention(Q, K, V)
print("Output shape:", out.shape)                  # should be (2, 4, 8)
print("Has NaN? ", torch.isnan(out).any().item())  # should be False
print("Has Inf? ", torch.isinf(out).any().item())  # should be False

# Cross-attention: seq_q != seq_k
Q2 = torch.randn(1, 3, 16)
K2 = torch.randn(1, 5, 16)
V2 = torch.randn(1, 5, 32)
out2 = scaled_dot_product_attention(Q2, K2, V2)
print("Cross-attn shape:", out2.shape)             # should be (1, 3, 32)
```

Reference Solution

Try solving it yourself before reading the solution below.

```python
# ✅ SOLUTION
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)                                           # key dimension
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)  # (batch, seq_q, seq_k)
    weights = torch.softmax(scores, dim=-1)                    # rows sum to 1
    return torch.bmm(weights, V)                               # (batch, seq_q, d_v)
```
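The built-in is off-limits in your submission, but it is fair game for verifying your work. A sketch that checks the solution against `torch.nn.functional.scaled_dot_product_attention` (assumes a PyTorch version that ships it, 2.0+) and confirms that gradients flow:

```python
import math
import torch
import torch.nn.functional as F

# Reference solution from above
def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return torch.bmm(weights, V)

torch.manual_seed(0)
Q = torch.randn(2, 3, 16, requires_grad=True)  # cross-attention shapes
K = torch.randn(2, 5, 16)
V = torch.randn(2, 5, 32)

ours = scaled_dot_product_attention(Q, K, V)
ref = F.scaled_dot_product_attention(Q, K, V)
print(torch.allclose(ours, ref, atol=1e-5))  # expect True

ours.sum().backward()  # autograd flows through softmax and bmm
print(Q.grad.shape)    # same shape as Q
```

This exercises both rules at once: numerical agreement with the reference kernel and autograd support.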

Tips

Run Locally

For interactive practice with auto-grading, run TorchCode locally:
`pip install torch-judge`, then use `check("attention")`.
