
Scaled Dot-Product Attention

softmax(QK^T/sqrt(d_k))V, the foundation

Difficulty: Hard · Topic: Attention

Problem Description

Implement the core attention mechanism used in Transformers.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
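Why the $\sqrt{d_k}$ in the denominator? For random $d_k$-dimensional queries and keys with unit-variance entries, the dot product $q \cdot k$ has variance $d_k$, so unscaled scores grow with the key dimension and push softmax toward near-one-hot outputs (killing gradients). A quick empirical sketch of this (not part of the required solution):

```python
import torch

torch.manual_seed(0)
d_k = 256
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)

raw = (q * k).sum(-1)        # unscaled dot products
scaled = raw / d_k ** 0.5    # scaled as in the attention formula

print(raw.std().item())      # roughly sqrt(d_k) = 16
print(scaled.std().item())   # roughly 1
```

Dividing by $\sqrt{d_k}$ restores unit variance, keeping softmax in a regime where gradients flow.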

Signature

```python
def scaled_dot_product_attention(
    Q: torch.Tensor,  # (batch, seq_q, d_k)
    K: torch.Tensor,  # (batch, seq_k, d_k)
    V: torch.Tensor,  # (batch, seq_k, d_v)
) -> torch.Tensor:    # (batch, seq_q, d_v)
    ...
```

Rules

• Do NOT use F.scaled_dot_product_attention

• You may use torch.softmax and torch.bmm

• Must support autograd

• Must handle cross-attention (seq_q ≠ seq_k)

Template

Implement the function below. Use only basic PyTorch operations.

```python
# ✏️ YOUR IMPLEMENTATION HERE
def scaled_dot_product_attention(Q, K, V):
    pass  # Replace this
```

Test Your Implementation

Use this code to debug before submitting.

```python
# 🧪 Debug
import torch

torch.manual_seed(42)
Q = torch.randn(2, 4, 8)
K = torch.randn(2, 4, 8)
V = torch.randn(2, 4, 8)
out = scaled_dot_product_attention(Q, K, V)
print("Output shape:", out.shape)                  # should be (2, 4, 8)
print("Has NaN? ", torch.isnan(out).any().item())  # should be False
print("Has Inf? ", torch.isinf(out).any().item())  # should be False

# Cross-attention: seq_q != seq_k
Q2 = torch.randn(1, 3, 16)
K2 = torch.randn(1, 5, 16)
V2 = torch.randn(1, 5, 32)
out2 = scaled_dot_product_attention(Q2, K2, V2)
print("Cross-attn shape:", out2.shape)             # should be (1, 3, 32)
```

Reference Solution

Try solving it yourself before reading the solution below.

```python
# ✅ SOLUTION
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)                                           # key dimension
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)  # (batch, seq_q, seq_k)
    weights = torch.softmax(scores, dim=-1)                    # rows sum to 1
    return torch.bmm(weights, V)                               # (batch, seq_q, d_v)
```
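The built-in is off-limits in your submission, but it is fair game for verifying your work. A sketch that checks the solution against `torch.nn.functional.scaled_dot_product_attention` (assumes a PyTorch version that ships it, 2.0+) and confirms that gradients flow:

```python
import math
import torch
import torch.nn.functional as F

# Reference solution from above
def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return torch.bmm(weights, V)

torch.manual_seed(0)
Q = torch.randn(2, 3, 16, requires_grad=True)  # cross-attention shapes
K = torch.randn(2, 5, 16)
V = torch.randn(2, 5, 32)

ours = scaled_dot_product_attention(Q, K, V)
ref = F.scaled_dot_product_attention(Q, K, V)
print(torch.allclose(ours, ref, atol=1e-5))  # expect True

ours.sum().backward()  # autograd flows through softmax and bmm
print(Q.grad.shape)    # same shape as Q
```

This exercises both rules at once: numerical agreement with the reference kernel and autograd support.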

Tips

Run Locally

For interactive practice with auto-grading, run TorchCode locally:
`pip install torch-judge`, then use `check("attention")`.
