
DPO Loss

Direct preference optimization, alignment training

Difficulty: Hard · Advanced

Problem Description

Implement the Direct Preference Optimization (DPO) loss, a standard loss for aligning LLMs with human preference data.

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)$$
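Here $y_w$ and $y_l$ are the chosen (preferred) and rejected (dispreferred) completions for a prompt $x$, $\pi_\theta$ is the policy being trained, and $\pi_{\text{ref}}$ is a frozen reference model. The sigmoid comes from a Bradley-Terry preference model over DPO's implicit reward, the $\beta$-scaled log-ratio of policy to reference:

$$r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}, \qquad \mathcal{L}_{\text{DPO}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$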

Signature

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1) -> Tensor:
    # All inputs: (B,) log-probabilities
    # Returns: scalar loss

Template

Implement the function below. Use only basic PyTorch operations.

# ✏️ YOUR IMPLEMENTATION HERE
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # -log(sigmoid(beta * (chosen_reward - rejected_reward)))
    pass
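One hint beyond the template: compute the log-sigmoid directly rather than composing log and sigmoid, since sigmoid underflows to zero for large negative arguments. A minimal illustration (not part of the graded template):

import torch
import torch.nn.functional as F

x = torch.tensor([-200.0])
print(torch.log(torch.sigmoid(x)))  # tensor([-inf]): sigmoid underflows to 0 in float32
print(F.logsigmoid(x))              # tensor([-200.]): numerically stable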

Test Your Implementation

Use this code to debug before submitting.

# 🧪 Debug
import torch

chosen = torch.tensor([0.0, 0.0])
rejected = torch.tensor([-5.0, -5.0])
ref_c = torch.tensor([-1.0, -1.0])
ref_r = torch.tensor([-1.0, -1.0])
print('Loss:', dpo_loss(chosen, rejected, ref_c, ref_r, beta=0.1).item())
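Sanity check: the chosen log-ratio is 0 - (-1) = 1 and the rejected log-ratio is -5 - (-1) = -4, so the margin is 5 and the loss should be -log sigma(0.1 * 5) = log(1 + e^(-0.5)) ≈ 0.474.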

Reference Solution

Try solving it yourself first! The reference solution follows below.

# ✅ SOLUTION
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratio of policy to reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference loss on the reward margin, averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
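In practice the (B,) inputs are sequence-level log-probabilities summed over response tokens. A minimal sketch of that step, assuming per-token logits of shape (B, T, V) and labels of shape (B, T) with -100 marking prompt and padding positions; the helper name and masking convention here are illustrative, not part of the graded problem:

import torch

def sequence_logps(logits, labels, ignore_index=-100):
    # Mask out positions that should not contribute (prompt tokens, padding)
    mask = labels != ignore_index
    safe_labels = labels.clamp(min=0)  # avoid gather() on the ignore index
    logps = torch.log_softmax(logits, dim=-1)                              # (B, T, V)
    token_logps = logps.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return (token_logps * mask).sum(dim=-1)                                # (B,)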

Tips

Run Locally

For interactive practice with auto-grading, run TorchCode locally:

pip install torch-judge

then use check("dpo_loss") to grade your solution.

Key Concepts

Direct preference optimization, alignment training
