
DPO Loss

Direct preference optimization, alignment training

Difficulty: Hard · Advanced

Problem Description

Implement the Direct Preference Optimization (DPO) loss, a standard loss for aligning LLMs with human preference data.

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)$$
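Here $y_w$ and $y_l$ are the chosen (preferred) and rejected (dispreferred) completions for a prompt $x$, $\pi_\theta$ is the policy being trained, and $\pi_{\text{ref}}$ is a frozen reference model. The sigmoid comes from a Bradley-Terry preference model over DPO's implicit reward, the $\beta$-scaled log-ratio of policy to reference:

$$r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}, \qquad \mathcal{L}_{\text{DPO}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$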

Signature

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1) -> Tensor:
    # All inputs: (B,) log-probabilities
    # Returns: scalar loss

Template

Implement the function below. Use only basic PyTorch operations.

# ✏️ YOUR IMPLEMENTATION HERE
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # -log(sigmoid(beta * (chosen_reward - rejected_reward)))
    pass
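One hint beyond the template: compute the log-sigmoid directly rather than composing log and sigmoid, since sigmoid underflows to zero for large negative arguments. A minimal illustration (not part of the graded template):

import torch
import torch.nn.functional as F

x = torch.tensor([-200.0])
print(torch.log(torch.sigmoid(x)))  # tensor([-inf]): sigmoid underflows to 0 in float32
print(F.logsigmoid(x))              # tensor([-200.]): numerically stable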

Test Your Implementation

Use this code to debug before submitting.

# 🧪 Debug
import torch

chosen = torch.tensor([0.0, 0.0])
rejected = torch.tensor([-5.0, -5.0])
ref_c = torch.tensor([-1.0, -1.0])
ref_r = torch.tensor([-1.0, -1.0])
print('Loss:', dpo_loss(chosen, rejected, ref_c, ref_r, beta=0.1).item())
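Sanity check: the chosen log-ratio is 0 - (-1) = 1 and the rejected log-ratio is -5 - (-1) = -4, so the margin is 5 and the loss should be -log sigma(0.1 * 5) = log(1 + e^(-0.5)) ≈ 0.474.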

Reference Solution

Try solving it yourself first! The reference solution follows below.

# ✅ SOLUTION
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratio of policy to reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference loss on the reward margin, averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
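In practice the (B,) inputs are sequence-level log-probabilities summed over response tokens. A minimal sketch of that step, assuming per-token logits of shape (B, T, V) and labels of shape (B, T) with -100 marking prompt and padding positions; the helper name and masking convention here are illustrative, not part of the graded problem:

import torch

def sequence_logps(logits, labels, ignore_index=-100):
    # Mask out positions that should not contribute (prompt tokens, padding)
    mask = labels != ignore_index
    safe_labels = labels.clamp(min=0)  # avoid gather() on the ignore index
    logps = torch.log_softmax(logits, dim=-1)                              # (B, T, V)
    token_logps = logps.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return (token_logps * mask).sum(dim=-1)                                # (B,)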

Tips

Run Locally

For interactive practice with auto-grading, run TorchCode locally:

pip install torch-judge

then use check("dpo_loss") to grade your solution.

Key Concepts

Direct preference optimization, alignment training
