
GRPO Loss

Group relative policy optimization, RLAIF

Hard · Advanced

Problem Description

Implement the Group Relative Policy Optimization (GRPO) loss, a group-wise, baseline-subtracted REINFORCE objective commonly used in RLAIF (reinforcement learning from AI feedback).

Given a batch of log-probabilities, scalar rewards, and group ids (one group per prompt), define the within-group normalized advantages:

$$A_i = \frac{r_i - \bar r_{g(i)}}{\text{std}_{g(i)} + \epsilon}$$

where \(\bar r_{g(i)}\) and \(\text{std}_{g(i)}\) are the mean and standard deviation of rewards in the group of example \(i\).
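
For example, a group with rewards \(r = (1.0, 0.8)\) has mean \(\bar r = 0.9\) and population standard deviation \(0.1\), so (ignoring the small \(\epsilon\)) the normalized advantages are

$$A_1 = \frac{1.0 - 0.9}{0.1} = 1, \qquad A_2 = \frac{0.8 - 0.9}{0.1} = -1.$$

Note that the reference solution uses the population (biased, unbiased=False) standard deviation, so a group of size \(n\) divides by \(n\) rather than \(n - 1\).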

The GRPO loss is then the negative advantage-weighted log-probability:

$$\mathcal{L}_{\text{GRPO}} = -\mathbb{E}_i \big[\,\text{stop\_grad}(A_i)\, \log \pi_\theta(y_i)\big].$$
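
Because of the stop-gradient, the advantages act as fixed per-example weights and gradients flow only through the log-probabilities. In PyTorch this is one line; a minimal sketch, assuming advantages is the (B,) tensor of \(A_i\) computed above:

# adv.detach() blocks gradients through A_i; gradients flow only through logps.
loss = -(advantages.detach() * logps).mean()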

Signature

from torch import Tensor

def grpo_loss(logps: Tensor, rewards: Tensor, group_ids: Tensor, eps: float = 1e-5) -> Tensor:
    """GRPO loss over a batch.

    logps:     (B,) policy log-probs for each sampled response
    rewards:   (B,) scalar rewards for each response
    group_ids: (B,) integers, same id = same prompt/group
    returns:   scalar loss (Tensor)
    """

Template

Implement the function below. Use only basic PyTorch operations.

# โœ๏ธ YOUR IMPLEMENTATION HERE from torch import Tensor def grpo_loss(logps: Tensor, rewards: Tensor, group_ids: Tensor, eps: float = 1e-5) -> Tensor: pass # compute normalized advantages per group and return -mean(adv.detach() * logps)

Test Your Implementation

Use this code to debug before submitting.

# 🧪 Debug
import torch

logps = torch.tensor([0.0, -0.5, -1.0, -1.5])
rewards = torch.tensor([1.0, 0.8, 0.2, 0.0])
group_ids = torch.tensor([0, 0, 1, 1])
print('Loss:', grpo_loss(logps, rewards, group_ids).item())
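
If your implementation matches the reference below, the printed loss should be approximately -0.25: each group has two samples with advantages of roughly \(\pm 1\), so the loss is \(-\big(1 \cdot 0.0 + (-1)(-0.5) + 1 \cdot (-1.0) + (-1)(-1.5)\big)/4 \approx -0.25\).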

Reference Solution

Try solving it yourself first! The reference solution is below.

# ✅ SOLUTION
import torch
from torch import Tensor

def grpo_loss(logps: Tensor, rewards: Tensor, group_ids: Tensor, eps: float = 1e-5) -> Tensor:
    """Group Relative Policy Optimization (GRPO) loss.

    logps:     (B,) policy log-probs for each sampled response
    rewards:   (B,) scalar rewards for each response
    group_ids: (B,) integers, same id = same prompt/group
    returns:   scalar loss (Tensor)
    """
    # Compute per-group normalized advantages A_i.
    unique_ids = group_ids.unique()
    advantages = torch.empty_like(rewards)
    for gid in unique_ids:
        mask = group_ids == gid
        r_g = rewards[mask]
        mean_g = r_g.mean()
        std_g = r_g.std(unbiased=False)  # population std
        advantages[mask] = (r_g - mean_g) / (std_g + eps)
    # Stop gradient through the advantages: they act as fixed weights.
    advantages_detached = advantages.detach()
    # GRPO objective: -E[A_i * log pi_theta(y_i)]
    return -(advantages_detached * logps).mean()
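
The per-group loop is easy to read but runs in Python once per group. The same statistics can also be computed without a loop using scatter-style ops; the sketch below is an alternative of my own (the name grpo_loss_vectorized is hypothetical, and it assumes group ids are contiguous integers 0..G-1):

import torch
from torch import Tensor

def grpo_loss_vectorized(logps: Tensor, rewards: Tensor, group_ids: Tensor, eps: float = 1e-5) -> Tensor:
    # Hypothetical loop-free variant; assumes group_ids are contiguous ints 0..G-1.
    G = int(group_ids.max().item()) + 1
    counts = rewards.new_zeros(G).index_add_(0, group_ids, torch.ones_like(rewards))
    means = rewards.new_zeros(G).index_add_(0, group_ids, rewards) / counts
    centered = rewards - means[group_ids]
    var = rewards.new_zeros(G).index_add_(0, group_ids, centered ** 2) / counts
    adv = centered / (var.sqrt()[group_ids] + eps)  # population std, matching unbiased=False
    return -(adv.detach() * logps).mean()

Both versions agree: index_add_ accumulates per-group counts, sums, and squared deviations, which reproduces the population mean and standard deviation of the loop.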

Tips

Run Locally

For interactive practice with auto-grading, run TorchCode locally:
pip install torch-judge, then use check("grpo_loss")
