TL;DR
- GRPO (DeepSeek, 2024, introduced in DeepSeekMath, arXiv:2402.03300) replaces PPO's learned value network with a group-based baseline computed from multiple sampled responses to the same prompt.
- For each prompt, sample G responses; score them; compute advantages as standardised rewards within the group (subtract mean, divide by standard deviation).
- Eliminates the value network, one of the four models in PPO-based RLHF (policy, reference, reward, value), cutting memory and complexity. Tends to be better-behaved than PPO when rewards are sparse or skewed.
- Used by DeepSeekMath, DeepSeek-V3 and most prominently in DeepSeek-R1 — the open-weights reasoning model that demonstrated GRPO's strength on verifiable-reward training.
Motivation#
Standard PPO needs an estimate of the advantage A_t at each step — how much better the action taken was than the average. The advantage is computed via a value network V_φ(s) trained alongside the policy. The value network adds memory (a fourth model in RLHF), training instability (two networks chasing a moving target), and engineering complexity.
On many RLHF tasks the value network's predictions are noisy and provide weak signal. GRPO asks: can we estimate the advantage directly from rewards, without a learned value network?
The Group Baseline#
GRPO's answer: sample G responses to the same prompt, score them all with the reward model, and define the advantage of response i as A_i = (r_i − mean(r)) / std(r), where the mean and standard deviation are over the group.
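This is a pure Monte Carlo estimate: the group mean stands in for the baseline that PPO's value network would otherwise predict. Responses better than the group average get a positive advantage; worse ones get a negative advantage. The standardisation gives a roughly unit-variance signal regardless of the absolute reward scale.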
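To make the scale-invariance claim concrete, here is a minimal sketch in plain Python (no tensor library, so the arithmetic stays visible). The helper name `group_advantages` and the reward values are illustrative assumptions, and the sample standard deviation (n − 1 denominator) is one reasonable choice rather than a documented requirement.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample standard deviation (n - 1 denominator)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Same relative quality, rewards reported on two very different scales.
small_scale = group_advantages([1.0, 2.0, 3.0, 4.0])
large_scale = group_advantages([100.0, 200.0, 300.0, 400.0])

# The two columns should agree up to the tiny effect of eps.
for a, b in zip(small_scale, large_scale):
    print(f"{a:+.3f}  {b:+.3f}")
```

Because both groups differ only by a constant factor, the standardised advantages come out essentially identical, which is why the policy update does not need to know the reward model's units.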
Memory cost drops by one model (no value network). The price is sampling G responses per prompt, a G× increase in rollout compute per gradient step, which is usually considered acceptable because inference is cheap relative to training in modern stacks.
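
```python
import torch

def grpo_advantage(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards shape: (batch, G), one row of G response rewards per prompt
    mean = rewards.mean(dim=-1, keepdim=True)
    # eps avoids division by zero when all rewards in a group are identical
    std = rewards.std(dim=-1, keepdim=True) + eps
    return (rewards - mean) / std
```

The Full Update#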
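The policy loss combines GRPO advantages with PPO's clipped surrogate objective and a KL penalty against the reference model. In the DeepSeekMath formulation, for a prompt q with sampled responses o_1, …, o_G, the objective to maximise is:

$$
J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( \min\big( \rho_{i,t}\, A_i,\ \mathrm{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i \big) - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big) \right],
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\mathrm{old}}(o_{i,t} \mid q, o_{i,<t})}
$$

Here A_i is the group-relative advantage of response i (shared by all of its tokens), π_old is the rollout policy, π_ref is the reference model, ε is the clip ratio and β is the KL coefficient.
The KL penalty is computed token-by-token in DeepSeekMath; later work (DeepSeek-V3, DeepSeek-R1) uses sequence-level KL with similar effect.
One training iteration then proceeds as follows: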
- For each prompt, sample G responses from the current policy (or a slightly stale snapshot).
- Score each response with the reward model.
- Compute group-relative advantages.
- Update the policy with the PPO-style clipped objective using those advantages, plus a KL penalty against the reference.
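The last step can be sketched in plain Python to show the arithmetic of a single update target. This is an illustration under stated assumptions, not a training implementation: there is no autograd, the function names and the `group` data layout are hypothetical, and the KL term uses the non-negative estimator r − log r − 1 with r = π_ref/π_θ, the per-token form given in the DeepSeekMath paper. The default `clip_eps` and `beta` are picked from the ranges quoted in this article.

```python
import math

def token_objective(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, beta=0.04):
    """Per-token term: clipped surrogate minus beta times a KL estimate."""
    ratio = math.exp(logp_new - logp_old)          # pi_theta / pi_old
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    log_r = logp_ref - logp_new                    # log(pi_ref / pi_theta)
    kl = math.exp(log_r) - log_r - 1.0             # non-negative estimator
    return surrogate - beta * kl

def grpo_objective(group, clip_eps=0.2, beta=0.04):
    """Average over tokens within each response, then over the G responses.

    `group` is a list of (advantage, tokens) pairs, where `tokens` is a list
    of (logp_new, logp_old, logp_ref) triples for one sampled response.
    """
    per_response = []
    for advantage, tokens in group:
        terms = [token_objective(n, o, r, advantage, clip_eps, beta)
                 for n, o, r in tokens]
        per_response.append(sum(terms) / len(terms))
    return sum(per_response) / len(per_response)

# Two responses to one prompt; all three policies agree on every token,
# so each token's term reduces to the response's advantage.
group = [
    (1.0, [(-1.0, -1.0, -1.0)] * 3),
    (-1.0, [(-2.0, -2.0, -2.0)] * 2),
]
print(grpo_objective(group))
```

A real trainer would compute the same quantity on tensors and minimise its negative. The min with the clipped term means a token stops contributing once its ratio has moved past the clip boundary in the direction the advantage favours.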
Why It Worked So Well for Reasoning#
DeepSeek-R1 (January 2025) was the first open-weights model to demonstrate near-frontier reasoning via large-scale RL with verifiable rewards. GRPO was the optimiser at the heart of that pipeline: programmatic rewards (math correctness, code unit tests) replaced the reward model, and GRPO turned those rewards into a stable training signal.
The choice was load-bearing. Verifiable-reward training produces sparse, all-or-nothing rewards — solving a math problem is binary. A value network trained on this signal is noisy. GRPO's group baseline naturally normalises this: in a group of G attempts, the relative ranking is what matters, not the absolute reward.
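A small worked example shows what the group baseline does with binary rewards. It is a sketch in plain Python; the helper `group_advantages`, the group size of four and the use of the sample standard deviation are assumptions made for illustration.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample standard deviation (n - 1 denominator)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary pass/fail rewards for G = 4 attempts at the same problem.
one_success = group_advantages([1.0, 0.0, 0.0, 0.0])
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = group_advantages([1.0, 1.0, 1.0, 1.0])

print("one success:", [round(a, 3) for a in one_success])
print("all fail:   ", all_fail)
print("all pass:   ", all_pass)
```

With one success in four, the mean is 0.25 and the sample standard deviation is 0.5, so the successful attempt gets an advantage of about +1.5 and each failure about −0.5. When every attempt in the group receives the same reward, all advantages are zero; it follows from the formula that such a group contributes no policy-gradient signal for that prompt.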
If you are doing RL on verifiable rewards (math, code tests, agent task success), GRPO is now the default starting point. PPO with a learned value network tends to struggle when rewards are sparse and binary, because the critic has little signal to fit.
Hyperparameters and Caveats#
Typical GRPO settings use G = 8–64 samples per prompt, a KL coefficient β ≈ 0.001–0.04, and a PPO clip ratio ε = 0.2. Group size G trades variance reduction against compute: a larger G gives a less noisy baseline but needs more rollouts per update.
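As a rough sketch, these knobs can be collected in a small settings object. `GrpoSettings` is a hypothetical class written for this article and is not the configuration class of any library; the defaults are single values picked from the ranges above, and `prompts_per_step` is an illustrative number that does not come from any paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GrpoSettings:
    """Hypothetical container for the hyperparameters discussed above."""
    group_size: int = 16        # G: sampled responses per prompt
    kl_beta: float = 0.04       # weight of the KL penalty
    clip_eps: float = 0.2       # PPO clip ratio
    prompts_per_step: int = 32  # illustrative value only

    def __post_init__(self):
        if self.group_size < 2:
            raise ValueError("group_size must be >= 2 to form a group baseline")
        if self.kl_beta < 0:
            raise ValueError("kl_beta must be non-negative")
        if not 0 < self.clip_eps < 1:
            raise ValueError("clip_eps must lie in (0, 1)")

    @property
    def rollouts_per_step(self) -> int:
        # Rollout cost grows linearly with G.
        return self.prompts_per_step * self.group_size

for g in (8, 16, 64):
    print(g, GrpoSettings(group_size=g).rollouts_per_step)
```

The `rollouts_per_step` property makes the compute side of the trade-off explicit: moving from G = 8 to G = 64 multiplies the number of sampled responses per update by eight at a fixed prompt batch.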
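GRPO inherits PPO's sensitivity to importance sampling: if the policy drifts too far from the rollout policy within a single update, many token ratios hit the clip boundary and stop contributing gradient, so the clipped surrogate makes little progress. Refreshing the rollout policy snapshot regularly and monitoring KL are needed in practice.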
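One way to watch for this drift is to log two cheap statistics per update, sketched below in plain Python. The function name `drift_stats` is hypothetical, the clip fraction is a proxy that ignores the sign of the advantage, and the KL estimate uses the non-negative form r − 1 − log r with r = π_θ/π_old on tokens sampled by the rollout policy. Any alert thresholds would have to be tuned per setup.

```python
import math

def drift_stats(logp_new, logp_old, clip_eps=0.2):
    """Return (clip_fraction, approx_kl) for tokens sampled by the rollout policy.

    logp_new / logp_old: per-token log-probs of the sampled tokens under the
    current policy and the rollout policy, as flat lists of equal length.
    """
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    ratios = [math.exp(lr) for lr in log_ratios]
    outside = sum(1 for r in ratios
                  if r < 1.0 - clip_eps or r > 1.0 + clip_eps)
    clip_fraction = outside / len(ratios)
    # Estimator of KL(pi_old || pi_theta): mean of r - 1 - log r.
    approx_kl = sum(r - 1.0 - lr for r, lr in zip(ratios, log_ratios)) / len(ratios)
    return clip_fraction, approx_kl

clip_fraction, approx_kl = drift_stats(
    logp_new=[-1.0, -0.5, -2.0],
    logp_old=[-1.0, -0.6, -1.0],
)
print(f"clip fraction: {clip_fraction:.2f}, approx KL: {approx_kl:.4f}")
```

A rising clip fraction or KL estimate across the inner updates on one batch of rollouts suggests the rollout snapshot is stale and should be refreshed before continuing.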
Adoption#
Outside DeepSeek's own family, GRPO has been adopted broadly: Qwen-Math, Qwen3, Llama-Math fine-tunes, and many open-source reasoning-model recipes. Hugging Face's TRL library ships a GRPO trainer; verl and other RL frameworks support it natively. As of 2026 it is arguably the dominant RL algorithm for LLM post-training on verifiable rewards.