Group Relative Policy Optimization (GRPO)
PPO without the critic: DeepSeek's key innovation
GRPO (Shao et al., 2024) eliminates the critic (value function) from PPO by estimating the baseline from the average reward of a group of responses generated for the same prompt. This removes one of the two trained models from PPO's four-model setup, saving roughly 25–50% VRAM. One caveat: Dr. GRPO later showed that dividing the advantage by std(r) overweights prompts whose rewards barely vary (very easy or very hard ones), and that GRPO's per-response length normalization biases response length, so later work often drops both terms.
PPO: 4 models
Policy (trained) + Reference (frozen) + Reward model (frozen) + Critic (trained)
The critic predicts future reward from each intermediate state, a hard function to learn for text. This setup requires 80 GB+ of VRAM for a 7B model.
GRPO: 3 models (no critic)
Policy (trained) + Reference (frozen) + Reward model (frozen)
The baseline is estimated from the group of responses themselves. No learned value function needed. Roughly 25–50% less VRAM than PPO.
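As a rough check on the 25–50% figure, the sketch below counts only model-state memory. Its assumptions are not from the source: all four models have 7B parameters, trained models use mixed-precision Adam at about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), and frozen models hold fp16 weights at 2 bytes per parameter. Activations, KV caches, sharding, offloading, and parameter-efficient tuning are ignored, so the absolute numbers are illustrative only; the ratio is the point.

```python
# Assumed per-parameter costs (not from the source):
BYTES_TRAINED = 16  # fp16 weights + fp16 grads + fp32 master weights, momentum, variance
BYTES_FROZEN = 2    # fp16 weights only, no gradients or optimizer state


def setup_gb(params_billion, n_trained, n_frozen):
    """Model-state memory in GB (1 GB = 1e9 bytes) for equally sized models."""
    trained = n_trained * params_billion * BYTES_TRAINED
    frozen = n_frozen * params_billion * BYTES_FROZEN
    return trained + frozen


ppo_gb = setup_gb(7, n_trained=2, n_frozen=2)   # policy + critic, reference + reward model
grpo_gb = setup_gb(7, n_trained=1, n_frozen=2)  # policy, reference + reward model
saving = 1 - grpo_gb / ppo_gb

# Under these assumptions: 252 GB vs 140 GB, about 44% less.
print(f"PPO: {ppo_gb} GB, GRPO: {grpo_gb} GB, saving: {saving:.0%}")
```

A smaller critic or a lighter optimizer shrinks the saving, which may be why the quoted range extends down to 25%.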
GRPO algorithm — step by step
Step 1: Group sampling
For each prompt q, generate G outputs {o₁, o₂, …, o_G} from the current policy. G = 64 in DeepSeekMath. A larger G gives a more stable advantage estimate.

Step 2: Score each output
Run each output through the reward model or verifier to get rewards r₁, r₂, …, r_G. For math: 1 if the answer is correct, 0 otherwise.

Step 3: Compute advantage
Â_i = (r_i − mean(r)) / std(r)
Â > 0: above average → reinforce. Â < 0: below average → suppress. Â ≈ 0: average → no update.

Step 4: Clipped update
Maximize L = min(ratio × Â, clip(ratio, 1−ε, 1+ε) × Â) − β × KL(π_θ ‖ π_ref)
where ratio = π_θ / π_old, ε = 0.2, β = 0.04.
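Steps 3 and 4 can be sketched in a few lines of plain Python. This is a simplified illustration, not the DeepSeekMath implementation: it works on one ratio per response rather than per token, takes the KL term as a precomputed number, uses the population standard deviation, and adds a small `eps` to avoid dividing by zero when every reward in the group is equal. The function names, `eps`, and the `normalize_std` switch (which gives the variant without the std division) are all assumptions made for this example.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, normalize_std=True, eps=1e-8):
    """Step 3: A_i = (r_i - mean(r)) / std(r) over one prompt's group.

    normalize_std=False skips the std division.
    eps is an assumed guard for groups where all rewards are equal.
    """
    mean = fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = pstdev(rewards)  # population std; an assumption of this sketch
    return [c / (std + eps) for c in centered]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """Step 4, one sample: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_objective(ratios, advantages, kl, clip_eps=0.2, beta=0.04):
    """Mean clipped surrogate over the group minus beta * KL (to be maximized)."""
    terms = [clipped_term(r, a, clip_eps) for r, a in zip(ratios, advantages)]
    return fmean(terms) - beta * kl


# Four responses to one prompt with a binary correctness reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
ratios = [1.5, 0.5, 1.0, 1.1]  # hypothetical pi_theta / pi_old values
print(advantages)
print(grpo_objective(ratios, advantages, kl=0.01))
```

Clipping only limits the gain from moving further in the favored direction: a ratio of 1.5 with a positive advantage is capped at 1.2, and a ratio of 0.5 with a negative advantage is treated as 0.8.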
Why group sampling works as a baseline
A value function must predict future reward from intermediate text, which is notoriously hard. GRPO sidesteps this by observing multiple complete trajectories for the same prompt. The group's average reward is a Monte Carlo estimate of the expected reward: simpler, more direct, and needing no separate network. Rule of thumb: G ≥ 16; DeepSeekMath uses 64.
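To see why a larger group gives a steadier baseline, consider a binary reward where each sampled response is correct with probability p. Assuming the G samples are independent, the standard error of the group mean is sqrt(p(1 − p) / G). The helper below is a hypothetical illustration of that formula, not part of any GRPO implementation.

```python
import math


def baseline_standard_error(p_correct, group_size):
    """Standard error of the group-mean reward for a 0/1 reward.

    Assumes group_size independent samples, each correct with
    probability p_correct.
    """
    return math.sqrt(p_correct * (1.0 - p_correct) / group_size)


# Worst case for a binary reward is p = 0.5.
for g in (4, 16, 64):
    print(g, baseline_standard_error(0.5, g))
# G = 4 gives 0.25, G = 16 gives 0.125, G = 64 gives 0.0625
```

Because the noise shrinks with the square root of G, halving it costs four times as many samples.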
DeepSeekMath hyperparameters
G = 64 outputs per prompt · Batch = 1,024 · Learning rate = 1e-6 · KL coefficient = 0.04 · Max sequence length = 1,024 tokens · Single policy update per exploration stage · Training data: GSM8K + MATH (chain-of-thought format).
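For reference, the same settings can be collected in a small config object. The class and field names below are hypothetical and do not correspond to any particular training library; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPOConfig:
    """Hypothetical container for the DeepSeekMath GRPO settings listed above."""
    group_size: int = 64                   # G, outputs sampled per prompt
    batch_size: int = 1024
    learning_rate: float = 1e-6
    kl_coef: float = 0.04                  # beta in the objective
    max_seq_len: int = 1024                # tokens
    updates_per_exploration_stage: int = 1


config = GRPOConfig()
print(config)
```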