RLHF + Reasoning — AI Learning Course

§ 01

RLHF overview

Why alignment matters

A base model is not an assistant — it just completes text

After pretraining, a model knows how to predict the next token but has no concept of "being helpful" or "avoiding harm." Ask it a question and it may continue with more questions. RLHF (Reinforcement Learning from Human Feedback) bridges this gap — teaching the model what humans actually want from it.

Pipeline: Stage 1 SFT → Stage 2 Reward model → Stage 3 RL (PPO / GRPO) → Aligned model

Stage 2 — The reward model

A neural network that predicts human preference

Human annotators are shown pairs of responses (A vs B) to the same prompt and pick the one they prefer. A reward model is trained on (prompt, chosen, rejected) triples to predict which response a human would prefer, assigning each response a scalar score. It is typically a transformer with a linear head on top that outputs that score.

Bradley-Terry formulation

The probability model behind preference learning

If response y_w is preferred over y_l, the probability of that preference is modelled as σ(r(x,y_w) − r(x,y_l)), where σ is sigmoid. The loss is: −log σ(r_w − r_l). This elegant formula says: push the reward of the chosen response higher than the rejected one. No absolute score needed — only relative preference.
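
The loss above can be sketched for a single preference pair in plain Python. The function names are illustrative, and the scalar rewards are assumed to come from some reward model that is not shown here.

```python
import math

def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """sigma(r_w - r_l): modelled probability that the chosen response wins."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigma(r_w - r_l), written as softplus(-margin) for numerical stability."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Only the difference between the two rewards enters the loss, so adding the same constant to both scores leaves it unchanged. With equal rewards the loss is log 2, and it shrinks as the chosen response pulls ahead.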

On-policy vs off-policy alignment

Two fundamentally different approaches to using preference data

On-policy (PPO, GRPO): The model generates new responses during training. The reward model scores them. The RL algorithm updates the model's weights. Expensive — inference is slow — but achieves the highest performance ceiling because the model always trains on its own current behaviour.

Off-policy (DPO, ORPO): Training uses pre-collected preference data without generating new responses during training. Simpler, faster, more stable — but the model never explores beyond the fixed dataset. Less adaptive than on-policy methods for complex reasoning tasks.

§ 02

PPO variants

Proximal Policy Optimization

The algorithm that established RLHF (2022–23)

PPO (Schulman et al. 2017) was the backbone of ChatGPT's original alignment process (InstructGPT) and defined the RLHF era of 2022–23. In the LLM setting, the "policy" is the LLM, "actions" are tokens, "states" are the generation so far, and the "reward" comes from the reward model at the end of each complete response. By 2026 the GRPO family plus verifiable rewards (RLVR) has replaced it as the default.

Then

PPO with a learned reward model + critic — the InstructGPT-era recipe

Now · June 2026

GRPO/DAPO + verifiable rewards (RLVR); human preference labels mostly audit AI judges

Four models required during PPO training

Policy π_θ: LLM being trained (weights updated)
Reference π_ref: Frozen SFT model (KL anchor)
Reward R_φ: Frozen reward model (scores full responses)
Critic V_γ: Value function (weights updated)

Policy + Critic = 2 LLM copies trained simultaneously → huge memory cost. This is what GRPO eliminates.

PPO combined objective

Three terms:
1. Clipped surrogate: maximise advantage while staying close to the old policy (clip at ε=0.2).
2. KL penalty: subtract β×KL(π_θ ‖ π_ref) to prevent reward hacking (β=0.02–0.1).
3. Value loss: train the critic to predict future rewards accurately.
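
A minimal sketch of how the three terms might be combined, using per-sample log-probabilities instead of real per-token tensors. The function names, the `value_coef` weight of 0.5, and passing the KL estimate in as a single number are all assumptions made for illustration, not details from the course.

```python
import math
from statistics import mean

def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      eps: float = 0.2) -> float:
    """Term 1: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_theta / pi_old
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def ppo_objective(logp_new, logp_old, advantages, kl_to_ref, values, returns,
                  eps=0.2, beta=0.05, value_coef=0.5):
    """Quantity to maximise: surrogate - beta * KL - value_coef * value loss.

    beta sits inside the 0.02 to 0.1 range quoted above; value_coef is an
    assumed weight for the critic's squared-error loss.
    """
    surrogate = mean([clipped_surrogate(n, o, a, eps)
                      for n, o, a in zip(logp_new, logp_old, advantages)])
    value_loss = mean([(v - r) ** 2 for v, r in zip(values, returns)])
    return surrogate - beta * kl_to_ref - value_coef * value_loss
```

The clip only bites when it would otherwise reward moving further from the old policy: with a positive advantage the gain is capped once the ratio exceeds 1+ε, while a move that makes things worse is still penalised in full.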

PPO limitations

Memory: 4 models in GPU memory simultaneously, 80 GB+ for a 7B model.
Stability: Balancing policy + critic training is notoriously tricky.
Speed: Must generate completions on-policy before each update.
Complexity: Many hyperparameters to tune simultaneously.

§ 03

DPO & Best-of-N

Best-of-N (Rejection Sampling)

The simplest alignment strategy — generate many, keep the best

Generate N responses from the model for each prompt, score all N with the reward model, keep only the highest-scoring one as training data, then fine-tune the model on these "best" responses via SFT. No RL required. Llama 2's alignment used four rounds of rejection sampling before any RL. Simple and effective for moderate alignment goals, but computationally wasteful — most generations are discarded.
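
The procedure can be sketched as follows. `generate` and `reward` are hypothetical callables standing in for the policy's sampler and the reward model, and the default `n=8` is an arbitrary illustrative choice.

```python
from typing import Callable, List, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> Tuple[str, float]:
    """Sample n responses, score each one, and return the top-scoring response."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(reward(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score

def build_sft_dataset(prompts: List[str],
                      generate: Callable[[str], str],
                      reward: Callable[[str, str], float],
                      n: int = 8) -> List[Tuple[str, str]]:
    """Keep one (prompt, best response) pair per prompt for the SFT step."""
    return [(p, best_of_n(p, generate, reward, n)[0]) for p in prompts]
```

The waste mentioned above is visible in the code: each prompt costs n generations and n reward-model calls, yet only one response survives into the fine-tuning set.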

Pipeline: Model → N responses r₁ r₂ … rN → Reward model → Best response → SFT on best

Direct Preference Optimization (DPO)

Skip the reward model entirely — optimise directly on preference pairs

DPO (Rafailov et al., Stanford, 2023) made a key mathematical observation: the KL-regularised RLHF objective has a closed-form optimal policy, so the reward can be rewritten in terms of the policy itself and no separate reward model is required. Substituting that expression into the Bradley-Terry preference model shows the reward is implicitly encoded in the log-ratio of probabilities between the policy and a reference model.

DPO loss function

L = −E[ log σ( β log(π_θ(y_w|x) / π_ref(y_w|x)) − β log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]

y_w = chosen response · y_l = rejected response · β controls deviation from reference · No reward model · No RL · SFT-style training
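
For one preference pair, the DPO loss can be sketched in plain Python as below. The inputs are assumed to be sequence-level log-probabilities (token log-probs summed over the response), and the default `beta=0.1` is an illustrative assumption rather than a value from the course.

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """-log sigma(beta * [(log pi_theta(y_w) - log pi_ref(y_w))
                          - (log pi_theta(y_l) - log pi_ref(y_l))])."""
    chosen_logratio = logp_w - ref_logp_w      # log(pi_theta / pi_ref) for y_w
    rejected_logratio = logp_l - ref_logp_l    # log(pi_theta / pi_ref) for y_l
    margin = beta * (chosen_logratio - rejected_logratio)
    # softplus(-margin) equals -log(sigmoid(margin)) and stays finite for large margins
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference both log-ratios are zero and the loss is log 2. Training lowers it by raising the chosen response's probability relative to the reference, lowering the rejected one's, or both.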

DPO advantages

No reward model to train. No RL instability. Trains like standard supervised fine-tuning. Memory efficient — only 2 models needed (policy + frozen reference). More stable and reproducible than PPO. Widely used in open-source models.

DPO limitations

Off-policy: trains on fixed preference data and never explores new responses. Quality depends entirely on the preference dataset. Generally weaker than on-policy RL on tasks that need long reasoning chains, such as math and code.

§ 04

Reasoning training

Reasoning LLMs — a new training paradigm

Teaching models to "think before they answer"

Standard RLHF teaches models to be helpful and safe. Reasoning training, known as Reinforcement Learning with Verifiable Rewards (RLVR), teaches models to solve hard problems by generating long chain-of-thought reasoning traces. The key difference: rewards come from verifiable answers (math results, code tests) rather than from a learned preference model. These rewards are binary, objective, and much harder to hack than learned reward models, though models still find exploits such as gaming unit tests or abusing format checks (reward hacking).

What changes from standard RLHF

Reward source: Verifiable ground truth — the math answer is correct or not, the code compiles and passes tests or not.
Format rewards: Additional rewards for using the correct thinking format (e.g., reasoning inside <think> tags).
Emergent behaviours: Models spontaneously develop self-correction, "aha moments," extended deliberation, and backtracking.
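
A rule-based reward of this kind might look like the sketch below. The exact-string answer check, the regular expression, and the `format_weight` of 0.1 are assumptions for illustration; real verifiers normalise math expressions or run test suites.

```python
import re

# A response counts as well-formatted if it is "<think>...</think>" followed by an answer.
THINK_RE = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the reasoning sits inside <think> tags and an answer follows, else 0.0."""
    return 1.0 if THINK_RE.match(response.strip()) else 0.0

def extract_answer(response: str) -> str:
    """Text after the last </think> tag, or the whole response if there is no tag."""
    return response.rsplit("</think>", 1)[-1].strip()

def accuracy_reward(response: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 on an exact match with the reference answer."""
    return 1.0 if extract_answer(response) == gold.strip() else 0.0

def total_reward(response: str, gold: str, format_weight: float = 0.1) -> float:
    """Correctness plus a small bonus for following the thinking format."""
    return accuracy_reward(response, gold) + format_weight * format_reward(response)
```

Keeping the format bonus small relative to the correctness reward is one way to limit the format-gaming exploits mentioned above, since a well-formatted wrong answer then earns far less than a correct one.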

DeepSeek-R1-Zero — the surprising finding

Skip SFT entirely — RL directly on base model

Starting from a base model (DeepSeek-V3-Base) with NO SFT phase, pure GRPO training on verifiable rewards produced emergent reasoning. The model learned to allocate more thinking time to harder problems, self-evaluate, and backtrack, all without being shown any chain-of-thought examples.

DeepSeek-R1 four-stage training pipeline

Stage 1: Cold-start SFT (few thousand CoT) →
Stage 2: GRPO RL (verifiable rewards) →
Stage 3: Rejection sampling (600K reasoning data) →
Stage 4: Final GRPO (helpfulness + safety) →
DeepSeek-R1

RLVR objective

J(θ) = E[ R(x,y) ] − β · KL(π_θ ‖ π_ref)

R(x,y) = verifiable reward (1 if correct, 0 if wrong) · β prevents drift from reference policy · In GRPO: advantage normalises rewards within group

Test-time compute scaling

Why reasoning training matters at all

The payoff: a reasoning-trained model can spend more thinking tokens at inference and buy accuracy with them — o1's core result. Accuracy scales with how long the model deliberates, not just how big it is. And that ability can be distilled into small models by fine-tuning them on the big model's reasoning traces (the R1-distill family).

§ 05

GRPO deep dive

Group Relative Policy Optimization

PPO without the critic — DeepSeek's key innovation

GRPO (Shao et al. 2024) eliminates the critic (value function) from PPO by estimating the baseline from the average reward of a group of responses generated for the same prompt. This removes one entire trained model from PPO's 4-model setup, saving roughly 25–50% VRAM. One caveat: Dr. GRPO later argued that the divide-by-std term in the advantage over-weights prompts whose rewards have low variance (very easy or very hard questions), and that GRPO's per-response length normalisation biases response length, so later recipes often drop both.

PPO — 4 models

Policy (trained) + Reference (frozen) + Reward model (frozen) + Critic (trained)

The critic predicts future reward from each intermediate state — a hard function to learn for text. Requires 80 GB+ VRAM for a 7B model.

GRPO — 3 models (no critic)

Policy (trained) + Reference (frozen) + Reward model (frozen)

The baseline is estimated from the group of responses themselves. No learned value function needed. Roughly 25–50% less VRAM than PPO.
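
A back-of-the-envelope check of the savings figure, under assumptions that are not from the course: all models are 7B, frozen models hold bf16 weights only (2 bytes per parameter), and each trained model costs about 16 bytes per parameter (bf16 weights and gradients plus fp32 Adam state). Activations, KV cache, sharding, offloading, and parameter-efficient fine-tuning are ignored, so the absolute numbers are illustrative only.

```python
def training_memory_gb(params_billion, n_trained, n_frozen,
                       bytes_trained=16, bytes_frozen=2):
    """Rough weight + optimizer footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * (n_trained * bytes_trained + n_frozen * bytes_frozen)

# PPO: policy + critic trained, reference + reward model frozen
ppo_gb = training_memory_gb(7, n_trained=2, n_frozen=2)    # 7 * (32 + 4) = 252
# GRPO: only the policy is trained, reference + reward model frozen
grpo_gb = training_memory_gb(7, n_trained=1, n_frozen=2)   # 7 * (16 + 4) = 140

saving = 1 - grpo_gb / ppo_gb                               # about 0.44
```

Under this accounting, dropping the critic saves about 44%, inside the 25–50% range quoted above. The saving is large because a trained model carries gradients and optimizer state that a frozen one does not.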

GRPO algorithm — step by step

Step 1: Group sampling. For each prompt q, generate G outputs {o₁, o₂, … oG} from the current policy. G=64 in DeepSeekMath. Larger G = more stable advantage estimate.
Step 2: Score each output. Run each output through the reward model or verifier to get rewards r₁, r₂, … rG. For math: 1 if the answer is correct, 0 otherwise.
Step 3: Compute advantage. Â_i = (r_i − mean(r)) / std(r). Â > 0: above average → reinforce. Â < 0: below average → suppress. Â ≈ 0: average → no update.
Step 4: Clipped update. Maximise J = min(ratio × Â, clip(ratio, 1−ε, 1+ε) × Â) − β × KL(π_θ ‖ π_ref), where ratio = π_θ/π_old, ε=0.2, β=0.04.
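
Steps 3 and 4 can be sketched for a single prompt's group as follows. This is a simplification: it uses one sequence-level log-probability per output instead of per-token terms, takes the KL estimate as a given number, uses the population standard deviation, and adds a small `eps` to avoid dividing by zero. All of these are assumptions for illustration.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Step 3: normalise each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_objective(logp_new, logp_old, advantages, kl_to_ref,
                   eps_clip=0.2, beta=0.04):
    """Step 4: mean clipped surrogate over the group minus the KL penalty."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_theta / pi_old
        clipped = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms) - beta * kl_to_ref

# Four sampled answers to one math prompt: two correct, two wrong
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])  # roughly [1, -1, -1, 1]
```

If every output in a group earns the same reward, all advantages are zero and the prompt contributes no learning signal. This is one reason group size and prompt difficulty matter in practice.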

Why group sampling works as a baseline

A value function must predict future reward from intermediate text — notoriously hard. GRPO sidesteps this by observing multiple complete trajectories for the same prompt. The group average reward is a Monte Carlo estimate of expected reward — simpler, more direct, needing no separate network. Key: G ≥ 16, typically 64.

DeepSeekMath hyperparameters

G = 64 outputs per prompt · Batch = 1,024 · Learning rate = 1e-6 · KL coefficient = 0.04 · Max sequence length = 1,024 tokens · Single policy update per exploration stage · Training data: GSM8K + MATH (chain-of-thought format).

§ 06

Benchmarks

Evaluating reasoning models

How we measure whether a model can actually think

Reasoning benchmarks test multi-step problem solving, not pattern matching. Key metrics: Pass@k (does at least one of k attempts succeed?), majority voting (most common answer from N samples), and exact match on verifiable answers. Frontier models have saturated many older benchmarks — the field continuously moves to harder problems.
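
Majority voting and exact match are simple enough to sketch directly. The function names are illustrative, and answers are assumed to be already extracted and normalised into comparable strings.

```python
from collections import Counter
from typing import List

def majority_vote(answers: List[str]) -> str:
    """Most common final answer among N samples (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]

def exact_match(prediction: str, gold: str) -> bool:
    """Exact match on a verifiable answer after trimming whitespace."""
    return prediction.strip() == gold.strip()

samples = ["42", "41", "42", "7", "42"]
voted = majority_vote(samples)          # "42"
correct = exact_match(voted, " 42 ")    # True
```

Voting helps only when the correct answer is also the most frequent one. A model that is consistently wrong in the same way gains nothing from more samples.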

Pass@k

Evaluation metrics for reasoning models

Generate k independent responses. Pass if at least one is correct. Pass@1 = accuracy. Pass@100 = ceiling of the model's potential. DeepSeek-R1 is evaluated with Pass@1 averaged over 64 samples on AIME.
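
In practice Pass@k is usually estimated from n ≥ k samples, not from exactly k. The sketch below uses the standard unbiased estimator from the code-generation evaluation literature, which is not stated in the text above: with c correct answers among n samples, it is one minus the probability that k samples drawn without replacement are all wrong.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k wrong samples, so any k-subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# 64 samples on one problem, 16 of them correct
p1 = pass_at_k(64, 16, 1)   # 0.25, the plain fraction of correct samples
p8 = pass_at_k(64, 16, 8)   # much higher: only one of 8 draws has to succeed
```

For k = 1 the estimator reduces to c / n, which is the "Pass@1 averaged over 64 samples" protocol described above.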

Frontier GSM8K (saturated): ~97%

DeepSeek-R1 AIME 2024 pass@1 (~71% for R1-Zero)

Frontier GPQA Diamond: ~90%

§ 07

RLHF and the reasoning turn.