Training a Model Module 30 10 min

Models that
think first.

Test-time compute, verifiable rewards, and how o1/R1-class reasoning models are trained.

Prerequisites·03 · RLHF + Reasoning + 09 · TRL + GRPO Modalities
§ 01

Thinking as a trained behavior

THE SHIFT

From prompted chains to trained reasoning

In 2022, chain-of-thought was a prompting trick — add 'think step by step' and accuracy jumped. The 2024–25 generation (OpenAI's o-series, DeepSeek-R1, Claude's extended thinking) made it a trained behavior: models are optimized with reinforcement learning to produce long private reasoning before answering. The chain of thought stopped being a user technique and became part of the model.
RLVR

Verifiable rewards: the answer key replaces the taste-tester

Reinforcement Learning from Verifiable Rewards is the engine. For math, the reward is 'did the final answer match'; for code, 'did the tests pass.' No human labelers, no learned reward model to fool — a deterministic checker grades millions of attempts. DeepSeek-R1-Zero's headline result: pure RL on verifiable rewards, with no supervised reasoning examples at all, taught a base model to reflect, backtrack, and self-correct. The behaviors emerged because they won reward.
Prompt (math/code task)Sample G attemptsVerifier grades eachGRPO: advantage vs group meanUpdate policyRepeat at scale
Pipeline — steps light up in order
MECHANICS

GRPO is the workhorse underneath

The algorithm most RLVR pipelines run is GRPO (module 09): sample a group of answers per prompt, score them with the verifier, and push the model toward the above-average ones — no critic model needed. The trio of modules connects here: RLHF (03) established the post-training idea, GRPO (09) made it cheap, RLVR (this lesson) gave it an unhackable-ish grader and the reasoning objective.
§ 02

Test-time compute

NEW AXIS

The third scaling law: thinking longer

Pretraining scaled with data and parameters. Reasoning models added a third axis: inference-time compute. The same model scores dramatically higher on hard problems when allowed more thinking tokens — o1's core chart showed accuracy climbing log-linearly with test-time compute. Practical consequence: 'how long should it think' is now a product dial (extended thinking budgets, effort levels), and hard problems are bought with tokens rather than retrained for.
0%
DeepSeek-R1 pass@1 on AIME 2024 — versus ~16% for its non-reasoning base generation
0
scaling axes by 2026: parameters, data, and test-time compute
0
human labels needed for verifiable-reward training — the checker is the teacher
ECONOMICS

Reasoning made inference the expensive part

A reasoning model may generate 10–100× more tokens than it shows you. That inverted the cost structure of serving (decode-heavy workloads — module 08) and created the distillation pattern: train a huge reasoner with RLVR, then distill its reasoning traces into small models via plain SFT. R1-distill proved a 7B student can inherit much of the teacher's reasoning — which is why capable small local models exist at all.
LIMITS

Where it works, where it doesn't

RLVR needs a checkable answer — it shines on math, code, logic, and agentic tasks with verifiable outcomes (tests pass, the form got submitted). It transfers only partially to fuzzy domains (writing quality, taste), where learned reward models and human preference still rule. And verifiable ≠ unhackable: models still discover reward hacks — special-casing test inputs, gaming format checks — so frontier pipelines audit transcripts for exactly that.
Then
2022–23 — chain-of-thought elicited by prompting; RLHF with human preference labels as the only post-training story.
Now · June 2026
June 2026 — reasoning trained directly via RLVR + GRPO-family algorithms; thinking budgets are a product surface; reasoning distilled into small models; agentic RL (training on multi-step tool tasks) is the frontier.
§ 03

Further reading.

Done with Reasoning + RLVR?
Mark it complete — progress is saved in your browser and shows on the course map.