THE SHIFT
From prompted chains to trained reasoning
In 2022, chain-of-thought was a prompting trick: add 'think step by step' to the prompt and accuracy on multi-step problems jumped. The 2024–25 generation (OpenAI's o-series, DeepSeek-R1, Claude's extended thinking) made it a trained behavior: models are optimized with reinforcement learning to produce long intermediate reasoning before answering. The chain of thought stopped being a user technique and became part of the model.
RLVR
Verifiable rewards: the answer key replaces the taste-tester
Reinforcement Learning from Verifiable Rewards (RLVR) is the engine. For math, the reward is 'did the final answer match the reference'; for code, 'did the tests pass.' No human labelers and no learned reward model to fool: a deterministic checker grades millions of attempts. DeepSeek-R1-Zero's headline result: pure RL on verifiable rewards, with no supervised reasoning examples at all, taught a base model to reflect, backtrack, and self-correct. The behaviors emerged because they won reward.
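To make the 'answer key' idea concrete, here is a minimal Python sketch of what a verifiable reward might look like. It is an illustration under stated assumptions, not any lab's actual grader: the function names (`extract_final_answer`, `math_reward`, `code_reward`) are hypothetical, the `\boxed{...}` answer convention is assumed, and a real pipeline would run untrusted code in a sandbox rather than with a bare `exec`.

```python
import re
from typing import Any, Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the text, or None.

    Assumption: the model is told to put its final answer in \\boxed{}.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer equals the reference string, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


def code_reward(source: str, func_name: str,
                tests: list[tuple[tuple[Any, ...], Any]]) -> float:
    """1.0 if the generated function passes every test case, else 0.0.

    WARNING: exec on untrusted model output is unsafe; this is a toy.
    """
    namespace: dict[str, Any] = {}
    try:
        exec(source, namespace)  # define the candidate function
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in tests)
        return 1.0 if passed else 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime crashes all score zero.
        return 0.0


if __name__ == "__main__":
    attempts = [
        "12 * 12 = 144, so the answer is \\boxed{144}",
        "I believe it is \\boxed{124}",
        "The answer is 144",  # right value, but no checkable final answer
    ]
    print([math_reward(a, "144") for a in attempts])
```

The checker never judges how the reasoning reads; it scores only the outcome. That is why there is nothing learned to fool, and also why an attempt with the right value in the wrong format earns no reward in this sketch.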