When an AI finishes its "Base Training" (reading the entire internet), it becomes like an extremely talented Master Chef who can cook literally anything—including poison, dirt, or burnt toast! If you ask a raw base model a question, it might insult you or just give you internet spam.
To fix this, scientists use RLHF (Reinforcement Learning from Human Feedback). They hire a "Taste Tester." The Chef prepares two meals, and the Taste Tester decides which one is better. Over time, the Chef learns exactly what humans like, and refuses to cook the poison.
Interactive: Be the Taste Tester!
You are the Human Feedback node. The AI will give you two responses. Click the one that is most helpful and safe. Watch how your feedback causes the AI's internal behavioral stats to drop its toxicity!
The Reward Model
Automating the human
Humans are slow and expensive, so companies don't use them to train the final model directly. Instead, they use the human clicks to train a Reward Model—a smaller "judge" AI that mimics human preferences.
Once the Reward Model is trained, it can act as the Taste Tester 10,000 times a second, rapidly teaching the main AI how to behave.
DPO
Direct Preference Optimization
In 2023, scientists invented a way to skip the complicated "Reward Model" entirely. DPO (Direct Preference Optimization) is a mathematical shortcut that takes the human's "Choice A over Choice B" and shoves it directly into the AI's brain in one step.
It is much more stable, requires fewer computers, and became the favorite of open-source builders because it's so simple! (The big frontier labs still mostly use the full taste-tester loop — on-policy RL.)
And for math and code, the newest AIs skip the taste-tester entirely: they check their work against an answer key instead — scientists call this "verifiable rewards."
§ 02
The playground.
Theory above, instrument below. This interactive panel runs live in the page — drag, type, and watch the mechanism respond.