Foundations · Module 11 · 10 min

Why next-token works.

Why predicting the next token produces emergent reasoning, in-context learning, and arithmetic.

Prerequisites·21 · Tokens + 12 · Attention Modalities
§ 01

The core confusion

The distinction that matters

Training objective ≠ emergent capability ≠ deployed purpose

These three things are often conflated — and conflating them causes enormous confusion.

Training objective: predict the next token. This is the loss function. The mechanism by which weights are updated. It is a mathematical operation, not a description of what the model can do.

Emergent capability: reasoning, coding, translation, summarisation, instruction following. The model was never explicitly trained for any of these. They emerged as side effects of doing next-token prediction at scale on a sufficiently rich and diverse corpus.

Deployed purpose: be a useful assistant, coding partner, reasoning engine, or agent. This is what the model is for, and it is achieved through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) on top of the pretrained base.
Training objective
Predict the next token. A mathematical loss. The vehicle, not the destination.
Emergent capability
Reasoning, coding, translation — side effects of doing prediction at scale.
Deployed purpose
Useful assistant, agent, tool. Achieved via SFT + RLHF on top of the base model.
§ 02

What the objective is

The objective in full

Next-token prediction: the most powerful self-supervised objective ever discovered

The objective is deceptively simple: given all tokens so far, assign a probability to every possible next token. Minimise cross-entropy loss. Repeat on every position in every document in a corpus of trillions of words.
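The loss described above can be written out in a few lines. This is a minimal sketch using only the standard library: the four-token vocabulary, the logit values, and the function names `softmax` and `next_token_loss` are illustrative assumptions, not part of any real training stack.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Mean cross-entropy (in nats) between predictions and the true next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy setup: vocabulary of 4 tokens, 2 positions to predict.
targets = [0, 1]
uniform_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident_logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]]

loss_uniform = next_token_loss(uniform_logits, targets)      # a model that knows nothing
loss_confident = next_token_loss(confident_logits, targets)  # a model that favours the right token
print(loss_uniform, loss_confident)
```

A model that spreads probability evenly over four tokens pays ln(4) nats per position; putting more mass on the correct token lowers the loss.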

What makes this objective so powerful is what it implicitly requires. To consistently predict the next word in a physics textbook, a model must represent physics. To predict the next line of Python code, it must understand code. To predict the next word in a conversation, it must model the speaker's intent. The objective forces the model to build an internal model of the world from which text is generated — not just to memorise patterns.
Traditional supervised learning
Needs human-labelled examples for every task. "This image contains a cat." "This email is spam." Requires massive annotation effort per task. Cannot generalise beyond labelled categories. Scaling requires proportionally more human work.
Next-token prediction
Zero human labels needed. Any text is training data. The text itself is the supervision: the next word is always known. Scaling is mostly a matter of collecting more text, not more annotation. The model learns to do thousands of tasks as a side effect of learning to predict text.
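To make "the text itself is the supervision" concrete, the sketch below turns one raw sentence into training examples with no labelling step. The function name `next_token_examples` is hypothetical, and splitting on whitespace is an assumption standing in for a real tokenizer.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace splitting is a stand-in for a real tokenizer.
tokens = "the cat sat on the mat".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

A sequence of n tokens yields n - 1 supervised examples, which is why any text at all can serve as training data.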
The compression argument

To predict well, you must compress the world into weights

Predicting the next token in a corpus of trillions of words is an extreme compression problem. A model that merely memorises could not generalise to new sentences. A model that truly predicts must extract the deep regularities — the grammar, the facts, the causal structures, the social norms — that generate the text. The lower the loss, the richer the internal representation. In this sense, a well-trained language model is a compressed model of the world, constructed entirely from the statistics of human writing.

This is why Ilya Sutskever (OpenAI co-founder) has argued that predicting the next token well means understanding the underlying reality that led to the creation of that token.
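The link between prediction and compression in the argument above can be made concrete: a symbol predicted with probability p costs about -log2(p) bits under an ideal code, so better prediction means a shorter encoding. The sketch below is a toy illustration at character level. The sample string and the two "models" (`uniform` and `unigram`) are assumptions chosen for simplicity, and the unigram model is fitted on the very text it encodes, which a real evaluation would not do.

```python
import math
from collections import Counter

def code_length_bits(text, prob):
    """Ideal code length: the sum of -log2 p(symbol) over the text."""
    return sum(-math.log2(prob(ch)) for ch in text)

text = "abracadabra abracadabra"
alphabet = sorted(set(text))
counts = Counter(text)

def uniform(ch):
    """A model that knows nothing: every symbol is equally likely."""
    return 1 / len(alphabet)

def unigram(ch):
    """A model that has learned how often each symbol occurs."""
    return counts[ch] / len(text)

bits_uniform = code_length_bits(text, uniform)
bits_unigram = code_length_bits(text, unigram)
print(f"uniform: {bits_uniform:.1f} bits, unigram: {bits_unigram:.1f} bits")
```

The model that captures more regularity needs fewer bits, and a language model's cross-entropy loss is this same quantity measured per token.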
§ 03

Emergence simulator

Emergent capabilities

Abilities that appear suddenly at scale — and were never explicitly trained for

Emergence in AI refers to capabilities that are absent in small models but appear abruptly, and are hard to forecast, as model size increases. These abilities were not programmed and not listed as training objectives. They arise from the interaction of scale, data diversity, and the pressure of predicting text well.
Why emergence is surprising

Loss improves smoothly — but capability appears suddenly

This is what makes emergence scientifically fascinating. If you plot cross-entropy loss against training compute, you get a smooth, roughly power-law curve that can be extrapolated with reasonable confidence. But if you plot "can the model do 3-digit addition?" against compute, you get a flat line near zero, then a sudden jump to near-perfect performance. The capability does not appear to improve gradually: it seems to cross some threshold and appear.

Wei et al. (2022) documented this across 137 tasks and 8 model families. An important caveat from Schaeffer et al. (2023): much of the measured "emergence" is an artifact of the metric — exact-match scoring makes gradual improvement look like a sudden jump, and under continuous metrics many of the same capabilities improve smoothly. Some genuine discontinuities may remain, but the "mirage" result means emergence claims should always be checked against the choice of metric.
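The metric effect described by Schaeffer et al. can be illustrated with simple arithmetic. The sketch below assumes, as a simplification, that each token of an answer is correct independently with the same probability; the accuracy values and the 10-token answer length are hypothetical, not measurements from any model.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies, improving smoothly with scale.
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
exact = [exact_match_rate(p, 10) for p in per_token]

for p, em in zip(per_token, exact):
    print(f"per-token accuracy {p:.2f} -> exact-match {em:.3f}")
```

The underlying skill rises steadily, yet exact-match stays close to zero for most of the range and then climbs steeply, which is the shape that gets read as a sudden jump.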
§ 04

In-context learning


The ability that shouldn't exist — learning from examples without updating any weights

Standard machine learning requires training: you show the model examples, run backpropagation, and update the weights. In-context learning (ICL) is different: the model is shown examples inside the prompt itself, as text, and immediately generalises to new examples without any weight updates whatsoever. No backpropagation. No gradient. The model "learns" from context that is just tokens in the input. GPT-3 (Brown et al., 2020) was the first model to demonstrate this at scale, and the result surprised much of the research community.
Zero-shot: no examples at all
Just ask the model to do something. "Translate this to French." "Summarise this article." "Solve this maths problem." At sufficient scale, models can perform many tasks zero-shot — purely from the instruction and their training. GPT-3 showed this for the first time at scale. ChatGPT's conversational ability is largely zero-shot generalisation from instruction tuning.
Few-shot: examples in the prompt
Provide 3–10 (input, output) examples before the actual question. The model adapts to the pattern immediately. GPT-3's few-shot results were competitive with fine-tuned models on several benchmarks, without updating a single weight. This was the key result that made LLMs practically useful: many tasks need no task-specific training.
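A few-shot prompt is nothing more than text assembled in a consistent pattern. The sketch below shows one possible layout: the `Input:`/`Output:` labels, the function name `build_few_shot_prompt`, and the translation examples are illustrative assumptions, and no model is called.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Lay out (input, output) pairs, then the new input with its output left blank."""
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model's continuation is the answer
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("house", "maison"), ("book", "livre")]
prompt = build_few_shot_prompt(examples, "dog", "Translate English to French.")
print(prompt)
```

Passing an empty `examples` list gives the zero-shot version of the same prompt: only the instruction and the query remain.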
Chain-of-thought prompting

Showing the model how to think, not just what to answer

Wei et al. (2022) found that including reasoning steps in few-shot examples substantially improves performance on complex tasks. Instead of showing (question, answer) pairs, you show (question, step-by-step reasoning, answer) pairs. The model then generates its own reasoning chain before answering, which raises accuracy on maths, logic, and multi-step problems.

This works because reasoning is text. If the model can predict text well, and reasoning appears in training text, then the model has learned to generate reasoning. The chain-of-thought examples in the prompt simply activate this latent ability. In the discovery era, no training was required — just prompting.
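The difference from a plain few-shot prompt is only that each worked example carries its reasoning. The sketch below is one possible layout, not the format used in the original paper: the `Q:`/`A:` labels, the function name `build_cot_prompt`, and the arithmetic examples are all made up for illustration.

```python
def build_cot_prompt(examples, question):
    """Lay out (question, reasoning, answer) triples, then the new question."""
    parts = []
    for q, reasoning, answer in examples:
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")  # the model is expected to reason before answering
    return "\n\n".join(parts)

# Hypothetical worked examples written for this sketch.
examples = [
    ("A shelf holds 3 boxes with 4 pens each. How many pens are there?",
     "There are 3 boxes and each has 4 pens, so 3 * 4 = 12.",
     "12"),
    ("Sam has 10 apples and gives away 3. How many are left?",
     "Sam starts with 10 and gives away 3, so 10 - 3 = 7.",
     "7"),
]
cot_prompt = build_cot_prompt(examples, "A bus has 5 rows of 6 seats. How many seats are there?")
print(cot_prompt)
```

Because the examples end with a fixed phrase ("The answer is ..."), the final answer can be picked out of the model's continuation with a simple string search.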
2026

Reasoning is now trained, not just prompted

Chain-of-thought is no longer only an emergent trick you elicit with clever prompts. Since late 2024, reasoning has been trained directly: models like OpenAI's o-series, DeepSeek-R1, and Claude's extended thinking are post-trained with reinforcement learning with verifiable rewards (RLVR), using maths and code problems where correctness can be checked automatically, so the model learns to produce long, useful reasoning chains on its own. This also changed the scaling story. Alongside pretraining compute, the field now scales inference-time compute: letting a model think longer at answer time buys accuracy, which extends the earlier narrative built purely on pretraining scaling laws.
Then
Chain-of-thought was elicited by few-shot prompting (2022–23) — a latent ability you activated with examples, no training required.
Now · June 2026
Chain-of-thought is trained directly via RL on verifiable rewards (o-series, DeepSeek-R1, Claude extended thinking). "No training required" describes the discovery era, not current practice.
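What makes a reward "verifiable" is that a program, not a human rater, decides whether the output is correct. The sketch below shows only that checking step, under loudly stated assumptions: the `Answer: <integer>` output convention and the function name `verifiable_reward` are hypothetical, and the reinforcement learning update that would consume the reward is not shown.

```python
import re

def verifiable_reward(completion, reference_answer):
    """Return 1.0 if the completion ends with 'Answer: <integer>' matching the reference, else 0.0."""
    match = re.search(r"Answer:\s*(-?\d+)\s*$", completion.strip())
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    return 1.0 if int(match.group(1)) == reference_answer else 0.0

# Hypothetical model outputs for the question "What is 17 + 25?"
good = "17 + 25: 17 + 20 = 37, then 37 + 5 = 42.\nAnswer: 42"
bad = "17 + 25 is roughly 40.\nAnswer: 40"
unparseable = "I think it is forty-two."

print(verifiable_reward(good, 42), verifiable_reward(bad, 42), verifiable_reward(unparseable, 42))
```

The reward looks only at the final answer, so the model is free to produce whatever intermediate reasoning leads to correct results. For code problems the check would typically be a test suite rather than a string match.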
§ 05

The iceberg

The iceberg metaphor

Next-token prediction is the visible tip — a vast structure of capability lies beneath

Think of an iceberg. What is visible above the waterline is the training objective: predict the next token. What is beneath the waterline — invisible but responsible for the entire structure — is the knowledge, reasoning capacity, and world model that the model must build in order to predict well. The objective is the surface. The capability is the depth.
The world model hypothesis
Some researchers (Sutskever, Hinton, others) argue that a sufficiently capable language model has implicitly built a "world model" — an internal representation of causal structure, physical laws, social dynamics, and factual knowledge. The evidence: LLMs can answer counterfactual questions ("what would happen if..."), perform analogical reasoning, and generalise to tasks they were never explicitly trained on. The most prominent critic is Yann LeCun, who argues that text-only prediction cannot yield a genuine world model — that requires grounding in perception and action.
The stochastic parrot counterargument
Bender et al. (2021) argued that LLMs are "stochastic parrots": sophisticated pattern matchers that reproduce statistical regularities without genuine understanding. On this view, the lack of grounding in perception and action means there is no meaning behind the tokens. The debate is unresolved, and it is one of the most important open questions in AI research.
§ 06

Misconceptions

Common misconceptions

Six things people say about LLMs that are wrong — or at least incomplete

These misconceptions are everywhere — in news articles, classroom discussions, and even some technical papers. They come from conflating the training objective with the model's capability, or from misunderstanding what "prediction" means at this scale.
The analogy: how humans learn

A child learns language by predicting — not because prediction is the goal

A human child learning to speak is implicitly doing something similar. They hear language and build internal models of what words mean and what can follow what. This learning pressure builds their conceptual understanding of the world. You would not say a child's "purpose" is to predict the next word in a sentence, even if prediction is plausibly part of the signal that drives language acquisition. The same applies to LLMs. The training objective is a means to an end. The end is a rich, useful model of how language and the world work together.
§ 07

Further reading.
