Softmax + CE — AI Learning Course

§ 01

What problem they solve

The big picture

The output head must answer one question: what word comes next?

After the transformer processes your input through N attention blocks, you have a vector of numbers (the hidden state). But you need a word, not a vector. Two mathematical tools bridge this gap: softmax converts raw scores into a proper probability distribution every time the model generates a token, and cross-entropy measures how wrong that distribution was during training. Every word you have ever seen from ChatGPT, Claude, or Llama was sampled from a softmax output, by a model that learned from cross-entropy.

§ 02

Softmax — live playground

Softmax

Softmax turns raw scores into a probability distribution

The model produces one raw score (logit) per vocabulary word, but raw scores can be any number: positive, negative, very large, very small. Softmax converts them to a clean probability distribution where every number is between 0 and 1 and all numbers sum to 1. The formula: for each logit zᵢ, softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ, where the sum runs over every vocabulary word j.
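The formula above can be sketched in a few lines of plain Python. This is a minimal illustration only: the three logits are made-up values, and the function skips the numerical safeguards a real implementation would add.

```python
import math

def softmax(logits):
    """Naive softmax: exponentiate each logit, then divide by the total."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate words.
logits = [2.0, 1.0, -1.0]
probs = softmax(logits)

print(probs)       # three values, each between 0 and 1
print(sum(probs))  # 1.0 up to floating-point rounding
```

Note that the negative logit still maps to a positive probability, and the ordering of the logits is preserved in the ordering of the probabilities.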

Why e^x — why not just normalize?

You could divide each score by the sum of the scores (a simple ratio). But that breaks as soon as a score is negative or the scores sum to zero, and e^x does something better: it amplifies differences. If one logit is 3 and another is 1, their ratio is 3:1, but e³:e¹ ≈ 20.1:2.7 ≈ 7.4:1. The exponential makes the most confident prediction stand out more sharply, while never allowing any probability to reach exactly 0 (mathematically; in floating point, extremely small values can round to 0). Every word stays in the game, just with a very tiny probability.
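To make the comparison concrete, the sketch below contrasts plain ratio normalization with exponentiation. The scores are illustrative values chosen for this example, and linear_normalize is a hypothetical helper, not something a framework provides.

```python
import math

def linear_normalize(scores):
    """Divide each score by the sum of scores (the 'simple ratio' idea)."""
    total = sum(scores)
    return [s / total for s in scores]

a, b = 3.0, 1.0
linear_ratio = a / b                   # 3.0
exp_ratio = math.exp(a) / math.exp(b)  # e^2, about 7.39

# A negative score breaks plain normalization: one "probability" is negative
# and another is above 1.
broken = linear_normalize([3.0, -1.0])

print(linear_ratio, exp_ratio)
print(broken)
```

The exponential never has this problem, because e^x is positive for every real x.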

What "probability distribution" means

A probability distribution is a list of non-negative numbers that add up to 1.0. If the model assigns probability 0.7 to "cat", that means: if you let this model pick a word thousands of times from this exact state, about 700 out of 1,000 times it would choose "cat". Softmax guarantees this mathematical property: the output is always a valid distribution, regardless of what the raw logits look like.

Numerical stability trick — subtract the max

Real implementations use a small but critical fix

If a logit is very large (e.g. 1,000), computing e^1000 overflows the range of standard floating-point formats (the result becomes infinity, or an error). Real implementations first subtract the maximum logit from all logits before exponentiating: softmax(z)ᵢ = e^(zᵢ−max(z)) / Σⱼ e^(zⱼ−max(z)). Mathematically this produces the same probabilities, because the common factor e^(−max(z)) cancels between numerator and denominator, but the largest exponent is now 0, so nothing can overflow. Deep learning frameworks apply this trick (or an equivalent log-sum-exp formulation) inside their softmax routines.
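A small sketch of the subtract-the-max trick, using only Python's standard library. In Python, math.exp raises OverflowError on huge inputs rather than returning infinity, so the example catches that error to show the failure. The logits are made-up values.

```python
import math

def stable_softmax(logits):
    """Softmax with the max logit subtracted before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 999.0, 998.0]

# The naive route fails on the very first exponential.
try:
    math.exp(big[0])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

# Shifting every logit by the same constant leaves the result unchanged.
probs_big = stable_softmax(big)
probs_small = stable_softmax([2.0, 1.0, 0.0])

print(naive_overflowed)
print(probs_big)
print(probs_small)
```

Both inputs reduce to the shifted logits 0, −1, −2, which is why the two outputs match.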

§ 03

Temperature

Temperature controls how "confident" the distribution looks

Before applying softmax, every logit is divided by a temperature value T. At T=1 (default), softmax runs normally. At T<1, logits get amplified — the distribution sharpens into a spike and the model becomes very "decisive." At T>1, logits get compressed — the distribution flattens and the model becomes more random and "creative." This is the single number behind every "creativity" slider you've seen in AI tools.
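The effect of dividing by T can be checked directly. The following is a sketch under simple assumptions: softmax_with_temperature is a hypothetical helper name, and the logits are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply a max-shifted softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)  # T < 1
base = softmax_with_temperature(logits, temperature=1.0)   # T = 1
flat = softmax_with_temperature(logits, temperature=2.0)   # T > 1

for name, dist in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

The top token's probability grows as T shrinks and falls as T grows, while the ranking of tokens never changes. T = 0 is excluded here because it would divide by zero; samplers usually treat it as a special case that always picks the highest-scoring token.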

Low T — code generation

When writing code, you want predictable, correct syntax. Temperature 0.1–0.4 concentrates almost all probability on the single most-likely next token. In code, there is often only one right way to close a bracket or continue a function signature. Low temperature exploits this: the model acts almost deterministically.

High T — creative writing

When writing a poem or brainstorming, you want surprising word choices. Temperatures around 0.8–1.2 keep the distribution near the model's native shape at T=1, and values above 1 actively flatten it, so lower-probability (unexpected) words get a real chance of being selected. This is why "creative mode" in LLM tools tends to produce more vivid, unusual language: the sampler genuinely draws from the long tail.

Top-p sampling (nucleus sampling)

A complement to temperature, not a replacement

Top-p sampling (Holtzman et al. 2020) takes the smallest set of tokens whose cumulative probability exceeds p (e.g. p=0.9), then samples only from that set. This adapts dynamically — when the model is confident, the nucleus is small (just a few words). When uncertain, the nucleus is large (many plausible words). In practice top-p is used together with temperature: temperature shapes the distribution, then top-p truncates its tail. The modern sampler stack also includes min-p, top-k, and repetition penalties — and reasoning models often ship with fixed recommended temperatures rather than exposing a free dial.
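The nucleus selection described above might be sketched as follows. This is an illustrative version, not a library's implementation: top_p_filter is a hypothetical name, the probabilities are invented, and the cutoff here is "cumulative probability reaches at least p".

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative mass reaches p.

    Returns {token_index: renormalized_probability}.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

confident = top_p_filter([0.95, 0.03, 0.01, 0.01], p=0.9)  # one token suffices
uncertain = top_p_filter([0.30, 0.30, 0.20, 0.20], p=0.9)  # needs all four

# Sampling then draws only from the surviving tokens.
choice = random.choices(list(uncertain), weights=list(uncertain.values()))[0]

print(confident)
print(uncertain)
print(choice)
```

The nucleus size adapts to the shape of the distribution, which is the property that a fixed top-k cutoff lacks.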

§ 04

Cross-entropy loss

Cross-entropy measures how wrong the model's probability was

During training, we know the correct next word. Cross-entropy asks: what probability did the model assign to that correct word? The formula is simply −log(p_correct). If the model was very confident and correct (p≈1.0), the loss is near 0. If the model was confident but wrong (p≈0.0), the loss is very large. The gradient of this loss tells every weight in the network which direction to move to become more correct next time.

Why −log(p)?

The negative (natural) log has exactly the right shape: when p=1.0, −log(1)=0 (no loss, a perfect prediction). When p=0.5, −log(0.5)≈0.69 (moderate loss). When p=0.01, −log(0.01)≈4.6 (high loss). And as p→0, the loss approaches infinity: the model is punished extremely harshly for being confident and wrong. This asymmetry is intentional: being certain and wrong is much worse than being uncertain.
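The values quoted above are easy to reproduce. A minimal sketch, assuming the natural logarithm and a hypothetical cross_entropy helper that takes the model's distribution and the index of the correct word:

```python
import math

def cross_entropy(probs, target_index):
    """Loss for one prediction: -log of the probability given to the correct word."""
    return -math.log(probs[target_index])

# Each distribution puts a different amount of mass on the correct word (index 0).
loss_certain = cross_entropy([1.0, 0.0], 0)
loss_coin_flip = cross_entropy([0.5, 0.5], 0)
loss_confidently_wrong = cross_entropy([0.01, 0.99], 0)

print(round(loss_coin_flip, 2))          # about 0.69
print(round(loss_confidently_wrong, 2))  # about 4.61
```

Going from p = 0.5 to p = 0.01 multiplies the loss by more than six, which is the harsh penalty for confident mistakes in numeric form.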

Perplexity = exp(loss)

Perplexity is the standard metric for language model quality. It equals e raised to the average cross-entropy loss per token (measured in nats). Intuitively, perplexity is "how many words was the model choosing between on average?" A perplexity of 10 means the model was effectively guessing among 10 equally likely words at each step. Lower is better, and a perfect model would have perplexity 1.0. GPT-2 on Wikipedia: ~35; frontier models score far lower, though exact figures are rarely published. One caveat: perplexities are only comparable between models that share a tokenizer, because different vocabularies make the per-token numbers incommensurable.
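The "guessing among 10 equally likely words" reading can be verified numerically. A small sketch, where perplexity is a hypothetical helper that takes the probability the model assigned to each correct token:

```python
import math

def perplexity(correct_token_probs):
    """exp of the mean per-token cross-entropy (natural log)."""
    losses = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(losses) / len(losses))

# A model that always gives the correct token probability 1/10.
uniform_over_ten = perplexity([0.1] * 50)

# A perfect model gives the correct token probability 1 every time.
perfect = perplexity([1.0] * 50)

print(uniform_over_ten)  # 10 up to floating-point rounding
print(perfect)           # 1.0
```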

§ 05

How training uses them

Training loop

Softmax + cross-entropy together form the training loop's feedback signal

During training, the model processes a sentence and must predict each next word. Softmax converts its raw scores to probabilities. Cross-entropy compares those probabilities to the actual correct words. The resulting loss flows backwards through every weight in the network — nudging each weight slightly so the model would make a better prediction next time. Across thousands of GPUs, millions of token predictions are scored every second — though the weights themselves update only about once per second or slower, since each optimizer step batches millions of tokens. Every ability the model has came from this loop.
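One step of this feedback loop can be sketched on a toy scale. For softmax followed by cross-entropy, the gradient with respect to the logits is the predicted distribution minus the one-hot target. The example below applies that gradient to the logits directly, which is a simplification: real training pushes the gradient back into the network's weights. The logits, target, and learning rate are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(logits, target):
    """Cross-entropy loss and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad

logits = [1.0, 2.0, 0.5]  # the model currently prefers word 1
target = 0                # but the correct next word is word 0
learning_rate = 0.5

loss_before, grad = loss_and_grad(logits, target)
updated = [z - learning_rate * g for z, g in zip(logits, grad)]
loss_after, _ = loss_and_grad(updated, target)

print(loss_before, loss_after)  # the loss drops after one step
```

The gradient is negative for the correct word (its logit is pushed up) and positive for every other word (their logits are pushed down).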

SFT: train on completions only

During supervised fine-tuning, cross-entropy loss is computed ONLY on the assistant's response tokens, not on the user's prompt. The model is taught to generate good answers, not to reproduce questions. Unsloth exposes this as its train_on_responses_only helper, and TRL's SFTTrainer offers equivalent completion-only loss masking. Without it, part of the training signal is spent on predicting the prompt the model was given.
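The masking idea can be shown without any training library. In the sketch below, sft_loss is a hypothetical helper, and the per-token probabilities and the prompt/response split are invented for illustration. Frameworks typically achieve the same effect by giving prompt positions a special "ignore" label so they drop out of the loss.

```python
import math

def sft_loss(correct_token_probs, is_response):
    """Mean cross-entropy over response tokens only; prompt tokens are skipped."""
    losses = [
        -math.log(p)
        for p, keep in zip(correct_token_probs, is_response)
        if keep
    ]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Probability the model gave to each correct token in one training example.
probs = [0.1, 0.2, 0.5, 0.8, 0.9]
# First two tokens belong to the user's prompt, the rest to the assistant.
mask = [False, False, True, True, True]

masked = sft_loss(probs, mask)
unmasked = sft_loss(probs, [True] * len(probs))

print(masked, unmasked)
```

Here the poorly predicted prompt tokens inflate the unmasked loss, even though the model is never asked to generate them.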

Loss curves — what good training looks like

During healthy training, the loss should decrease smoothly over time. A rough heuristic for supervised fine-tuning: training loss dropping below ~0.2 can signal overfitting — the model is memorising examples rather than learning generalisable patterns. This threshold is dataset-dependent, and it does not apply to pretraining, where loss typically converges around 1.8–2.2 nats. Validation loss rising while training loss falls confirms overfitting. The ideal: both curves decrease together, converging to a low stable value.

§ 06

In attention too

Softmax in attention

Softmax appears twice — once in attention, once in the output head

Most students learn about softmax only at the output layer. But it plays an equally critical role deep inside every transformer block — in the attention mechanism itself. The attention softmax and the output softmax solve the same problem (turn raw scores into a valid distribution) but serve completely different purposes.

Softmax in attention — Layer 3

Attention weights: how much to look at each past word

In attention, Q·Kᵀ / √d_k produces a score for every pair of tokens (how relevant is token j to token i?). These raw scores are then passed through softmax to become attention weights — a probability distribution over all positions. This tells the model: "when thinking about position i, spend 40% of your attention on position 3, 35% on position 7, etc." The weights sum to 1 per query position, just like the output probabilities sum to 1.
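For a single query position, the computation above reduces to a few lines. This is a sketch with tiny hand-picked vectors (d_k = 2), and attention_weights is a hypothetical helper name. Real models do this for all positions and heads at once with matrix operations.

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product scores for one query, turned into weights by softmax."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

query = [1.0, 0.0]
keys = [
    [1.0, 0.0],  # aligned with the query
    [0.0, 1.0],  # orthogonal to the query
    [1.0, 1.0],  # partly aligned
]

weights = attention_weights(query, keys)
print([round(w, 3) for w in weights])
```

The orthogonal key gets the smallest weight, and the weights over the three positions sum to 1.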

Softmax in output head — Layer 6

Token probabilities: what word to generate next

At the output, the linear projection produces one logit per vocabulary word (e.g. ~128,000 logits for a Llama 3-class tokenizer). Softmax converts these to token probabilities: the distribution you sample from to choose the next word. The same softmax is used during training (cross-entropy takes −log of the correct token's probability) and at inference (temperature rescales the logits before this softmax, then a token is sampled).

The causal mask + softmax interaction

Why the mask uses −∞, not 0

In decoder-only models, tokens can only attend to themselves and past tokens, not future ones. The causal mask sets the scores of future positions to −∞ (negative infinity) before the attention softmax. Since e^(−∞) is exactly 0, those future positions receive zero attention weight after softmax. If you used 0 instead of −∞, e^0 = 1, and future tokens would still receive a nonzero (unwanted) weight, often a substantial one, because a score of 0 is a perfectly ordinary logit. The −∞ trick makes the mask exact. (Implementations sometimes use a very large negative finite number instead, with the same practical effect.)
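The difference between masking with −∞ and masking with 0 shows up immediately in a small example. The 3×3 score matrix below is invented, and causal_attention is a hypothetical helper written for this illustration.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]  # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(scores, masked_value=NEG_INF):
    """Replace scores at future positions (j > i), then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else masked_value for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return weights

scores = [
    [0.5, 1.2, -0.3],
    [0.1, 0.4, 2.0],
    [0.9, -0.5, 0.3],
]

exact = causal_attention(scores)                    # mask with -inf
leaky = causal_attention(scores, masked_value=0.0)  # mask with 0 (wrong)

print(exact[0])  # first token attends only to itself
print(leaky[0])  # future positions still get weight
```

With the −∞ mask, row i has exact zeros at every position after i. With the 0 mask, the first token still spreads weight over the two positions it should not see.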

§ 07

The playground.

Theory above, instrument below. This interactive panel runs live in the page — drag, type, and watch the mechanism respond.

Playground · Softmax + CE

§ 08

Softmax meets cross-entropy.