The Transformer · Module 12a · 5 min

Additive attention, briefly.

The Bahdanau-style variant that scaled dot-product replaced.

Prerequisites: 12 · Attention Modalities
§ 01

The original: additive attention

Bahdanau 2014

The first attention scored relevance with a small neural network

Attention predates the transformer by three years. Bahdanau et al. (2014) bolted it onto an RNN translation model: at each decoding step, a small learned network scored how relevant each input word was to the word being generated, using score = v·tanh(W₁h + W₂s), where h is an encoder state, s is the decoder state, and v, W₁ and W₂ are learned parameters. Because the two projections are added inside the tanh, this is called additive attention. It solved the fixed-bottleneck problem of sequence-to-sequence RNNs: instead of cramming a whole sentence into one vector, the decoder could look back at every input position.
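As a rough illustration of the scoring rule above, here is a minimal pure-Python sketch. The matrices, vectors and the helper names (additive_score, softmax) are toy values and hypothetical names chosen for this example, not anything from the original paper's code.

```python
import math

def additive_score(h, s, W1, W2, v):
    """score = v . tanh(W1 h + W2 s) for one encoder state h and decoder state s."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]
    # The two projections are added inside the tanh, hence "additive".
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, h), matvec(W2, s))]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters and states, chosen only for illustration.
W1 = [[0.5, -0.2], [0.1, 0.3]]
W2 = [[0.4, 0.0], [-0.3, 0.2]]
v = [1.0, -1.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = [0.5, -0.5]  # current decoder state

scores = [additive_score(h, s, W1, W2, v) for h in encoder_states]
weights = softmax(scores)  # one weight per input position
# Context vector: weighted sum of encoder states, fed to the decoder.
context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(2)]
print(weights, context)
```

Every (encoder state, decoder state) pair goes through the small network separately, which is the per-pair cost discussed later in this section.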
Luong 2015

Simplification: just take the dot product

A year later, Luong et al. (2015) showed you could often drop the little scoring network entirely and measure relevance as a plain dot product between the decoder state and each encoder state. Same idea — compare two vectors, get a relevance score — with far less machinery. This is the direct ancestor of the Q·K score inside every transformer.
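For comparison, the dot-product score needs no learned scoring network at all. This is a small sketch with the same kind of toy vectors. The name dot_score is hypothetical, and Luong et al. also describe other scoring variants that are not shown here.

```python
def dot_score(h, s):
    """Relevance of encoder state h to decoder state s as a plain dot product."""
    return sum(hi * si for hi, si in zip(h, s))

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = [0.5, -0.5]  # current decoder state (toy values)

scores = [dot_score(h, s) for h in encoder_states]
print(scores)  # [0.5, -0.5, 0.0]
```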
Why dot-product won

Hardware, not math, decided it

Additive attention is not mathematically worse. In the original transformer paper's comparison, the two scoring functions perform similarly at small key dimensions, and additive attention outperforms unscaled dot-product attention at larger ones, which is why the 1/√d_k scaling was introduced. Dot-product attention won because of how GPUs work: scoring every query against every key becomes one batched matrix multiplication, the single operation GPUs are most optimised for. Additive attention needs a per-pair MLP evaluation that cannot be fused as cleanly. When the transformer made attention the whole model, that efficiency gap became decisive. Additive attention is historical now, but it is the foundation the rest of this module refines.
Then
Additive scoring — v·tanh(W₁h + W₂s) — powered the RNN attention era (2014–2016).
Now · June 2026
Scaled dot-product attention everywhere, executed by FlashAttention kernels that tile the computation to fit GPU memory hierarchies.
§ 02

What & why

The big idea

Attention lets every word look at every other word and decide what matters

Before attention, sequence models processed words one at a time, so information about one word reached a distant word only by surviving every step in between. This broke down for long-range dependencies like "The trophy didn't fit in the suitcase because it was too big", where "it" could refer to either noun. Attention solves this by letting "it" compare itself directly to every other word at once and decide which one is most relevant. This happens in every transformer block, for every token.
O(N²)
Attention is quadratic in sequence length — the main scaling challenge
64 Q / 8 KV
Llama 3 70B runs 64 query heads sharing 8 KV heads (GQA) per layer
0K
A typical frontier context window — frontier models ship 1M+ tokens
Before attention — RNNs
Recurrent Neural Networks processed words one at a time, left to right. Information about early words had to pass through every subsequent step — like a game of telephone. By the time a 500-word document reached its end, the RNN's memory of the opening sentences was severely degraded. Long-range dependencies were nearly impossible.
After attention — Transformers
Attention computes relationships between ALL pairs of tokens simultaneously — in parallel. Word 1 can directly attend to Word 500. The distance between words is irrelevant. This is why transformers can handle documents, books, and eventually 128K+ token contexts — the architecture has no inherent concept of "too far away."
§ 03

Live heatmap

Interactive attention heatmap

Click any word — watch what the model attends to

Each cell shows how much attention the query word (row) pays to the key word (column). Brighter = more attention. These weights are not hardcoded: they are computed for each input from learned projections. A real transformer would produce different patterns depending on the layer and head. The patterns below are illustrative but grounded in empirically observed attention behaviour.
What makes a pattern meaningful
Syntactic heads learn grammatical relationships: a verb attending to its subject, a pronoun attending to its antecedent. Semantic heads learn meaning relationships: "Paris" attending to "France." Positional heads learn position: each token attending mostly to the previous token. A model with 32 heads per layer runs 32 such perspectives in parallel in every layer.
Not all heads are equal
Research shows that in large models, a small fraction of attention heads do most of the "interesting" work — handling coreference, syntax, and factual associations. Many heads appear to learn near-trivial patterns. This has motivated pruning research: removing 20–40% of attention heads from BERT causes minimal quality loss.
§ 04

Q · K · V playground

Query · Key · Value

Three linear projections — Q asks, K answers, V delivers

Every token's embedding is projected through three separate learned weight matrices to produce three vectors: the Query (what am I looking for?), the Key (what do I contain?), and the Value (what information do I carry?). The dot product of Q and K gives a relevance score. Softmax converts scores to weights. The weighted sum of V vectors is the output. Drag the sliders below to feel how each dimension affects the attention score.
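The steps in the paragraph above (project, score, softmax, weighted sum) can be sketched in a few lines of pure Python. The weight matrices here are small made-up values and the helper names are hypothetical. The division by √d_k is the standard scaling step.

```python
import math

def matmul(A, B):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_q, W_k, W_v):
    """Single-head self-attention over token embeddings X (one row per token)."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))  # every query against every key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights  # weighted sum of value vectors

# Toy example: 3 tokens, d_model = 3, d_k = d_v = 2 (values are arbitrary).
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_v = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]

out, weights = attention(X, W_q, W_k, W_v)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Each row of weights belongs to one query token and says how much of every token's value vector ends up in that token's output.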
Why scale by √d_k?

Without scaling, softmax saturates and gradients vanish

With high-dimensional vectors (d=512, 1024, 4096), the typical magnitude of the dot product Q·K grows proportionally to √d: if the components of Q and K are independent with zero mean and unit variance, Q·K has variance d. Scale matters to softmax: scores of 50 vs 55 produce a far more lopsided output than scores of 5 vs 5.5, even though the ratio is the same. At large d, the scores become so large that softmax essentially puts all weight on the maximum, almost a hard argmax. This makes gradients near zero for non-maximum positions. Dividing by √d_k keeps scores in a regime where softmax behaves smoothly.
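Both claims are easy to check numerically. The sketch below first compares softmax on the 5 vs 5.5 and 50 vs 55 pairs, then estimates the spread of Q·K for random vectors. The standard-normal components, the trial count and the seed are assumptions of this example, not properties of any trained model.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Same ratio between the scores, very different softmax outputs.
print(softmax([5.0, 5.5]))    # roughly [0.38, 0.62]
print(softmax([50.0, 55.0]))  # roughly [0.007, 0.993]

def dot_std(d, trials=2000, seed=0):
    """Empirical standard deviation of q.k for random standard-normal q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d)]
        k = [rng.gauss(0, 1) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / trials)

for d in (16, 64, 256):
    raw = dot_std(d)
    # Dividing by sqrt(d) should bring the spread back to about 1.
    print(d, round(raw, 2), round(raw / math.sqrt(d), 2))
```

The last column should stay close to 1 for every d, which is the effect the √d_k divisor is meant to have.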
The Value vector's role
Q and K determine HOW MUCH to attend to each position. V determines WHAT information to extract from each position. They are separate projections so the model can learn different representations for "this is what I am" (K) vs "this is what I send" (V). This separation is what gives attention its expressiveness — a token can say "I am highly relevant to your query" (high K·Q) but carry completely different information in V.
Learned projections
W_Q, W_K, W_V are all learned weight matrices. In GPT-3 (d=12288, 96 heads, d_head=128): each of W_Q, W_K, W_V has shape [12288, 128] per head, about 1.57M parameters per matrix, or ~4.7M for Q+K+V combined per head. With 96 heads across 96 layers, the Q, K and V projections alone account for roughly 43 billion parameters. These weights are what the model learns during training, not the attention patterns themselves.
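The arithmetic behind those figures, spelled out with the dimensions quoted above. The count covers only the Q, K and V projection weights and ignores the output projection W_O and any biases.

```python
d_model, n_heads, d_head, n_layers = 12288, 96, 128, 96

per_matrix = d_model * d_head            # one of W_Q, W_K, W_V for one head
per_head_qkv = 3 * per_matrix            # Q + K + V for one head
per_layer_qkv = n_heads * per_head_qkv   # all heads in one layer
total_qkv = n_layers * per_layer_qkv     # all layers

print(f"{per_matrix:,}")     # 1,572,864
print(f"{per_head_qkv:,}")   # 4,718,592
print(f"{per_layer_qkv:,}")  # 452,984,832
print(f"{total_qkv:,}")      # 43,486,543,872
```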
§ 05

Causal mask

Causal masking

Decoder-only models can only look backwards — never forwards

In a language model that generates text left-to-right, a position must not use future tokens when predicting what comes next: that would be cheating. The causal mask enforces this by setting all future attention scores to −∞ before the softmax. Inside the softmax, e^(−∞) is exactly 0, so future positions receive precisely zero attention weight. This mask is applied during every training step so the model learns to predict each token from only previous context.
Why −∞ and not 0?
If you set masked positions to 0, then e^0 = 1 — they still contribute a non-zero weight after softmax. Using −∞ means e^(−∞) = 0, giving exactly zero weight. This is mathematically precise and computationally efficient. The old BERT-era trick of substituting −10,000 for infinity is historical: modern code uses the dtype's minimum value or true −∞, and FlashAttention kernels handle causal masking internally without materialising a mask at all.
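A quick way to see the difference is to mask the same row of scores both ways. The scores and the query position below are made-up values for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0, 0.5]  # one query's raw scores against 4 key positions
query_pos = 1                  # positions 2 and 3 are in the future

# Wrong: a masked score of 0 still becomes e^0 = 1 inside the softmax.
masked_zero = [s if j <= query_pos else 0.0 for j, s in enumerate(scores)]
# Right: a masked score of -inf becomes exactly 0 inside the softmax.
masked_inf = [s if j <= query_pos else -math.inf for j, s in enumerate(scores)]

w_zero = softmax(masked_zero)
w_inf = softmax(masked_inf)
print([round(w, 3) for w in w_zero])  # future positions keep about 0.08 each
print([round(w, 3) for w in w_inf])   # future positions are exactly 0.0
```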
Encoder vs decoder masking
BERT-style encoders use bidirectional attention, with no causal mask: every token attends to every other token. This is why BERT is not suited to left-to-right generation: its representations depend on tokens to the right, which do not exist yet during generation. GPT-style decoders use the causal mask, so each token only attends backwards. This is the architectural difference that lets GPT generate text and keeps BERT from doing so.
Training efficiency: teacher forcing

The causal mask enables training on all positions simultaneously

At training time, we know the entire correct sequence. The causal mask lets the model compute predictions for all positions in one forward pass, in parallel — using the correct previous tokens as context (not the model's own predictions). This is called "teacher forcing." Without the causal mask, you would have to process each position sequentially. With it, training is massively parallelised across the entire sequence length.
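Here is a small sketch of how one training sequence becomes many prediction problems at once. The token strings, including the "<bos>" and "<eos>" markers, are hypothetical placeholders. A real pipeline would use integer token ids.

```python
tokens = ["<bos>", "the", "cat", "sat", "<eos>"]

inputs = tokens[:-1]   # what the model sees
targets = tokens[1:]   # what it must predict at each position
n = len(inputs)

# Causal mask: position i may attend to positions j <= i only.
mask = [[j <= i for j in range(n)] for i in range(n)]

# Every row is a separate prediction problem, all computed in one forward pass.
for i in range(n):
    visible = [tok for tok, allowed in zip(inputs, mask[i]) if allowed]
    print(visible, "->", targets[i])
```

The context on each row is always the correct prefix from the data, never the model's own earlier guesses, which is the teacher-forcing part.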
§ 06

Multi-head

Multi-head attention

Run attention in parallel with different "perspectives" — each learning something different

Instead of one attention computation, multi-head attention runs h parallel attention operations ("heads"), each with its own learned Q, K, V projections. Different heads can learn to attend to different types of relationship: some specialise in syntax, some in semantics, some in position, some in coreference. Their outputs are concatenated and projected back to the original dimension.

How the heads combine

head₁ head₂ head₃ … headₕ → concat → [head₁; head₂; … headₕ] × W_O → output

Each head produces a d_head-dimensional output. h heads concatenated = h × d_head = d_model total. The output projection W_O mixes information from all heads into the final representation.
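The concatenate-then-project step can be sketched as follows. The head outputs are made-up numbers, and W_O is set to the identity purely so the concatenation is visible in the result. A trained W_O is a dense learned matrix.

```python
def matmul(A, B):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def combine_heads(head_outputs, W_o):
    """Concatenate per-head outputs along the feature axis, then apply W_O."""
    n_tokens = len(head_outputs[0])
    concat = [[x for head in head_outputs for x in head[t]] for t in range(n_tokens)]
    return matmul(concat, W_o)

# Two heads, two tokens, d_head = 2, so d_model = 2 * 2 = 4.
head1 = [[1.0, 2.0], [3.0, 4.0]]
head2 = [[5.0, 6.0], [7.0, 8.0]]
W_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity, for illustration

out = combine_heads([head1, head2], W_o)
print(out)  # [[1.0, 2.0, 5.0, 6.0], [3.0, 4.0, 7.0, 8.0]]
```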
Grouped Query Attention (GQA)
Modern models (Llama 3, Mistral) use GQA: multiple Query heads share a single Key+Value head. Llama 3 70B: 64 Q heads, 8 KV heads — each KV head is shared by 8 Q heads. This reduces the KV cache size by 8× at inference time with minimal quality loss, making 128K+ context windows practical.
Multi-Query Attention (MQA)
The extreme version: all Q heads share one K+V pair. Used in Falcon and early Gemini. Even smaller KV cache — but can hurt quality for tasks requiring diverse attention patterns. GQA (used in Llama) is the compromise: groups of Q heads sharing KV, offering most of the memory benefit with less quality loss.
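To make the memory trade-off concrete, the sketch below maps query heads to KV heads and estimates KV cache size for standard multi-head attention, GQA and MQA. Several things here are assumptions of the example rather than statements from this module: the 80-layer depth and d_head = 128 used for Llama 3 70B, 2 bytes per cached value (fp16/bf16), a 128K-token sequence, and the contiguous grouping in kv_head_for, which is one common convention.

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Which KV head a query head reads from, assuming contiguous groups."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V (hence the factor 2) for one sequence."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

n_layers, n_q_heads, d_head, seq_len = 80, 64, 128, 128 * 1024  # assumed figures

mha = kv_cache_bytes(n_layers, n_kv_heads=64, d_head=d_head, seq_len=seq_len)
gqa = kv_cache_bytes(n_layers, n_kv_heads=8, d_head=d_head, seq_len=seq_len)
mqa = kv_cache_bytes(n_layers, n_kv_heads=1, d_head=d_head, seq_len=seq_len)

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(name, size / 2**30, "GiB")  # 320.0, 40.0 and 5.0 GiB

# Query heads 0..7 share KV head 0, heads 8..15 share KV head 1, and so on.
print([kv_head_for(q, n_q_heads, 8) for q in (0, 7, 8, 63)])  # [0, 0, 1, 7]
```

Under these assumptions the 8× saving from GQA is the difference between a cache that cannot fit on a single accelerator and one that can.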
§ 07

Full formula

The complete attention formula

Everything in one equation — and what each part does

The full scaled dot-product attention formula is deceptively compact. Every element has a precise reason for being there. Walk through it component by component using the stepper below.
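For reference, the standard form of the formula is written out below. The additive mask term M is notation introduced here to cover the causal case. For a bidirectional encoder, M is all zeros.

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + M \right) V,
\qquad
M_{ij} =
\begin{cases}
  0       & j \le i \\
  -\infty & j > i
\end{cases}
```

Reading from the inside out: QKᵀ scores every query against every key, the division by √d_k keeps those scores in a range where softmax stays smooth, M removes future positions, the row-wise softmax turns scores into weights that sum to 1, and the product with V takes the weighted sum of value vectors.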
→ Layer 6: Output head
The softmax in attention weights Values. The softmax in the output head (Layer 6) selects the next token. Same operation, completely different purpose — one weights information, one picks a word.
→ Layer 7: KV cache
The K and V matrices for all past tokens are cached at inference time. Each new decode step only computes K and V for the one new token — everything else is reused from cache. This makes each decode step O(N) instead of O(N²); generating N tokens still totals O(N²), but the cache removes the redundant recomputation that would make it O(N³).
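The complexity claim can be checked by simply counting query-key score computations. This sketch counts only attention scores and assumes the uncached variant recomputes the full t × t score matrix at step t. The function names are hypothetical.

```python
def score_ops_with_cache(n_tokens):
    """Step t scores one new query against t cached keys."""
    return sum(t for t in range(1, n_tokens + 1))

def score_ops_without_cache(n_tokens):
    """Step t recomputes all t x t scores for the whole prefix."""
    return sum(t * t for t in range(1, n_tokens + 1))

n = 1000
print(f"{score_ops_with_cache(n):,}")     # 500,500      (grows like N^2)
print(f"{score_ops_without_cache(n):,}")  # 333,833,500  (grows like N^3)
```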
→ Layer 1: Positional encoding
RoPE (used in Llama 3) works by rotating the Q and K vectors before the dot product. The rotation angle encodes position, so Q·K naturally captures relative distance. Position is baked into the attention computation itself.
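The relative-distance property can be seen with a single 2D rotation. This is a simplified sketch: real RoPE applies such rotations to many pairs of dimensions, each with its own frequency, and the angle step theta = 0.1 and the vectors below are arbitrary choices for illustration.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2D vector by an angle proportional to its position."""
    x, y = vec
    angle = position * theta
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)

near = dot(rotate(q, 3), rotate(k, 1))      # positions 3 and 1: offset 2
far = dot(rotate(q, 103), rotate(k, 101))   # positions 103 and 101: offset 2 again
other = dot(rotate(q, 3), rotate(k, 2))     # positions 3 and 2: offset 1

print(near, far, other)  # near and far agree; other differs
```

The score depends on how far apart the two positions are, not on where in the sequence they sit.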
→ Layer 4: Transformer block
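Attention is one sublayer inside the transformer block. In the original post-norm layout, attention is followed by Add & Norm (residual + LayerNorm), then the Feed-Forward Network, then another Add & Norm. Many later models, including Llama, normalise before each sublayer instead. Either way, the attention output is added to the input (residual), not replacing it.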
§ 08

Further reading.
