The Transformer Module 22 8 min ▶ Narrated ⌘ Playground

Q, K, V - the projections.

The matrices that turn embeddings into queries, keys, values.

Prerequisites·12 · Attention Modalities
Q / K / V hero illustration
Narration · Module 22
Q / K / V
0:00 / –:––
§ 01

Core ideas

THREE HATS

One embedding, three learned projections

Each token's embedding x is multiplied by three trained matrices: Q = xW_q, K = xW_k, V = xW_v. The same token wears three hats: Query — what am I looking for? Key — what can others find me by? Value — what content do I hand over if someone attends to me?
INTUITION

A library lookup, vectorized

Think of a card catalog: your query is the topic you walk in with, each book's key is its index card, and the value is the book's contents. Attention scores are Q·Kᵀ — how well your request matches each card — and the output is a weighted blend of the matching books' contents. Example: the query from "it" in "the cat drank because it was thirsty" lands hardest on the key of "cat", so cat's value flows into it's new representation.
WHY SEPARATE

Separate W_q and W_k make matching asymmetric

If tokens compared raw embeddings, 'similar' would be the only relationship the model could express. Separate projections let "it" seek nouns without being a noun itself — the matrices learn a lookup language where what-I-want and what-I-offer are different subspaces. This asymmetry is most of attention's expressive power.
HEADS

Multi-head = many small QKV sets in parallel

A 4096-dim model with 32 heads runs 32 QKV projections of 128 dims each. One head tracks syntax, another coreference, another positional patterns — then their outputs concatenate and pass through W_o. Cheap parallel specialists beat one expensive generalist. 2026 wrinkle: in practice models use grouped-query attention — Llama-3-8B runs 32 Q heads but only 8 shared K/V heads, shrinking the KV cache 4× (module 12a covers why).
Q / K / V spotlight illustration
§ 02

The lesson

QKV Matrix Projection Before a token can pay attention to other words, it must split its single dense embedding into three distinct mathematical roles: Query (Q), Key (K), and Value (V). It does this by multiplying its embedding vector against three massive learned weight matrices ($W_Q$, $W_K$, $W_V$).
§ 03

The playground.

Theory above, instrument below. This interactive panel runs live in the page — drag, type, and watch the mechanism respond.

Playground · Q / K / VOpen full screen ↗
§ 04

Further reading.

Done with Q / K / V?
Mark it complete — progress is saved in your browser and shows on the course map.