The Transformer Module 19 8 min ▶ Narrated

Encoder-decoder,
revisited.

A return to the encoder-decoder pattern.

Prerequisites·07 · Architectures + 12 · Attention Modalities
Encoder-Decoder hero illustration
Narration · Module 19
Encoder-Decoder
0:00 / –:––
§ 01

Core ideas

THE SPLIT

Read fully, then write fully

An encoder-decoder model has two stacks: the encoder reads the whole input with bidirectional attention (every token sees every other), producing a contextual memory; the decoder writes the output autoregressively, attending both to what it has written (causal self-attention) and to the encoder's memory (cross-attention).
VS DECODER-ONLY

Why GPT dropped the encoder — and who kept it

Decoder-only models (GPT, Llama) fold reading and writing into one stack: simpler to scale, and one architecture handles any task framed as continuation. Encoder-decoder (T5, Whisper, translation models) still wins when input and output are different kinds of things — audio→text, document→summary — because the encoder can be bidirectional and the modalities stay cleanly separated.
CROSS-ATTENTION

Cross-attention is the bridge

In the decoder, queries come from the output-so-far while keys and values come from the encoder's output. Translating "the green house""la casa verde": when writing verde, the decoder's query attends to the encoder's green. Same QKV math as module 22 — only the source of K and V changes. This exact pattern returns in diffusion models, where image latents cross-attend to text prompts (23).
TODAY

The pattern outlived the architecture

Pure encoder-decoder LLMs are now rare, but the idea is everywhere: Whisper is enc-dec (audio encoder, text decoder), every text-to-image diffusion model conditions on a text encoder, and many multimodal LLMs bolt vision encoders onto decoder stacks — though the newest omni models increasingly use early fusion, feeding image tokens straight into the decoder. Learn the pattern once, recognize it in every modality.
Encoder-Decoder spotlight illustration
§ 02

The lesson

Inspector vs. Storyteller Transformers process sequences differently based on their architecture. Encoders (like BERT) look at the entire sentence at once in both directions, making them great at analyzing text. Decoders (like GPT) are causally masked, meaning they can only see the past and must generate the future one word at a time, making them master storytellers.
The Inspector (Encoder)

Bidirectional Attention. Reads left-to-right AND right-to-left simultaneously.

The Storyteller (Decoder)

Autoregressive Causally Masked Attention. Only sees the past.

§ 03

Further reading.

Done with Encoder-Decoder?
Mark it complete — progress is saved in your browser and shows on the course map.