The Transformer · Module 07 · 10 min

Encoder, decoder, encoder-decoder.

Why GPT looks one way, BERT another, T5 a third.

Prerequisites·None Modalities
§ 01

Full transformer

The original 2017 design

The encoder-decoder transformer — built for translation

The original transformer (Vaswani et al. 2017) was designed for machine translation: take a sentence in German, understand it fully (encoder), then generate an English sentence word-by-word (decoder). This required two separate stacks of transformer blocks: the encoder with full bidirectional attention, and the decoder with causal attention plus a cross-attention bridge to the encoder.
Cross-attention — the bridge

How the decoder reads the encoder's understanding

In cross-attention, the decoder generates Queries (Q) from its own tokens, but the Keys (K) and Values (V) come from the encoder's output. This lets the decoder ask "which part of the input is relevant right now?" at every generation step. Cross-attention is used in T5, BART, and Whisper. In Whisper, the encoder reads audio and the decoder generates the text transcript while cross-attending to the encoder output.
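The Q/K/V split described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the layout of any particular model: the function name `cross_attention`, the weight matrices, and the toy dimensions are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    """Single-head cross-attention: Q from the decoder, K and V from the encoder."""
    Q = decoder_states @ W_q                   # (tgt_len, d_k)
    K = encoder_output @ W_k                   # (src_len, d_k)
    V = encoder_output @ W_v                   # (src_len, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)         # each decoder row sums to 1 over source tokens
    return weights @ V, weights

# Toy sizes (illustrative only): 5 source tokens, 3 target tokens so far.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
encoder_output = rng.normal(size=(5, d_model))
decoder_states = rng.normal(size=(3, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(context.shape)  # (3, 4)
print(weights.shape)  # (3, 5)
```

Note that no causal mask appears here: every decoder position may look at every encoder position, because the whole input is already known.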
§ 02

Encoder-only

Encoder-only models

Understanding-first — BERT and the bidirectional family

Encoder-only models remove the decoder entirely. They use bidirectional attention: every token can attend to every other token, in both directions at once. This gives each token a deeply contextual representation that depends on the full input. These models excel at understanding tasks: classification, named entity recognition, extractive question answering, semantic similarity, and search. Encoders are far from dead in 2026; they survive as the embedding models and rerankers inside most RAG stacks.
BERT pretraining objectives
Masked Language Modelling (MLM): Randomly select 15% of tokens, replace most of them with [MASK], and predict the originals from context on both sides. This trains bidirectional representations, which causal masking rules out.

Next Sentence Prediction (NSP): Given two sentences A and B, predict whether B follows A in the original text. Teaches inter-sentence relationships. (Later shown to be less important than MLM.)
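To make the MLM objective concrete, here is a simplified sketch of how a masked training example might be built. It assumes whitespace tokens and replaces every selected position with [MASK]; BERT's actual recipe also leaves some selected tokens unchanged or swaps in random ones, which is omitted here. The helper name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so the loss is computed only where the model has to fill in a blank.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked, labels = [], []
    for i, tok in enumerate(tokens):
        if i in positions:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the cat sat on the mat because it was tired of standing".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

With 12 input tokens and a 15% rate, two positions are masked. Which two depends on the seed.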
Special tokens
[CLS]: prepended to every input. Its final output vector is used for classification tasks. ViT's class token is borrowed from it.

[MASK]: replaces masked tokens during MLM pretraining.

[SEP]: separates two segments (e.g. question vs passage in QA). Tells the model where one text ends and another begins.
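The layout of a two-segment BERT input can be sketched as follows. Whitespace splitting stands in for BERT's real subword tokenizer, and `build_pair_input` is a hypothetical helper written for this example.

```python
def build_pair_input(question_tokens, passage_tokens):
    """Lay out a two-segment input as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the passage and the final [SEP].
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    return tokens, segment_ids

question = "who wrote hamlet".split()
passage = "hamlet was written by shakespeare".split()
tokens, segment_ids = build_pair_input(question, passage)
print(tokens)
# ['[CLS]', 'who', 'wrote', 'hamlet', '[SEP]', 'hamlet', 'was', 'written', 'by', 'shakespeare', '[SEP]']
print(segment_ids)
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

For classification, the model's output vector at position 0 (the [CLS] slot) is what gets passed to the classifier head.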
Key encoder-only models
BERT-base (2018) — 110M params, 12 layers, 12 heads. The model that popularised pretraining for NLP. Google Search uses BERT variants to understand queries.

RoBERTa (2019) — BERT trained more carefully: more data, longer training, larger batches, no NSP. Consistently outperforms original BERT. Became the standard baseline.

DeBERTa (2020) — Disentangled attention: uses separate vectors for content and position. Microsoft. Often best-in-class on NLU benchmarks.

ModernBERT (2024) — Answer.AI and LightOn's updated BERT architecture with flash attention, RoPE, and an 8,192 token context window. Trained on 2T tokens. Represents the current state of encoder-only models.
§ 03

Decoder-only

Decoder-only models

Generation-first — GPT and the causal family

Decoder-only models remove the encoder entirely. They use only causal (masked) self-attention: each token can attend only to itself and the tokens before it. This makes them natural text generators: given a context, they predict the next token, then the next, building up a response one token at a time. Nearly all modern frontier assistant models use this design.
The causal mask
The causal mask adds −∞ to every attention score above the diagonal (the strictly upper-triangular entries) before softmax. After softmax those entries become exactly 0, so no position can attend to a later one and no information flows from future tokens back to earlier ones. During training, all positions are computed in parallel in one forward pass. At inference, tokens are generated one at a time, each step extending the context by one token.
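A small NumPy sketch shows the mask in action. The raw scores are set to zero purely so the effect of the mask is easy to read; the function names are illustrative.

```python
import numpy as np

def causal_mask(n):
    # -inf strictly above the diagonal, 0 on and below it.
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))            # uniform raw scores, for readability
weights = softmax(scores + causal_mask(n))
print(np.round(weights, 2))
```

Row i of `weights` spreads its attention evenly over positions 0..i and puts exactly 0 on every later position: row 0 attends only to itself, row 1 splits 0.5/0.5, and so on.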
The prompt IS the context
Unlike encoder-decoder models, which have a separate encoding phase, decoder-only models treat the prompt as the start of the sequence they go on to extend. The model attends back to the prompt tokens causally. This is why "system prompts" work: they are just tokens that appear before the user's input in the same sequence.
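The idea that prompt and output live in one sequence can be sketched with a toy generation loop. Everything here is a stand-in: `toy_next_token` is a hypothetical lookup table, not a trained model, and the role markers are made up for the example.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int,
             eos: str = "<eos>") -> List[str]:
    sequence = list(prompt)            # the prompt is simply the start of the sequence
    for _ in range(max_new_tokens):
        tok = next_token(sequence)     # the "model" sees everything produced so far
        if tok == eos:
            break
        sequence.append(tok)           # context grows by one token per step
    return sequence

# Hypothetical stand-in for a trained model.
REPLIES = {"hello": "hi", "hi": "there"}

def toy_next_token(sequence: List[str]) -> str:
    tok = REPLIES.get(sequence[-1].lower(), "<eos>")
    # Crude stand-in for attention over the prompt: an earlier token changes later outputs.
    if "shout" in sequence and tok != "<eos>":
        tok = tok.upper()
    return tok

user = ["user:", "hello"]
plain = generate(toy_next_token, ["system:", "reply"] + user, max_new_tokens=10)
loud = generate(toy_next_token, ["system:", "shout"] + user, max_new_tokens=10)
print(plain)  # ['system:', 'reply', 'user:', 'hello', 'hi', 'there']
print(loud)   # ['system:', 'shout', 'user:', 'hello', 'HI', 'THERE']
```

The system tokens are not handled by any special mechanism. They change the output only because they sit earlier in the same sequence that the next-token function reads.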
Key decoder-only models
GPT-2 (2019): 1.5B params, 48 layers. OpenAI. One of the first models large enough to produce fluent multi-paragraph text. OpenAI initially withheld the full model over misuse concerns and released it in stages; it is now considered tiny.

GPT-3 (2020): 175B params. Few-shot learning emerged at this scale. The model that made people take LLMs seriously as general-purpose tools.

Llama 3 (2024): Meta. Open weights. 8B and 70B variants at launch, with 405B added in Llama 3.1. Over 15T training tokens, RoPE positional encoding, GQA. A dominant open-weights base model family.

Gemma 3 (2025): Google. Small efficient models (1B–27B). Designed for consumer hardware. Strong on instruction following at small scale.

Mistral / Mixtral: Mistral AI (France). Mistral 7B uses sliding window attention (SWA); Mixtral 8×7B uses mixture of experts (MoE), routing each token through 2 of 8 expert FFNs in every layer.
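Sliding window attention can be pictured as a causal mask with a limited look-back. The sketch below builds such a mask in NumPy; the function name and the tiny window size are assumptions chosen for readability, not Mistral's actual configuration.

```python
import numpy as np

def sliding_window_mask(n, window):
    """Additive mask: position i may attend to positions j with i - window < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    allowed = (j <= i) & (j > i - window)
    return np.where(allowed, 0.0, -np.inf)

mask = sliding_window_mask(n=5, window=3)
print(mask)
# Visible positions per row (finite entries):
print(np.isfinite(mask).sum(axis=1))  # [1 2 3 3 3]
```

Once the sequence is longer than the window, each row has the same number of visible positions, so the attention cost per token stops growing with sequence length.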
Mixture of Experts in 2026

MoE is now the frontier default

Mixtral was the open pioneer, but by 2026 mixture-of-experts is the default at the frontier rather than the exception. DeepSeek-V3 and R1, Llama 4, and most frontier models route each token through a small subset of expert FFNs, so only a fraction of the total parameters are active per token. The result: total parameter counts in the hundreds of billions or trillions, with the inference cost of a much smaller dense model. Architecturally these are still decoder-only transformers — MoE swaps the feed-forward sublayer for a routed bank of experts, leaving attention untouched.
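The routing step can be sketched as follows. This is a minimal top-2 router over eight toy experts, assuming a linear gate and a softmax over the selected experts' logits; real systems add load balancing and batched expert execution, which are left out. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def make_expert(rng, d, hidden):
    # Each expert is a small two-layer ReLU feed-forward network with its own weights.
    W1 = rng.normal(size=(d, hidden))
    W2 = rng.normal(size=(hidden, d))
    return lambda v: np.maximum(v @ W1, 0.0) @ W2

def moe_forward(x, W_gate, experts, top_k=2):
    """x: (n_tokens, d). Each token is processed by only top_k of the experts."""
    logits = x @ W_gate                              # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of the chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = softmax(logits[t, top[t]])           # weights over the chosen experts only
        for g, e in zip(gates, top[t]):
            out[t] += g * experts[e](x[t])           # unchosen experts are never evaluated
    return out, top

rng = np.random.default_rng(0)
d, hidden, n_experts = 8, 16, 8
experts = [make_expert(rng, d, hidden) for _ in range(n_experts)]
W_gate = rng.normal(size=(d, n_experts))
x = rng.normal(size=(4, d))

out, top = moe_forward(x, W_gate, experts)
print(out.shape)  # (4, 8)
print(top.shape)  # (4, 2)
```

In this toy setup each token touches 2 of 8 expert networks, so roughly a quarter of the expert parameters are active per token, while attention (not shown) is unchanged.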
§ 04

Why decoder wins

The convergence of the field on decoder-only

Why almost every frontier model is decoder-only

In 2018–2020, the field was divided: BERT-family for understanding, GPT-family for generation, T5-family for both. By 2023, virtually every major assistant model (GPT-4, Claude, Gemini, Llama, Mistral) was decoder-only. Here is why.
The key insight

Decoder-only + RLHF = everything

The breakthrough was realising that a decoder-only model, after pretraining on next-token prediction, can be converted into a useful assistant through SFT and RLHF — without any architecture changes. The prompt becomes the "encoder input" — the model attends to it causally before generating. This simplicity means all the engineering effort can go into scaling one architecture rather than maintaining two. The field converged on decoder-only as a result.
§ 05

Model catalogue

The model landscape (snapshot: early 2025)

Key models from each architecture family

Every model below is a transformer at heart — the differences are in which components they use, what data they trained on, and what objectives they optimised for.
Encoder-only
BERT · Google · 2018 · 110M · MLM · bidirectional
RoBERTa · Meta · 2019 · 125M · improved BERT · no NSP
DeBERTa v3 · Microsoft · 2021 · 86M · disentangled · SOTA NLU
ModernBERT · Answer.AI + LightOn · 2024 · 8K context · RoPE · 2T tokens

Encoder-decoder
T5 · Google · 2019 · 60M–11B · text-to-text · relative pos
BART · Meta · 2019 · 400M · denoising · summarisation
FLAN-T5 · Google · 2022 · instruction tuned · 1800+ tasks
Whisper · OpenAI · 2022 · 39M–1.55B · speech · 99 languages

Decoder-only
GPT-3 · OpenAI · 2020 · 175B · few-shot · closed
GPT-4o · OpenAI · 2024 · multimodal · omni · closed
Llama 3.3 · Meta · 2024 · 70B · open weights · RoPE
Claude 3.7 · Anthropic · 2025 · reasoning · 200K ctx · closed
Gemini 2.5 · Google · 2025 · multimodal · 1M ctx · closed
Mistral 7B · Mistral AI · 2023 · 7B · SWA · open
DeepSeek-R1 · DeepSeek · 2025 · reasoning · GRPO · open
Qwen 2.5 · Alibaba · 2024 · 0.5B–72B · multilingual · open
Gemma 3 · Google · 2025 · 1B–27B · consumer HW · open