Running Models Module 17 8 min ▶ Narrated

The full pipeline as
one highway.

Where every previous module fits.

Prerequisites·Foundation Modalities
The Highway hero illustration
Narration · Module 17
The Highway
0:00 / –:––
§ 01

Core ideas

THE HIGHWAY

One road from bytes to behavior

Every module in this course is a segment of a single highway: text → tokens → embeddings → N transformer blocks → logits → softmax → next token, looped until a stop token. Pretraining builds the road, post-training (RLHF/DPO) adds the guardrails, and inference is the traffic.
Raw textTokenizer (21)Embeddings (05)Attention + FFN ×N (12, 13)LogitsSoftmax + sampling (10)Next token → loop
Pipeline — steps light up in order
TWO LOOPS

Training loop vs. inference loop

Training: show the model trillions of tokens, compare every predicted next-token against the truth, nudge weights via cross-entropy loss (10, 02). Inference: freeze the weights, feed a prompt, sample one token, append it, repeat (08). Same highway, opposite directions — one writes the map, the other drives it.
WHERE THINGS PLUG IN

Every later topic is a bolt-on to this spine

LoRA (18) swaps small adapter weights into the blocks. RAG (20) edits the prompt before the highway starts. KV caching (08) memoizes the attention math between loop iterations. Vision models (04) just convert patches — instead of words — into the same token stream. If you can place a technique on the highway, you understand it.
2026 ON-RAMPS

Agents and thinking tokens extend the same road

Two extensions define 2026. The agent loop (29): the model emits a tool call instead of an answer, the result is appended to the context, and the highway runs again — on-ramps and off-ramps, same road. Thinking tokens (30): reasoning models drive extra laps before taking the exit — generating intermediate tokens the user never sees, trading inference compute for accuracy.
The Highway spotlight illustration
§ 02

The lesson

The Inference Highway Generating tokens requires running the entire transformer neural network for every single word. See why native generation is slow, how the KV Cache prevents redundant work, and how Speculative Decoding uses a smaller model to predict the future.
§ 03

Further reading.

Done with The Highway?
Mark it complete — progress is saved in your browser and shows on the course map.