Running Models Module 20 12 min ▶ Narrated ⌘ Playground

Retrieval-augmented
generation.

Semantic search, hybrid retrieval, GraphRAG.

Prerequisites·05 · Embeddings Modalities
RAG + Prompting hero illustration
Narration · Module 20
RAG + Prompting
0:00 / –:––
§ 01

Core ideas

WHY RAG

Retrieval beats retraining for facts

A model's weights freeze at training time and hallucinate beyond them. Retrieval-Augmented Generation fetches relevant documents at question time and pastes them into the prompt, so the model reasons over fresh, citable evidence. Updating knowledge becomes a database insert, not a training run.
QuestionEmbed query (05)Vector search top-kRe-rankStuff context into promptLLM answers with citations
Pipeline — steps light up in order
HYBRID

Dense + keyword beats either alone

Embedding search finds semantic matches ('cardiac arrest' ≈ 'heart stopped') but misses exact identifiers; BM25 keyword search nails part numbers and names but misses paraphrase. Production systems run both and fuse with reciprocal-rank fusion, then a cross-encoder re-ranks the final shortlist. Each stage exists because the previous one is fast but sloppy.
GRAPHRAG

GraphRAG: retrieve relationships, not just chunks

Chunk retrieval fails on questions that span documents ('how are supplier X and outage Y connected?'). GraphRAG pre-extracts an entity-relationship graph plus community summaries, so retrieval can walk connections instead of matching paragraphs. Costlier to build, dramatically better at multi-hop and 'summarize everything about…' questions.
FAILURE MODES

Most 'RAG is broken' is chunking or evaluation

Chunks that split tables mid-row, top-k too small for the question, context stuffed past the model's effective attention, no grounding instruction in the prompt — these cause most failures. Measure retrieval (did the right passage arrive?) separately from generation (did the model use it?), or you will fix the wrong half.
RAG + Prompting spotlight illustration
§ 02

The lesson

Retrieval-Augmented Generation (RAG) Large Language Models are frozen in time the day they finish training. If you ask them about proprietary data or recent events, they confidently hallucinate. RAG solves this by turning the system into an "open-book test". But first, data must be systematically parsed, chunked, transformed into mathematical vectors, and pushed into a dense Vector Database.
2026

Agentic retrieval — the model drives the search

The single-shot embed → top-k → stuff pipeline is now the baseline, not the state of the art. In agentic retrieval the model issues iterative searches and greps, reads results, and decides what to fetch next — looping until it has what it needs. The long-context-vs-RAG decision is now a real engineering choice: long context wins when one bounded corpus can be read whole into the prompt; RAG wins for fresh, large, or citation-requiring corpora. And MCP (Model Context Protocol) is how 2026 models attach to data sources in the first place.
Then
One-shot vector RAG: embed, top-k, stuff the prompt (2023–24)
Now · June 2026
Hybrid retrieval + rerank as the floor; agentic multi-step retrieval on top
§ 03

The playground.

Theory above, instrument below. This interactive panel runs live in the page — drag, type, and watch the mechanism respond.

Playground · RAG + PromptingOpen full screen ↗
§ 04

Further reading.

Done with RAG + Prompting?
Mark it complete — progress is saved in your browser and shows on the course map.