Cross-Attention — AI Learning Course

§ 01

The lesson

Teaching the LLM to 'See' with Image Patches

The Multimodal Bridge

How Text and Vision Connect

Language models operate on 1D sequences of tokens. To process an image, a Vision Transformer (ViT) first slices it into a grid of fixed-size "patches". Each patch is flattened and linearly projected into a dense vector (just like a word embedding!).
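The patch step can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular ViT's code: the 224x224 image, 16-pixel patches, 512-d embedding, and the random matrix `W_embed` (standing in for a learned projection) are all assumptions chosen for the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns shape (num_patches, patch * patch * C), in row-major grid order.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # stand-in for a real RGB image
patches = patchify(image, patch=16)    # 14 x 14 grid of 16x16x3 patches

# Hypothetical learned projection: random weights stand in for trained ones.
W_embed = rng.normal(scale=0.02, size=(16 * 16 * 3, 512))
patch_embeddings = patches @ W_embed   # one dense vector per patch

print(patches.shape, patch_embeddings.shape)
```

A real ViT would also add position embeddings so the model knows where each patch sat in the grid; that step is left out here.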

In 2026 VLMs, those patch vectors pass through a small learned projector and are inserted directly into the text token stream (the LLaVA pattern) or fused into one stream from the very first layer. The historical alternative is Flamingo-style gated cross-attention (2022): extra layers interleaved with the frozen LLM blocks, in which the LLM's hidden states act as Queries and the image features supply the Keys and Values. Cross-attention itself lives on in Whisper-style audio models and some video models.
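As a rough sketch of the projector pattern, the snippet below maps vision features into the LLM's embedding width with a two-layer MLP and splices them into a text embedding sequence. The function names `project` and `splice`, the dimensions, and the random weights are all assumptions for illustration; real systems learn these weights and mark the insertion point with a special image placeholder token.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def project(patch_features, W1, b1, W2, b2):
    """Two-layer MLP mapping vision features to the LLM embedding width."""
    return gelu(patch_features @ W1 + b1) @ W2 + b2

def splice(text_embeds, image_embeds, image_pos):
    """Insert image embeddings into the text sequence at index image_pos."""
    return np.concatenate(
        [text_embeds[:image_pos], image_embeds, text_embeds[image_pos:]], axis=0
    )

rng = np.random.default_rng(0)
d_vision, d_model = 64, 128                     # illustrative sizes only
patch_features = rng.normal(size=(9, d_vision)) # 3x3 grid of patch vectors
text_embeds = rng.normal(size=(5, d_model))     # 5 text token embeddings

W1 = rng.normal(scale=0.02, size=(d_vision, d_model))
b1 = np.zeros(d_model)
W2 = rng.normal(scale=0.02, size=(d_model, d_model))
b2 = np.zeros(d_model)

image_tokens = project(patch_features, W1, b1, W2, b2)
sequence = splice(text_embeds, image_tokens, image_pos=2)
print(sequence.shape)  # 5 text + 9 image positions, all d_model wide
```

From here the LLM treats all 14 positions as ordinary tokens, so no extra attention layers are needed.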

Sitting between the two eras is BLIP-2's Q-Former: a small transformer with a fixed set of learned queries (32 in the paper) that compresses an image's features into just a handful of tokens for the LLM.
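The core compression idea can be illustrated with a single cross-attention step: a fixed set of query vectors attends over however many image features arrive, so the output length never changes. This is a heavily simplified sketch, not BLIP-2's architecture (the real Q-Former is a multi-layer transformer whose queries also self-attend and interact with text); `compress`, the sizes, and the random weights are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress(queries, image_feats, Wq, Wk, Wv):
    """One cross-attention step: fixed queries pool a variable number of features."""
    q = queries @ Wq
    k = image_feats @ Wk
    v = image_feats @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (num_queries, num_feats)
    return weights @ v                                  # (num_queries, d)

rng = np.random.default_rng(0)
d, num_queries = 64, 32
queries = rng.normal(size=(num_queries, d))  # learned parameters in a real model
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

# Two hypothetical inputs with different numbers of image features.
small = compress(queries, rng.normal(size=(257, d)), Wq, Wk, Wv)
large = compress(queries, rng.normal(size=(577, d)), Wq, Wk, Wv)
print(small.shape, large.shape)  # same output length for both inputs
```

The LLM's context cost is therefore set by the number of queries, not by the image resolution.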

Then

Flamingo (2022): visual features injected into a frozen LLM through gated cross-attention layers.

Now · June 2026

Decoder-only VLMs insert projected image tokens straight into the context (LLaVA-style) or train early-fusion from scratch; cross-attention survives mainly in Whisper-style audio and some video models.

Interactive: The Cross-Attention Engine

Click "Feed Image & Ask Question". Watch as the image is tokenised into patches, and the LLM explicitly "looks" at specific visual regions to gather evidence for answering the prompt: "What animal is in the photo?"
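What the demo animates can be approximated in code. The sketch below is a single-head, Flamingo-flavoured gated cross-attention step under simplifying assumptions: random weights instead of trained ones, no multi-head split, and no gated feed-forward layer. The name `gated_cross_attention` and all sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_h, patch_h, Wq, Wk, Wv, gate):
    """Text hidden states query the image patches; tanh(gate) scales the update."""
    q = text_h @ Wq    # Queries come from the language side
    k = patch_h @ Wk   # Keys come from the image patches
    v = patch_h @ Wv   # Values come from the image patches
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_text, n_patches)
    out = text_h + np.tanh(gate) * (weights @ v)       # gated residual update
    return out, weights

rng = np.random.default_rng(0)
d = 32
text_h = rng.normal(size=(6, d))    # e.g. tokens of "What animal is in the photo?"
patch_h = rng.normal(size=(16, d))  # 4x4 grid of patch vectors
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

# With the gate at zero the language states pass through unchanged.
closed, _ = gated_cross_attention(text_h, patch_h, Wq, Wk, Wv, gate=0.0)
opened, weights = gated_cross_attention(text_h, patch_h, Wq, Wk, Wv, gate=1.0)

most_attended = weights.argmax(axis=1)  # patch index each text token looks at most
print(np.allclose(closed, text_h))
print(most_attended)
```

Each row of `weights` is a distribution over the 16 patches, which is the quantity an attention heatmap would display. Starting the gate at zero is what lets a frozen LLM keep its original behaviour at the beginning of training.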

Early Fusion

Any-to-Any Native Models

The visualization above shows cross-attention (like Whisper). Models such as GPT-4o and Gemini 1.5 are instead described by their developers as natively multimodal: audio, vision, and text tokens are mixed into the same stream from Layer 0, and the transformer applies self-attention across all modalities at once. This allows real-time video/audio conversations with low latency, because no separate per-modality pipeline sits in front of the model.
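A toy version of the single-stream idea: tokens from three modalities are concatenated into one sequence and passed through ordinary causal self-attention. This is only a sketch under assumed sizes and random weights; the internals of production models are not public, and the interleaving order used here is arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over one mixed-modality sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # positions not yet seen
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 32
image_tokens = rng.normal(size=(4, d))
audio_tokens = rng.normal(size=(3, d))
text_tokens = rng.normal(size=(5, d))

# One stream, one attention mechanism: no modality gets its own pathway.
stream = np.concatenate([image_tokens, audio_tokens, text_tokens], axis=0)
modality = ["image"] * 4 + ["audio"] * 3 + ["text"] * 5

Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
fused, weights = causal_self_attention(stream, Wq, Wk, Wv)

# Attention mass the final text token places on each modality.
last = weights[-1]
for name in ("image", "audio", "text"):
    mass = sum(w for w, m in zip(last, modality) if m == name)
    print(name, round(float(mass), 3))
```

The final text token spreads its attention over image, audio, and text positions alike, which is the practical difference from a design where vision enters only through dedicated cross-attention layers.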

The CLIP Vector

Semantic Alignment

How does the LLM know an image patch of a "dog" relates to the text token "dog"? During pretraining, systems like CLIP train an image encoder and a text encoder on 400 million image-caption pairs using a contrastive loss (the billions came later, with LAION-scale datasets and SigLIP successors). The loss pulls matching images and captions close together in a shared latent space and pushes mismatched pairs apart. A documented modality gap remains, though: image and text embeddings settle into offset cones rather than identical coordinates.
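The contrastive objective can be written compactly. Below is a minimal NumPy sketch of a symmetric, CLIP-style loss over a batch of paired embeddings; the random vectors stand in for encoder outputs, the batch size and width are arbitrary, and the temperature is fixed at 0.07 here although CLIP learns it during training.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of each input is assumed to be a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # cosine similarities, scaled
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> correct caption
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # caption -> correct image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 32))
# Captions that sit close to their images, versus the same captions mis-paired.
text_emb = image_emb + 0.01 * rng.normal(size=(8, 32))
shuffled_text = np.roll(text_emb, shift=1, axis=0)

aligned_loss = clip_loss(image_emb, text_emb)
shuffled_loss = clip_loss(image_emb, shuffled_text)
print(aligned_loss < shuffled_loss)
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is the pressure that produces the shared latent space.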

§ 02

The playground.

Theory above, instrument below. This interactive panel runs live in the page — drag, type, and watch the mechanism respond.


§ 03

Text and vision,
connected.