How Text and Vision Connect
Language models only understand 1D sequences of tokens. To process an image, a Vision Transformer (ViT) first slices the image into a grid of "patches". Each patch is compressed into a dense vector (just like a word embedding!).
In 2026 VLMs, those patch vectors pass through a small learned projector and are inserted directly into the text token stream — the LLaVA pattern — or fused into one stream from the very first layer. The historical alternative is Flamingo-style gated Cross-Attention (2022): the LLM sends out a Query to the image patches, retrieving visual Key/Values at each layer. That pattern lives on in Whisper-style audio models and some video models.
Bridging the two eras sits BLIP-2's Q-Former: a small learned query transformer that compresses an image's features into just a handful of tokens for the LLM.
In 2026 VLMs, those patch vectors pass through a small learned projector and are inserted directly into the text token stream — the LLaVA pattern — or fused into one stream from the very first layer. The historical alternative is Flamingo-style gated Cross-Attention (2022): the LLM sends out a Query to the image patches, retrieving visual Key/Values at each layer. That pattern lives on in Whisper-style audio models and some video models.
Bridging the two eras sits BLIP-2's Q-Former: a small learned query transformer that compresses an image's features into just a handful of tokens for the LLM.

