Vision Transformers — AI Learning Course

§ 01

The lesson

The Big Idea

One universal architecture

You have already learned how transformers process text by turning token sequences into embedding matrices. The same mechanism works on images and audio. The only hurdle is converting pixels and acoustic waveforms into sequences of vectors. Once converted, the attention engine cannot tell the difference.

Text tokensImage patchesAudio framesEmbedding vectorsTransformer attention × NOutput

Pipeline — steps light up in order

ViT

Vision Transformers (2020)

CLIP

Contrastive Language-Image (2021)

Whisper

Speech Transcription (2022)

The Core Problem

Images are 2D Grids

A transformer requires a flat sequence of tokens. The central idea of ViT: treat a photograph like a sentence by slicing it into square patches and feeding those patches as 'tokens'.

The Geometric Slicing

196 Patches per Photo

A standard 224×224 image cut into 16×16 chunks yields exactly 196 patches. Each patch is flattened into a 768-number vector (16×16 pixels × 3 color channels), then passed through a learned linear projection into the model's embedding width — 768 dimensions in ViT-Base specifically — before reaching the encoder.

Visualizing the Patch Pipeline

224×224 image → 768 Array Flatten → Transformer
Engine

Classification Mechanics

Extracting a Singular Conclusion

After passing 196 image patches through the Transformer blocks, the system outputs 196 vectors. To classify an image as exactly a "Dog", we prepend a special `[CLS]` token onto the front of the image sequence. Because of self-attention, this single token attends to all 196 patches, effectively summarizing the whole image.

Attention Funneling

[CLS] P₁ P₂ P₃ … P₁₉₆ [CLS] attends to all patches across every layer.

Positional Embeddings in 2D

Because the transformer sees only a flat sequence, each patch embedding is stamped with a positional embedding that records where it sat in the original 2D grid.

CNN — Local First

Sliding Geometry Window

A CNN restricts its vision, sliding a small 3×3 window across the grid. It must understand local edges before looking globally. Data efficient, but it struggles to relate objects on opposite sides of a picture.

ViT — Global Dominance

Simultaneous Field Calculation

ViT drops that constraint. The very first attention layer can relate a patch in the top-left to a patch in the bottom-right in a single step.

Modern ViT Family Tree

DeiT (2021) — Distilled ViTs that train competitively without Google-scale datasets.

Swin Transformer (2021) — Microsoft's sliding-window hierarchical variant, strong on object detection.

Masked Autoencoders (MAE) — 2022. Mask out 75% of image patches and train the network to reconstruct them. This is self-supervised pretraining: the reconstruction task itself is discarded, and the learned features are fine-tuned on downstream tasks — not zero-shot classification.

ViT-22B (2023) — Google's 22-billion-parameter vision model.

Patches for 224px

0Bn

Parameter Ceiling

MAE mask ratio

Breakthrough Algorithm

Contrastive Mapping Alignment

CLIP runs two independent encoders: a text transformer and an image transformer. Given an image of a dog and the caption "A fluffy dog", the contrastive objective pulls both representations into the same numerical neighborhood and pushes apart pairs that don't match.

ViT Encoder
Image → Dot Product Intersection Map ← GPT Encoder
Text

Generative Engine Core

Text-to-image systems used CLIP in different ways across eras. DALL-E 1 (2021) was autoregressive — a discrete VAE plus a transformer, with no CLIP conditioning at all. DALL-E 2 (2022, "unCLIP") generated images from CLIP embeddings predicted from text. The image of a frozen CLIP encoder steering diffusion describes 2021-era CLIP-guided diffusion — a historical technique. DALL-E 3 and modern models instead condition the denoiser on T5 or LLM text encoders via cross-attention.

Zero-Shot Classification

Show CLIP a new photo and ask it to compare the image embedding against 1,000 written class descriptions. It clusters the photo to the closest one. Zero-shot CLIP matched ResNet-50's ~76% ImageNet accuracy — matching a 2015 supervised baseline without any fine-tuning was the achievement.

Then

CLIP (2021) was the universal vision-language encoder — nearly every multimodal system was built on it.

Now · June 2026

SigLIP/SigLIP 2 is the standard VLM vision backbone, DINOv2/v3 dominates self-supervised visual features, and native-resolution patching (NaViT-style) has replaced fixed 224×224 inputs.

Audio Matrix Theory

16,000 Numbers per Second

Standard speech files are sampled at 16kHz, so a 30-second excerpt contains 480,000 individual samples. Feeding nearly half a million data points into standard $O(N^2)$ Transformer attention is intractable — memory grows quadratically with sequence length.

1. Raw Waveform 480k Samples → 2. Log-Mel Spectrogram 80 Bins × 3k Frames (Tractable!)

Spectrogram Compression

A spectrogram graphs frequency magnitude over time. Warping it with a "Mel" filter (a scale modeled on human inner-ear sensitivity) compresses the raw waveform into a textured visual 'image' that standard vision-style networks can process.

The BERT Parallel

Predicting Masked Audio Voids

Just like BERT masks out words, Wav2Vec 2.0 ingests roughly 53,000 hours of unlabelled speech (the Libri-Light corpus) and masks spans of the audio. It uses attention to predict what the missing sound must be from the audio surrounding it.

Raw WaveDeep CNN FeaturesEncoder Engine (Random Dropout)Reconstruction Map

Pipeline — steps light up in order

Extreme Efficiency

After pretraining on tens of thousands of hours of unlabelled speech, the model can be fine-tuned for transcription with as little as 10 minutes of labelled transcripts.

Microsoft WavLM Integration

Later iterations like WavLM inject noise and overlapping speakers during masking, baking denoising and speaker-separation ability directly into the pretrained model.

Scale Over Elegance

680,000 Labelled Hours

OpenAI took the opposite route from Wav2Vec. Whisper trains on 680,000 hours of (audio, transcript) pairs scraped from the internet, using a standard encoder-decoder with cross-attention. It handles language detection, translation, and transcription across 99 languages — but it is not a streaming model: it processes audio in fixed 30-second log-Mel windows.

Encoder-Decoder Native Handoff

Log-Mel
Spectrogram Ingests ↑ Encoder
Loop ×N Cross-Attn → Decoder
Auto-regressor Writes ↑ "The AI
Transcribes…"

Architectural Hallucinations

Whisper's known flaw is its reliance on language modelling. Because the decoder is a language model trying to 'read' the audio, when it hits a stretch of silence or static the auto-regressor often guesses a plausible sentence ending — hallucinating fluent prose over empty air.

The Great Unification Phase

Everything Degrades to Tokens

Omni-modal architectures like Gemini 1.5 and GPT-4o discard separate isolated brains. (Claude 3.5 Sonnet, from the same era, had vision but not native audio.) When every picture can be encoded into a patch sequence and every audio wave converted into discrete acoustic tokens, you pipe all of them into the same Transformer engine side by side.

Text TokensImage TokensAudio MatrixLive SpeechSyntax Data

Pipeline — steps light up in order

Native Reaction Latency

By bypassing rigid translation chains (Speech-To-Text → Text Engine → Text-to-Speech), the GPT-4o native omni-model responds to audio in about 320ms on average (232ms at minimum) — close enough to human conversational timing to support natural interruptions.

Sensory Modality Hardware Filter Step Flagship Models Engine Output Type Pure Text BPE Tokenization → Dict Lookup GPT, LLaMA, Claude Auto-Regressive Generative Photographic Square Grid Splitting → Arrays ViT, CLIP, Swin Deep Dimensional Matching Acoustic Audio Log-Mel Convolution → Frame Arrays Whisper, Wav2Vec 2.0 Explicit Transcription

Research Directives

> ViT — Dosovitskiy et al., (arXiv:2010.11929)
> MAE — He et al., Meta 2022 (arXiv:2111.06377)
> CLIP — Radford et al., OpenAI 2021 (arXiv:2103.00020)
> Whisper — Radford et al., OpenAI 2022 (arXiv:2212.04356)

Vision Transformers spotlight illustration

§ 02

The playground.

Theory above, instrument below. This interactive panel runs live in the page — drag, type, and watch the mechanism respond.

Playground · Vision TransformersOpen full screen ↗

§ 03

Vision Transformers
are tokens too.

The lesson

One universal architecture

Images are 2D Grids

196 Patches per Photo

Visualizing the Patch Pipeline

Extracting a Singular Conclusion

Attention Funneling

Positional Embeddings in 2D

Sliding Geometry Window

Simultaneous Field Calculation

Contrastive Mapping Alignment

16,000 Numbers per Second

Predicting Masked Audio Voids

680,000 Labelled Hours

Encoder-Decoder Native Handoff

Everything Degrades to Tokens

The playground.

Further reading.