Video adds time — and time breaks naive extrapolations
If diffusion can generate an image, can it generate 24 of them and call it video? No. Each frame would be coherent on its own but completely independent from the next. The result is flickering chaos. Video diffusion has to enforce temporal coherence — a person's face, clothing, lighting must stay consistent across frames, and motion must follow physics.
3D CONVOLUTIONS & TEMPORAL ATTENTION
Two architectural choices for fusing across time
Early video diffusion (VDM, Imagen Video) used 3D convolutions that span (height, width, time) jointly. The 2022–23 generation controlled attention cost with sparse or factorized temporal attention — each frame attending only to a few nearby frames. Modern systems run full 3D self-attention over spacetime patches in a compressed video-VAE latent: affordable because aggressive VAE compression shrinks the token count before attention ever runs.
CASCADE PIPELINE (HISTORICAL)
Early video models were 2-3 stage cascades
The Imagen Video / Make-A-Video pattern (2022–23): (1) a base model generated 16-32 low-resolution frames. (2) Temporal super-resolution increased frame count (16 → 60 frames). (3) Spatial super-resolution upscaled each frame (256×256 → 1024×1024). Each stage was its own model — the cascade made high-resolution video feasible on early compute budgets. Current models are single-stage latent DiTs, with at most an optional separate upscaler.
THE MODERN STACK
Causal VAE, latent DiT, flow matching
The current recipe: a causal video VAE compresses the clip across space and time, a latent DiT with full 3D attention denoises all spacetime patches jointly, and training uses a flow-matching objective rather than classic DDPM noise prediction. Distillation then cuts sampling to a handful of steps for near-realtime generation.
PHYSICS & CONSISTENCY
What makes Sora different from Gen-1
OpenAI's Sora technical report (2024) framed video generation as world simulation. The model learns implicit physics from large-scale training (objects don't teleport, gravity applies, occluded things reappear correctly) — that physical plausibility is what separated the 2024 generation from earlier models. By 2026 it is table stakes; the frontier delta is native synchronized audio and longer coherent shots — single generations that hold a scene, its sound, and its characters together. The cost remains enormous compute budgets and proprietary data.