Veo 3.1 and Sora 2 bet on large diffusion transformers over spacetime patches with native synchronized audio: long, coherent, physics-plausible clips with dialogue in one pass. Runway Gen-4.5, Kling 3.0, Seedance 2.0, and Higgsfield race on control surfaces (camera moves, references, multi-shot storyboards) and creator workflow, while open-weight models (Wan 2.x, LTX-2, Hunyuan) run the same architecture locally. Raw fidelity is converging; the moat has moved to controllability.
SPACETIME TOKENS
Video models tokenize time the way ViT tokenized space
A clip is first compressed by a causal video VAE into a latent, then cut into a 3-D grid of patches (height × width × frames) and denoised by a transformer that attends across space and time. Temporal attention is what keeps a jacket the same shade in frame 1 and frame 120; it is also why video costs orders of magnitude more compute than stills: the token count grows with every added frame, and the cost of full attention grows with the square of the token count.
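To put a rough number on the compute gap, here is a back-of-the-envelope sketch in Python. The compression factors (8× spatial, 4× temporal, 2×2 latent patches) and the "first frame encoded alone" rule are assumptions chosen for illustration, not the published configuration of any model named here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentGrid:
    """Token grid after VAE compression and patchification (all factors assumed)."""
    frames: int
    height: int
    width: int
    spatial_down: int = 8    # assumed VAE spatial compression per side
    temporal_down: int = 4   # assumed VAE temporal compression
    patch: int = 2           # assumed patch size on the latent grid

    def tokens(self) -> int:
        # Assumption: the causal VAE encodes the first frame alone,
        # then one latent frame per group of `temporal_down` frames.
        latent_frames = 1 + (self.frames - 1) // self.temporal_down
        latent_h = self.height // self.spatial_down // self.patch
        latent_w = self.width // self.spatial_down // self.patch
        return latent_frames * latent_h * latent_w


def attention_pairs(n_tokens: int) -> int:
    # Full self-attention compares every token with every other token.
    return n_tokens * n_tokens


still = LatentGrid(frames=1, height=720, width=1280)
clip = LatentGrid(frames=121, height=720, width=1280)  # about 5 s at 24 fps

ratio = attention_pairs(clip.tokens()) / attention_pairs(still.tokens())
print(still.tokens(), clip.tokens(), ratio)
```

Under these assumptions a five-second 720p clip has 31 times the tokens of a single still and needs 961 times the attention comparisons per layer, which is where the "orders of magnitude" comes from.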
WORKFLOW
Production reality: shots, not films
Every current tool generates seconds, not scenes. Real pipelines (including this studio's) decompose scripts into shots, generate per-shot with locked character references, then edit conventionally. The craft is consistency management across shots — face anchors, style LoRAs, seed discipline — covered in the persona-persistence module.
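A minimal sketch of what shot-level consistency management can look like in code. Everything here is hypothetical: the `Shot` record, the reference IDs, and the name-matching rule are illustrative, and no real generation API is called. The point is that each shot carries its locked character references and a reproducible seed before anything is rendered.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    shot_id: str
    prompt: str
    character_refs: tuple[str, ...]  # hypothetical IDs of locked reference images
    seed: int


def stable_seed(project: str, shot_id: str) -> int:
    # Derive the seed from names so re-planning a shot yields the same seed.
    digest = hashlib.sha256(f"{project}:{shot_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def plan_shots(project: str, beats: list[str], cast: dict[str, str]) -> list[Shot]:
    shots = []
    for i, beat in enumerate(beats, start=1):
        shot_id = f"s{i:03d}"
        # Attach the locked reference for every character named in the beat.
        refs = tuple(
            ref for name, ref in sorted(cast.items())
            if name.lower() in beat.lower()
        )
        shots.append(Shot(shot_id, beat, refs, stable_seed(project, shot_id)))
    return shots


cast = {"Mara": "ref/mara_v3", "Jon": "ref/jon_v1"}
beats = [
    "Wide: Mara crosses the empty station",
    "Close-up: Jon looks up from the bench",
    "Two-shot: Mara and Jon talk under the clock",
]
shots = plan_shots("demo", beats, cast)
for shot in shots:
    print(shot.shot_id, shot.character_refs, shot.seed)
```

Deriving the seed from the project and shot name, instead of drawing it at random, is one way to practice seed discipline: a re-render of shot `s002` starts from the same noise unless someone changes it on purpose.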
§ 02
The lesson
The platforms pushing past static latent diffusion to generate cinematic video from text prompts.
SNAPSHOT · JUNE 2026
The closed frontier: Veo, Sora, Kling
Google Veo 3.1 generates native synchronized audio and video in one pass — the 2025 inflection that ended the silent-clip era. OpenAI Sora 2 holds the photoreal frontier, but OpenAI announced deprecation of the Sora app and API across 2026, so availability is changing. Kling 3.0 leads on cinematic motion and adds a multi-shot storyboard mode for sequenced scenes.
SNAPSHOT · JUNE 2026
Control and multi-shot: Runway, Seedance
Runway Gen-4.5 is the control play: granular camera moves and reference-based character consistency across shots. Seedance 2.0 (ByteDance) generates multi-shot sequences with native audio+video. The 2026 differentiator is no longer raw fidelity — every frontier model is photoreal — it is how precisely you can direct the result.
OPEN WEIGHTS
The open-weight wave
Wan 2.x (Alibaba), Hunyuan Video (Tencent), and LTX-2 (Lightricks) put downloadable weights on local GPUs. They trail the closed frontier by months, not years, and they are the only option when a pipeline needs fine-tuning, full control, or on-prem rendering.
HOW THEY WORK
Parallel denoising, not frame-by-frame prediction
These models do not predict physics sequentially over time. A DiT video model compresses the clip into a video-VAE latent, then denoises all spacetime patches in parallel — every frame of the shot emerges together. Autoregressive frame-by-frame generation is a different, minority family aimed at the world-model direction (interactive simulation), not at cinematic clip generation.
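The loop structure is easier to see in a toy sampler. The sketch below uses NumPy and a straight noise-to-data path (rectified-flow style) with an oracle standing in for the trained transformer; both are assumptions made to keep the example runnable, and production samplers and schedules differ. What it does show faithfully is that the loop runs over noise levels, and each step updates every frame of the latent clip at once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent clip: (frames, height, width, channels). Sizes are illustrative only.
clean = rng.normal(size=(8, 4, 4, 3))


def oracle_velocity(x: np.ndarray, t: float) -> np.ndarray:
    # Stand-in for the trained network (an assumption of this sketch).
    # On the straight path x_t = (1 - t) * clean + t * noise,
    # the velocity dx/dt equals (x_t - clean) / t.
    return (x - clean) / t


def sample(steps: int = 10) -> tuple[np.ndarray, list[np.ndarray]]:
    x = rng.normal(size=clean.shape)   # start from pure noise at t = 1
    dt = 1.0 / steps
    per_frame_error = []
    for i in range(steps):             # loop over noise levels, not over frames
        t = 1.0 - i * dt
        x = x - dt * oracle_velocity(x, t)   # one Euler step touches every frame
        per_frame_error.append(np.abs(x - clean).mean(axis=(1, 2, 3)))
    return x, per_frame_error


video, errors = sample()
print(np.round(errors[0], 3))    # all 8 frames are still noisy after step 1
print(np.round(errors[-1], 3))   # all 8 frames are resolved after the last step
```

An autoregressive generator would instead put the frame index in the outer loop and condition each new frame on the ones already produced.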
Then
Silent 4-second clips with a separate lip-sync pass (2024).
Now · June 2026
Native audio+video in one generation, minute-class shots; the moat moved from fidelity to controllability.