Diffusion Math — AI Learning Course

§ 01

The lesson

Forward + reverse process · score matching · why noise schedules matter

FORWARD PROCESS

Noising a clean image, one step at a time

The forward process q(x_t | x_{t-1}) = N(x_t; √(1−β_t) x_{t-1}, β_t I) adds a small amount of Gaussian noise at each timestep. After T = 1000 steps with a properly chosen noise schedule, the image is practically indistinguishable from pure Gaussian noise. The forward process has no learnable parameters: it is a fixed corruption schedule, so the model never has to learn it.
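A minimal NumPy sketch of the forward process may make this concrete. It assumes the linear DDPM schedule (β from 0.0001 to 0.02 over T = 1000 steps) and uses a random 8×8 array as a stand-in for an image. The function names `forward_step` and `forward_jump` are illustrative, and the closed form x_t = √ᾱ_t x_0 + √(1−ᾱ_t) ε with ᾱ_t = ∏(1−β_s) is the standard identity obtained by composing the Gaussian steps; it is not spelled out in the lesson text.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule (DDPM values)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = product of (1 - beta_s)

def forward_step(x_prev, t, rng):
    """One draw from q(x_t | x_{t-1}); t is a 0-based index into betas."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * noise

def forward_jump(x0, t, rng):
    """Draw x_t directly from x_0 using the closed-form marginal."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # stand-in for a clean image

# Step-by-step noising through all T steps.
x = x0
for t in range(T):
    x = forward_step(x, t, rng)

# Same distribution in one shot, which is what training code typically uses.
x_T, eps = forward_jump(x0, T - 1, rng)

# Fraction of the original signal amplitude that survives at the last step.
signal_left = np.sqrt(alpha_bars[-1])
print(f"signal scale at t=T: {signal_left:.4f}")
```

The step-by-step loop and the one-shot jump sample from the same distribution; the jump is what makes training on a random timestep cheap.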

REVERSE PROCESS

The model only has to learn one thing: predict the noise

The reverse process p_θ(x_{t-1} | x_t) is what the network learns (historically a U-Net, today often a diffusion transformer). At each timestep, the network receives the noisy image x_t and the timestep t and predicts the noise ε that was added. The sampler subtracts a scaled portion of that prediction to get a slightly less noisy image x_{t-1}, then repeats this T times. The loss is simply L = ||ε − ε_θ(x_t, t)||², the squared error between the true noise and the predicted noise.
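The training loss and a single reverse step can be sketched as follows. This is a sketch under stated assumptions, not a reference implementation: `zero_model` is a hypothetical placeholder where a trained network would go, the schedule is the linear DDPM one, and the reverse update follows the ancestral sampling rule from the DDPM paper with the variance choice σ_t² = β_t (one of the options discussed there).

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def noise_prediction_loss(eps_model, x0, t, rng):
    """L = mean ||eps - eps_model(x_t, t)||^2 for one image and one timestep."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def reverse_step(eps_model, x_t, t, rng):
    """One ancestral sampling step x_t -> x_{t-1} (sigma_t^2 = beta_t assumed)."""
    eps_hat = eps_model(x_t, t)
    # Remove a scaled portion of the predicted noise, then rescale.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                   # no fresh noise on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def zero_model(x_t, t):
    """Hypothetical stand-in for a trained network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(16, 16))

loss = noise_prediction_loss(zero_model, x0, t=500, rng=rng)
x_T = rng.standard_normal((16, 16))
x_prev = reverse_step(zero_model, x_T, t=T - 1, rng=rng)
print(f"loss of the untrained placeholder: {loss:.3f}")
```

With the placeholder that always predicts zero, the loss is just the mean squared noise, so it sits near 1; a trained network would drive it lower.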

SCORE MATCHING

Why noise prediction is equivalent to learning the data distribution

Song et al. (2021) showed that predicting noise is mathematically equivalent, up to a known scale factor, to estimating the score function ∇_x log p_t(x), the gradient of the log-density of the noised data at each noise level. Concretely, ε_θ(x_t, t) ≈ −√(1−ᾱ_t) ∇_x log p_t(x_t), where ᾱ_t = ∏_{s≤t} (1−β_s). Score matching is a well-studied technique going back to Hyvärinen (2005). In that sense, diffusion models are score-based models in disguise, and this connection unifies DDPM with the earlier NCSN family of models.
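The link between noise prediction and the score can be checked numerically in a case where everything is known in closed form. The sketch below assumes one-dimensional Gaussian data (mean 2, standard deviation 0.5, both arbitrary choices) and a single noise level ᾱ = 0.3. For Gaussian data the best possible noise predictor is linear in x_t, so a least-squares line stands in for a trained network. The relation being tested is score = −(predicted noise) / √(1−ᾱ).

```python
import numpy as np

def marginal_score(x_t, alpha_bar, mean0, std0):
    """Exact score of the noised marginal when x_0 ~ N(mean0, std0^2)."""
    var_t = alpha_bar * std0 ** 2 + (1.0 - alpha_bar)
    return -(x_t - np.sqrt(alpha_bar) * mean0) / var_t

rng = np.random.default_rng(0)
alpha_bar, mean0, std0 = 0.3, 2.0, 0.5    # illustrative values
n = 200_000

x0 = mean0 + std0 * rng.standard_normal(n)
eps = rng.standard_normal(n)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# "Train" the optimal noise predictor: regress eps on x_t by least squares.
slope, intercept = np.polyfit(x_t, eps, deg=1)
eps_hat = slope * x_t + intercept

# Convert the noise prediction into a score estimate and compare.
score_from_eps = -eps_hat / np.sqrt(1.0 - alpha_bar)
score_exact = marginal_score(x_t, alpha_bar, mean0, std0)
max_gap = np.max(np.abs(score_from_eps - score_exact))
print(f"largest gap between the two scores: {max_gap:.4f}")
```

The remaining gap comes only from fitting the line on a finite sample; it shrinks as `n` grows.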

NOISE SCHEDULE

Linear vs cosine vs sigmoid — the choice that affects everything

The schedule β_t controls how fast information is destroyed. The original DDPM paper used a linear schedule from β₁ = 0.0001 to β_T = 0.02. iDDPM (Nichol & Dhariwal 2021) showed that a cosine schedule destroys information more gradually, preserving more signal through the early and middle timesteps, and that this leads to better samples. Modern systems use a range of schedules, in both variance-preserving and variance-exploding formulations. Practical takeaway: the schedule choice alone can change FID by several points.
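To see the difference between the two schedules, it helps to compare the cumulative signal level ᾱ_t = ∏(1−β_s) rather than β_t itself. The sketch below uses the linear endpoints quoted above and the cosine form from the iDDPM paper, ᾱ_t = f(t)/f(0) with f(t) = cos²(((t/T + s)/(1 + s)) · π/2), offset s = 0.008, and β_t clipped at 0.999. The function names are illustrative.

```python
import numpy as np

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for the linear DDPM schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bars(T, s=0.008):
    """Cumulative signal level for the iDDPM cosine schedule."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2) ** 2
    alpha_bars = f / f[0]
    # Recover per-step betas and clip them, as the paper does, to avoid beta = 1.
    betas = np.clip(1.0 - alpha_bars[1:] / alpha_bars[:-1], 0.0, 0.999)
    return np.cumprod(1.0 - betas)

T = 1000
lin_ab = linear_alpha_bars(T)
cos_ab = cosine_alpha_bars(T)

for t in (100, 500, 900):
    print(f"t={t:4d}  linear={lin_ab[t - 1]:.4f}  cosine={cos_ab[t - 1]:.4f}")
```

Halfway through the chain the cosine schedule still retains a substantially larger share of the signal than the linear one, which is the behavior the paragraph above describes.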

FLOW MATCHING

Rectified flow — the objective behind FLUX/SD3-class models

Flow matching trains the network to predict a velocity v = x₁ − x₀ along a straight path x_t = (1 − t) x₀ + t x₁ from noise x₀ to data x₁, with continuous time t ∈ [0, 1], so no discrete β schedule is needed. It has two practical companions. First, classifier-free guidance trains the model both with and without conditioning, then at sampling time extrapolates from the unconditional prediction past the conditional one; it is among the most-used equations in practice. Second, sampling takes 20–30 steps rather than 1000 because samplers such as DDIM and DPM-Solver++ (DPM++) take larger, smarter steps along the learned trajectory.
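The three pieces in this section (the flow-matching target, classifier-free guidance, and few-step sampling) can be sketched together in a few lines. Everything here is illustrative: `oracle_velocity` is a hypothetical stand-in for a trained network, valid only for a toy dataset consisting of a single point; the guidance rule v = v_uncond + w · (v_cond − v_uncond) is one common convention for the guidance scale w; and plain Euler integration is used as the simplest possible sampler, not as a description of DDIM or DPM-Solver++.

```python
import numpy as np

def flow_matching_pair(x0_noise, x1_data, t):
    """Training input x_t on the straight path and its velocity target."""
    x_t = (1.0 - t) * x0_noise + t * x1_data
    v_target = x1_data - x0_noise
    return x_t, v_target

def cfg(v_uncond, v_cond, scale):
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    return v_uncond + scale * (v_cond - v_uncond)

def euler_sample(velocity_fn, x0_noise, num_steps=20):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x = x0_noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Toy setting: the whole "dataset" is one point, so the ideal velocity is known.
target = np.array([1.0, -2.0])

def oracle_velocity(x, t):
    """Hypothetical perfect model for the single-point dataset."""
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(2)

x_half, v_half = flow_matching_pair(noise, target, t=0.5)
sample = euler_sample(oracle_velocity, noise, num_steps=20)
guided = cfg(np.zeros(2), v_half, scale=3.0)
print("sample:", sample)
```

Because the path is straight, the velocity is constant along it, and even a coarse Euler integration lands on the data point in this toy case. Real models have curved learned trajectories, which is where higher-order samplers earn their keep.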

$Diffusion Math spotlight illustration$

§ 02

Diffusion math,
slowly.