TL;DR
- DDPM (Ho et al., 2020, arXiv:2006.11239) reformulated diffusion as a denoising task with a simple mean-squared-error loss, achieving image quality competitive with GANs.
- DDIM (Song et al., 2020, arXiv:2010.02502) showed that the same trained DDPM network can be sampled deterministically, with a reverse process that can be read as a discretised ODE: comparable quality with roughly 10-50× fewer steps.
- Together they form the backbone of practical diffusion: DDPM for training, DDIM (or its higher-order successors) for sampling.
- Modern samplers — DPM-Solver, UniPC, Heun — extend DDIM's ODE view to higher-order numerical integration for further speedup.
DDPM: Diffusion as Denoising
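Earlier diffusion formulations either optimised a variational bound over the full reverse Markov chain (Sohl-Dickstein et al., 2015) or trained models to estimate the score function ∇_x log p(x) at a range of noise levels (Song & Ermon, 2019). The maths was elegant but training was finicky.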
Ho et al.'s 2020 contribution was reparameterising the objective. Instead of predicting the score or the reverse-process mean directly, the network predicts the noise that was added, and it is trained with a simple MSE loss between predicted and actual noise. The noise schedule was a fixed linear one (the cosine schedule came later). With these choices, training became stable, the U-Net became the standard architecture, and image quality reached GAN-competitive levels.
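The objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `zero_model` is a hypothetical stand-in for a trained U-Net, timesteps are 0-based, and the linear schedule endpoints (1e-4 to 0.02 over 1000 steps) are the values reported by Ho et al.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, eps):
    """Closed-form forward noising: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    # Reshape the per-sample coefficient so it broadcasts over the image axes.
    ab = alpha_bar[t].reshape(-1, *([1] * (x0.ndim - 1)))
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def ddpm_loss(model, x0, alpha_bar, rng):
    """One Monte Carlo estimate of the noise-prediction MSE loss."""
    t = rng.integers(0, len(alpha_bar), size=x0.shape[0])  # uniform 0-based timestep
    eps = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bar, eps)
    return np.mean((eps - model(x_t, t)) ** 2)

def zero_model(x_t, t):
    """Hypothetical placeholder for a trained network: always predicts zero noise."""
    return np.zeros_like(x_t)

betas = np.linspace(1e-4, 0.02, 1000)      # linear schedule
alpha_bar = np.cumprod(1.0 - betas)        # cumulative signal fraction

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 3, 4, 4))     # toy batch standing in for images
loss = ddpm_loss(zero_model, x0, alpha_bar, rng)
```

Because the placeholder predicts zero, the loss here is roughly the variance of the noise itself; a trained network would drive it lower.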
# DDPM forward (noising):
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
#
# Training:
# t ~ Uniform(1..T)
# eps ~ N(0, I)
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
# loss = || eps - model(x_t, t) ||^2
DDPM Sampling
Sampling in vanilla DDPM is stochastic and uses the full T (typically 1000) reverse steps. At each step, the model predicts ε, the algorithm uses it to compute μ_θ(x_t, t) and σ_t, then samples x_{t-1} ~ N(μ_θ, σ_t² I). Quality is high but 1000 forward passes per sample is impractical for production.
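The reverse step just described can be sketched as follows. The code assumes 0-based timesteps and the variance choice σ_t² = β_t (one of the two options discussed by Ho et al.). The `oracle` function is a toy assumption, not a trained model: it returns the exact noise for a dataset whose every sample sits at the origin, which makes the result easy to check.

```python
import numpy as np

def ddpm_step(x_t, t, eps_pred, betas, alpha_bar, rng):
    """One stochastic reverse step x_t -> x_{t-1} with sigma_t^2 = beta_t."""
    beta_t = betas[t]
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(1.0 - beta_t)
    if t == 0:
        return mean  # no noise is added on the final step
    return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)

def ddpm_sample(model, shape, betas, rng):
    """Run the full reverse chain, one network call per timestep."""
    alpha_bar = np.cumprod(1.0 - betas)
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        x = ddpm_step(x, t, model(x, t), betas, alpha_bar, rng)
    return x

T = 50  # shortened chain to keep the toy example fast
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def oracle(x_t, t):
    """Exact noise when all data equals 0, since then x_t = sqrt(1 - ab_t) * eps."""
    return x_t / np.sqrt(1.0 - alpha_bar[t])

rng = np.random.default_rng(0)
sample = ddpm_sample(oracle, (4, 2), betas, rng)  # lands on the origin
```

The loop makes the cost visible: every timestep needs its own call to `model`, which is why the full-length chain is slow.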
DDIM: Deterministic, Fast Sampling
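Song et al.'s 2020 paper showed that the DDPM training objective depends only on the marginals q(x_t | x_0), so it is shared by a family of non-Markovian processes with different amounts of sampling noise. Setting that noise to zero gives a deterministic sampler, which can be read as a discretisation of an ordinary differential equation. The same trained noise-prediction network is reused unchanged.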
The DDIM update is: x_{t-1} = √(ᾱ_{t-1}) · ((x_t − √(1 − ᾱ_t) · ε_θ) / √(ᾱ_t)) + √(1 − ᾱ_{t-1}) · ε_θ. There is no stochastic term; given a starting x_T, the trajectory is fully determined.
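Written as code, the update above is two lines: recover a prediction of x_0 from the predicted noise, then re-noise it to the target level. The sketch below takes ᾱ values directly as arguments; the function and variable names are illustrative.

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """Deterministic DDIM update from noise level ab_t to ab_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

# Sanity check with the true noise: the step moves along the same (x0, eps) pair.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
ab_t, ab_prev = 0.3, 0.6
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
```

With a perfect noise prediction, `x_prev` equals the forward-noised version of the same `x0` at the lower noise level, and setting `ab_prev` to 1.0 returns `x0` itself.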
Crucially, DDIM allows skipping steps. Rather than visiting all T = 1000 timestep indices, sample on a subsequence: every 20th index, for instance, gives 50 steps. Quality degrades gracefully as the step count drops, which is what makes inference practical.
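A minimal sketch of step skipping, assuming 0-based indices and a stride of 20 over T = 1000. The `oracle` model is again a toy assumption (all data equal to a fixed point `mu`, so the exact noise is known), used only to make the loop checkable.

```python
import numpy as np

def strided_timesteps(T=1000, stride=20):
    """0-based indices visited by the sampler, from most to least noisy."""
    return np.arange(T - 1, -1, -stride)

def ddim_sample(model, x_T, alpha_bar, timesteps):
    """Deterministic DDIM sampling over a subsequence of timesteps."""
    x = x_T
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        # After the last visited index, jump to the clean level (alpha_bar = 1).
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = model(x, t)
        x0_pred = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
    return x

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
mu = np.array([1.5, -0.5])

def oracle(x_t, t):
    """Exact noise when every data point equals mu (toy assumption)."""
    return (x_t - np.sqrt(alpha_bar[t]) * mu) / np.sqrt(1.0 - alpha_bar[t])

timesteps = strided_timesteps(T, stride=20)   # 50 indices: 999, 979, ..., 19
rng = np.random.default_rng(0)
sample = ddim_sample(oracle, rng.standard_normal((4, 2)), alpha_bar, timesteps)
```

Only 50 network calls are made, and the sampler needs nothing beyond the `alpha_bar` values at the visited indices.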
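DDIM's determinism enables latent inversion: encode an image x_0 → x_T by running the same update in the opposite direction, then decode along the same timesteps to recover x_0. The reconstruction is approximate rather than exact, because each inversion step evaluates the network at the current point instead of the next one; the error shrinks as the number of steps grows. Inversion underlies real-image editing workflows, including applying prompt-to-prompt to existing images.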
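The round trip can be illustrated with a toy model. The sketch below assumes the data distribution is N(0, I), for which the exact noise prediction has the closed form used in `eps_gaussian`; this is a stand-in chosen for checkability, not a trained network. It measures the relative reconstruction error after inverting and re-decoding with 50 and with 1000 steps.

```python
import numpy as np

def eps_gaussian(x, ab):
    """Exact noise prediction when the data distribution is N(0, I) (toy assumption)."""
    return np.sqrt(1.0 - ab) * x

def ddim_move(x, ab_from, ab_to, eps_model):
    """DDIM update between two noise levels; works in either direction."""
    eps = eps_model(x, ab_from)
    x0_pred = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0_pred + np.sqrt(1.0 - ab_to) * eps

def roundtrip_error(x0, alpha_bar, stride, eps_model=eps_gaussian):
    """Invert x0 to the noisiest level, decode back, return the relative error."""
    levels = np.concatenate(([1.0], alpha_bar[stride - 1::stride]))
    x = x0
    for a, b in zip(levels[:-1], levels[1:]):          # inversion: clean -> noisy
        x = ddim_move(x, a, b, eps_model)
    for a, b in zip(levels[:0:-1], levels[-2::-1]):    # sampling: noisy -> clean
        x = ddim_move(x, a, b, eps_model)
    return np.linalg.norm(x - x0) / np.linalg.norm(x0)

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = np.random.default_rng(0).standard_normal(16)
err_50 = roundtrip_error(x0, alpha_bar, stride=20)     # 50 steps each way
err_1000 = roundtrip_error(x0, alpha_bar, stride=1)    # 1000 steps each way
```

In this toy setting the error is nonzero at both step counts and smaller with the finer discretisation, which matches the behaviour one would expect from a first-order integrator.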
Modern Samplers
| Sampler | Order | Typical steps | Notes |
|---|---|---|---|
| DDIM | 1 (Euler) | 25-50 | Original deterministic |
| DPM-Solver | 2-3 | 10-20 | Higher-order ODE integrator |
| DPM-Solver++ | 2-3 | 10-20 | ODE and SDE variants; more stable under guidance |
| Euler ancestral | 1 (stochastic) | 20-50 | Adds noise per step for diversity |
| Heun | 2 | 15-30 | Simple higher-order, popular default |
| UniPC | 3+ | 5-10 | Predictor-corrector |
| Consistency Models | n/a (not an ODE solver) | 1-4 | Distilled, one-shot capable |
Noise Schedule
Both DDPM and DDIM depend on a noise schedule {β_t} that determines how much noise is added at each forward step. DDPM used a linear schedule; iDDPM (Nichol & Dhariwal, 2021) introduced the cosine schedule that became dominant. The schedule controls how the signal-to-noise ratio decays over time, and small choices materially affect sample quality.
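To make the difference concrete, the sketch below builds both schedules and a log signal-to-noise helper. The linear endpoints (1e-4 to 0.02) are those of Ho et al.; the cosine form, the offset s = 0.008 and the 0.999 clip on β_t follow Nichol & Dhariwal. The function names are illustrative.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for the linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t for the cosine schedule, with betas clipped at 0.999."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    raw = f / f[0]
    betas = np.clip(1.0 - raw[1:] / raw[:-1], 0.0, 0.999)
    return np.cumprod(1.0 - betas)

def log_snr(alpha_bar):
    """log(alpha_bar / (1 - alpha_bar)), the log signal-to-noise ratio."""
    return np.log(alpha_bar) - np.log1p(-alpha_bar)

linear = linear_alpha_bar()
cosine = cosine_alpha_bar()
mid_linear, mid_cosine = linear[499], cosine[499]  # signal left at the midpoint
```

Comparing `mid_linear` with `mid_cosine` shows that the cosine schedule keeps considerably more signal halfway through, which is the property that motivated it.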
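Rectified Flow (Liu et al., 2023) takes a different route, motivated by optimal transport: it interpolates linearly between noise and data and trains the network to predict the velocity along that line, rather than tuning a β schedule. The resulting trajectories are straighter and need fewer sampling steps. SD3 and FLUX use rectified-flow training and benefit from the straighter paths.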
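A minimal sketch of the rectified-flow construction, using the convention that t = 0 is noise and t = 1 is data (some implementations reverse this). The constant velocity field in the example is a toy assumption standing in for a trained network; it shows why a perfectly straight path needs only one Euler step.

```python
import numpy as np

def rf_training_pair(data, noise, t):
    """Interpolated point x_t and the velocity target (data - noise)."""
    t = t.reshape(-1, *([1] * (data.ndim - 1)))
    x_t = (1.0 - t) * noise + t * data
    return x_t, data - noise

def euler_sample(velocity, noise, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 (noise) to t = 1 (data)."""
    x, dt = noise, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x

rng = np.random.default_rng(0)
data = rng.standard_normal((3, 2))
noise = rng.standard_normal((3, 2))
x_t, target = rf_training_pair(data, noise, np.array([0.0, 0.5, 1.0]))

# Toy velocity field: exactly straight paths for these particular pairs.
one_step = euler_sample(lambda x, t: data - noise, noise, n_steps=1)
```

A trained velocity network is only approximately straight, so practical samplers still use several steps, but far fewer than a curved diffusion trajectory would need.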
Practical Recipe
If you train a diffusion model today, the canonical recipe is: train with a DDPM-style noise-prediction loss and a cosine schedule, or with a rectified-flow velocity objective; sample with DPM-Solver++ or Heun at 20-30 steps for quality, or distill to a consistency model for 1-4 step sampling at production latency. Cross-attention to a text encoder provides conditioning; classifier-free guidance controls how strongly samples follow that conditioning.
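Classifier-free guidance combines two noise predictions from the same network, one with the conditioning and one without. The sketch below uses the convention in which a scale of 1 returns the conditional prediction and larger values extrapolate away from the unconditional one; other write-ups parameterise the scale differently, so treat the exact form as an assumption.

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, guidance_scale):
    """Guided noise prediction: extrapolate from unconditional towards conditional."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0])   # toy unconditional prediction
eps_c = np.array([1.0, 1.0])   # toy conditional prediction
guided = cfg_eps(eps_u, eps_c, guidance_scale=2.0)
```

The guided prediction then replaces the plain model output inside whichever sampler is in use, at the cost of two network evaluations per step.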