Diffusion Models



Machine Learning | Difficulty: ★★★★★ | Depth: 12 | Unlocks: 0

Denoising for generation. Score matching, noise schedules.


Core Concepts #

Key Symbols & Notation #

epsilon_theta(x_t,t) - the parameterized model that predicts the added noise (or equivalently represents the score)

Essential Relationships #

Prerequisites (2) #

Variational Autoencoders (6 atoms)
Stochastic Gradient Descent (5 atoms)

Advanced Learning Details

Graph Position #

Depth Cost: 179
Fan-Out (ROI): 0
Bottleneck Score: 0
Chain Length: 12

Cognitive Load #

Atomic Elements: 5
Total Elements: 61
Percentile Level: L4
Atomic Level: L3

All Concepts (23) #

Teaching Strategy #

Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.

Diffusion models turn the generative modeling problem into something deceptively simple: learn to undo noise. If you can reliably denoise a sample that has been corrupted in a controlled way, you can start from pure noise and iteratively “walk back” to realistic data.

TL;DR:

A diffusion model defines (1) a forward Markov chain that gradually adds Gaussian noise to data using a schedule {βₜ}, and (2) a learned network ε_θ(xₜ, t) that predicts the added noise (or equivalently the score ∇_x log p(xₜ)). Generation runs the reverse process: start from x_T ∼ 𝒩(0, I) and repeatedly denoise to sample x₀.

What Is a Diffusion Model? #

Diffusion models are generative models built around one central trick: instead of trying to model a complex data distribution p_data(x₀) directly, we define a destruction process that is easy to analyze (adding Gaussian noise over many small steps), then learn a construction process that reverses it.

Why this helps:

A diffusion model typically has two coupled processes indexed by discrete time t ∈ {1, …, T} (T can be 1000 in classical DDPMs; modern samplers often use fewer steps with improved solvers):

  1. Forward (noising) process q:
  2. Reverse (denoising) process p_θ:

The key learned object is a neural network that depends on the current noisy sample and the time index: ε_θ(xₜ, t).

From ε_θ you can derive other equivalent parameterizations:

Even if you only remember one sentence: diffusion models work because “denoise step-by-step” is a tractable supervised learning task.

Connection to ideas you already know (VAEs):

Throughout this lesson we’ll use bold for vectors (e.g., x, ε), and assume images are flattened into vectors in ℝ^d (but all formulas hold per-pixel / per-dimension).

Core Mechanic 1: Forward Gaussian Noising (the diffusion / corruption process) #

The forward process is a time-indexed Markov chain that gradually destroys information by adding Gaussian noise according to a noise schedule.

Why define a forward process at all? #

Because it gives you a controlled way to generate paired training data:

This turns generative modeling into supervised learning.

The step-wise noising rule #

A common discrete-time forward process (DDPM) is:

q(xₜ | x_{t−1}) = 𝒩( xₜ ; √(αₜ) x_{t−1}, βₜ I )

where αₜ = 1 − βₜ and βₜ ∈ (0, 1) is the variance of the noise added at step t.

Intuition:

Over many steps, the signal decays and noise dominates.
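The step-wise rule can be simulated directly. The following is a minimal sketch in plain Python (standard library only, chosen for portability); the function names and the constant schedule `betas = [0.02] * 500` are illustrative assumptions, not part of any particular library.

```python
import math
import random

def forward_step(x_prev, beta_t, rng):
    """One draw from q(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, beta_t I), per dimension."""
    alpha_t = 1.0 - beta_t
    return [math.sqrt(alpha_t) * xi + math.sqrt(beta_t) * rng.gauss(0.0, 1.0)
            for xi in x_prev]

def forward_chain(x0, betas, rng):
    """Apply the noising step once per entry in `betas`; returns the whole trajectory."""
    trajectory = [list(x0)]
    for beta_t in betas:
        trajectory.append(forward_step(trajectory[-1], beta_t, rng))
    return trajectory

rng = random.Random(0)
betas = [0.02] * 500                      # hypothetical constant schedule, illustration only
traj = forward_chain([2.0, -1.0], betas, rng)

# Fraction of the original signal that survives all steps: prod_t sqrt(alpha_t).
signal_scale = math.prod(math.sqrt(1.0 - b) for b in betas)
```

With this schedule `signal_scale` is well below 1%, so the last entry of `traj` is almost entirely noise.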

Collapsing many steps into one: q(xₜ | x₀) #

A crucial simplification is that the composition of Gaussians stays Gaussian, so we can sample xₜ directly from x₀ without simulating every intermediate step.

Define the cumulative product:

ᾱₜ = ∏_{s=1}^t α_s

Then:

q(xₜ | x₀) = 𝒩( xₜ ; √(ᾱₜ) x₀, (1 − ᾱₜ) I )

Equivalently, we can reparameterize:

xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε, where ε ∼ 𝒩(0, I)

This single equation is the workhorse of diffusion training.
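As a sketch of how the closed form is typically used, the snippet below builds ᾱₜ as a running product and samples xₜ in one shot. The helper names `alpha_bars` and `q_sample` are assumptions made for this example; the numbers reuse the 1-D worked example later in this lesson (x₀ = 2, ᾱₜ = 0.81, ε = −0.5).

```python
import math

def alpha_bars(betas):
    """Cumulative products of alpha_s = 1 - beta_s; entry t-1 holds alpha_bar_t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, alpha_bar_t, eps):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, per dimension."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

# Same numbers as the 1-D worked example: x0 = 2, alpha_bar_t = 0.81, eps = -0.5.
x_t = q_sample([2.0], 0.81, [-0.5])
```

Here `x_t[0]` comes out at about 1.582, matching the hand computation.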

Pause and interpret each term:

Noise schedules: choosing βₜ (or ᾱₜ) #

The schedule determines how quickly you destroy information.

Design goals:

Common schedules:

| Schedule | How it behaves | Pros | Cons |
| --- | --- | --- | --- |
| Linear βₜ | βₜ increases linearly | Simple, classic DDPM | Not optimal SNR allocation |
| Cosine ᾱₜ | ᾱₜ follows a cosine curve | Good empirical performance, smooth | Needs careful discretization |
| Learned / piecewise | Optimized for sampling steps | Can be very fast at inference | Adds complexity |

A useful quantity is the SNR at time t:

SNRₜ = ᾱₜ / (1 − ᾱₜ)
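To make the schedules and the SNR concrete, here is a hedged sketch of a linear-βₜ schedule and a cosine-ᾱₜ schedule. The default endpoints (1e-4 to 0.02) and the offset `s = 0.008` are commonly quoted values used here as assumptions; the function names are made up for this example.

```python
import math

def linear_beta_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for betas rising linearly from beta_start to beta_end (assumed endpoints)."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos(((t / T + s) / (1.0 + s)) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def snr(alpha_bar_t):
    """Signal-to-noise ratio alpha_bar_t / (1 - alpha_bar_t)."""
    return alpha_bar_t / (1.0 - alpha_bar_t)

T = 1000
linear = linear_beta_alpha_bars(T)
cosine = cosine_alpha_bars(T)
snr_linear = [snr(ab) for ab in linear]
snr_cosine = [snr(ab) for ab in cosine]
```

Both SNR curves fall monotonically toward zero. The cosine curve reaches ᾱ_T ≈ 0 exactly at t = T, which is one reason implementations usually clip the implied βₜ near the end.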

Practical consequence: the network must learn to denoise across wildly different regimes. This is why the time embedding (conditioning on t) is essential.

Time conditioning #

The network ε_θ(xₜ, t) is conditioned on time t (or a continuous time value). In practice:

This enables one shared network to act like a family of denoisers, one for each noise level.
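One widely used way to feed t to the network is a sinusoidal embedding. This lesson does not prescribe a particular scheme, so the sketch below is an assumption for illustration, with a hypothetical function name, an even `dim`, and an arbitrary `max_period`.

```python
import math

def sinusoidal_time_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep to `dim` features: cos and sin at geometrically spaced frequencies.

    Assumes `dim` is even. Illustrative only; real models usually pass the result
    through a small MLP before injecting it into the network.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

emb_early = sinusoidal_time_embedding(10, dim=8)    # low-noise timestep
emb_late = sinusoidal_time_embedding(900, dim=8)    # high-noise timestep
```

Different timesteps map to clearly different feature vectors, which lets a single set of weights behave differently at each noise level.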

At this point you have a forward process q that is:

Next we learn the reverse.

Core Mechanic 2: Learned Denoiser / Score (ε_θ and its meanings) #

The learned model sits at the center of diffusion: ε_θ(xₜ, t). Superficially it’s “just” a network that predicts noise, but understanding what it represents explains why diffusion models work and how score matching appears.

Why predict noise? #

When we write

xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε

the only randomness (given x₀ and t) is ε ∼ 𝒩(0, I). If a network can infer the likely ε from xₜ, it can recover information about x₀.

Noise prediction is attractive because:

The standard training objective #

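Sample a clean example x₀ ∼ p_data, a timestep t (for example uniformly from {1, …, T}), and noise ε ∼ 𝒩(0, I); then form xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε.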

Then minimize:

L(θ) = 𝔼_{x₀,t,ε} [ ‖ ε − ε_θ(xₜ, t) ‖² ]

This is the “simple loss” from DDPM.
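The objective can be estimated by Monte Carlo. The sketch below does this for 1-D data with a stand-in "model" (any callable taking `(x_t, t)`); the dataset, the five-step schedule, and `zero_model` are hypothetical placeholders. Nothing here performs gradient updates; it only evaluates the loss.

```python
import math
import random

def simple_loss(model, data, alpha_bars, rng, n_samples=1000):
    """Monte Carlo estimate of E[(eps - eps_theta(x_t, t))^2] for 1-D data."""
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.choice(data)                       # x0 ~ p_data (empirical)
        t = rng.randrange(1, len(alpha_bars) + 1)   # t ~ Uniform{1, ..., T}
        eps = rng.gauss(0.0, 1.0)                   # eps ~ N(0, 1)
        ab = alpha_bars[t - 1]
        x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
        total += (eps - model(x_t, t)) ** 2
    return total / n_samples

# Hypothetical stand-ins: a two-point dataset and a short hand-picked schedule.
data = [-1.0, 1.0]
alpha_bars = [0.99, 0.9, 0.5, 0.1, 0.01]

def zero_model(x_t, t):
    """Untrained baseline that always predicts 'no noise'."""
    return 0.0

baseline = simple_loss(zero_model, data, alpha_bars, random.Random(0), n_samples=5000)
```

The baseline lands near 1, since 𝔼[ε²] = 1 for standard normal noise; a trained network should score below that.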

Breathing room: what does minimizing this actually do? #

At a fixed t, xₜ is a noisy version of real data. There are many possible clean x₀ that could have produced a given xₜ, but the network learns the conditional expectation of noise given xₜ and t.

For MSE regression,

ε_θ*(xₜ,t) = 𝔼[ ε | xₜ, t ]

That conditional expectation encodes the structure of the data distribution because the posterior over x₀ given xₜ is shaped by p_data.

From noise prediction to x₀ prediction #

Rearrange the reparameterization:

xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε

Solve for x₀:

x₀ = ( xₜ − √(1 − ᾱₜ) ε ) / √(ᾱₜ)

So given ε_θ, we can define an implicit estimate of the clean sample:

\hat{x}₀(xₜ,t) = ( xₜ − √(1 − ᾱₜ) ε_θ(xₜ,t) ) / √(ᾱₜ)

This is used during sampling and for guidance methods.
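A minimal sketch of this conversion, using a hypothetical helper `predict_x0` and the numbers from the second worked example in this lesson:

```python
import math

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t), per dimension."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_pred)]

# alpha_bar_t = 0.36, x_t = (1.2, -0.3), predicted noise = (0.5, -1.0)
x0_hat = predict_x0([1.2, -0.3], [0.5, -1.0], 0.36)
```

The result is approximately (1.333, 0.833). Note the division by √(ᾱₜ): at very high noise levels (ᾱₜ near 0) small errors in the predicted noise are strongly amplified in the estimate.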

From noise prediction to the score (score matching connection) #

The score of a density p(x) is:

∇_{x} log p(x)

Diffusion theory says the optimal denoiser corresponds to the score of the noisy distribution at each noise level.

For the forward process q, one can show a relationship of the form:

s_θ(xₜ,t) ≈ ∇_{xₜ} log q(xₜ)

and with noise prediction parameterization:

s_θ(xₜ,t) = − ε_θ(xₜ,t) / √(1 − ᾱₜ)

Up to conventions and scaling, predicting noise is equivalent to predicting the score.
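The scaling can be checked on the one case where the score is available in closed form: the conditional Gaussian q(xₜ | x₀). This is only a sanity check of the formula, under the assumption that the network's prediction equals the true ε; the network itself approximates the score of the marginal q(xₜ). Function names are illustrative.

```python
import math

def score_from_eps(eps_pred, alpha_bar_t):
    """s = -eps / sqrt(1 - alpha_bar_t), per dimension."""
    scale = math.sqrt(1.0 - alpha_bar_t)
    return [-e / scale for e in eps_pred]

def conditional_gaussian_score(x_t, x0, alpha_bar_t):
    """Exact score of q(x_t | x0) = N(sqrt(alpha_bar_t) x0, (1 - alpha_bar_t) I)."""
    a = math.sqrt(alpha_bar_t)
    return [-(x - a * m) / (1.0 - alpha_bar_t) for x, m in zip(x_t, x0)]

alpha_bar_t = 0.36
x0 = [1.0, -2.0]
eps = [0.5, -1.0]
x_t = [math.sqrt(alpha_bar_t) * m + math.sqrt(1.0 - alpha_bar_t) * e
       for m, e in zip(x0, eps)]

from_eps = score_from_eps(eps, alpha_bar_t)
exact = conditional_gaussian_score(x_t, x0, alpha_bar_t)
```

Both lists agree up to floating-point error, confirming the −1/√(1 − ᾱₜ) factor.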

Why the score matters #

If you know the score field ∇ log p(x), you know in which direction the log-density increases most steeply at every point. Sampling methods (reverse SDE / Langevin-like dynamics) use the score to push noisy samples toward high-density regions near the data manifold.

This gives diffusion a deep conceptual link: it’s not merely denoising; it’s learning a time-indexed family of score functions.

Weighting across timesteps #

Uniformly sampling t often works, but it can overweight uninformative very-noisy steps or underweight crucial mid-SNR steps.

Many systems use weighted losses:

L(θ) = 𝔼[ w(t) ‖ ε − ε_θ(xₜ,t) ‖² ]

Common ideas:

At this stage we have a trained ε_θ. Next we need to turn it into a generative procedure.

Core Mechanic 3: Reverse Generative Process (from noise to data) #

Generation is the reverse of the forward noising chain. The forward chain is easy because it’s Gaussian by construction; the reverse chain is hard because it depends on the unknown data distribution. The learned model ε_θ supplies the missing information.

Reverse-time Markov chain (DDPM view) #

We want transitions p_θ(x_{t−1} | xₜ) that approximately invert q(xₜ | x_{t−1}). DDPMs choose Gaussian reverse transitions:

p_θ(x_{t−1} | xₜ) = 𝒩( x_{t−1} ; μ_θ(xₜ,t), Σ_θ(t) )

A standard choice fixes Σ_θ(t) to a known variance (e.g., β̃ₜ I), and uses the network to compute the mean.

A common form (using noise prediction) is:

μ_θ(xₜ,t) = (1/√(αₜ)) · ( xₜ − (βₜ/√(1 − ᾱₜ)) ε_θ(xₜ,t) )

Then sampling is:

x_{t−1} = μ_θ(xₜ,t) + σₜ z, z ∼ 𝒩(0, I)

with σₜ chosen from the variance schedule.
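A single reverse update can be sketched as follows. The function names are hypothetical, and `sigma_t` is passed in explicitly because the choice of variance (for example σₜ² = βₜ) is a design decision rather than something fixed by the formula. The numbers match the third worked example in this lesson.

```python
import math
import random

def reverse_mean(x_t, eps_pred, alpha_t, alpha_bar_t):
    """mu = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t), per dimension."""
    beta_t = 1.0 - alpha_t
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    return [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_pred)]

def reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """x_{t-1} = mu + sigma_t * z with z ~ N(0, I)."""
    mu = reverse_mean(x_t, eps_pred, alpha_t, alpha_bar_t)
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mu]

# x_t = 0.7, alpha_t = 0.9, alpha_bar_t = 0.5, predicted noise = 0.2
mu = reverse_mean([0.7], [0.2], alpha_t=0.9, alpha_bar_t=0.5)

# One stochastic step, assuming sigma_t^2 = beta_t = 0.1 for illustration.
x_prev = reverse_step([0.7], [0.2], 0.9, 0.5, math.sqrt(0.1), random.Random(0))
```

`mu[0]` is about 0.7081. At the final step (t = 1) implementations typically set the injected noise to zero so the output is not re-noised.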

Pause: what is happening qualitatively?

This stochasticity helps match the true reverse distribution and avoids collapsing to a single mode.

Deterministic sampling (DDIM intuition) #

If you set the injected noise to zero (or modify the update), you get deterministic trajectories that still land on realistic samples. This is the basis for faster sampling variants.
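One way to write the deterministic update, sketched under the assumption of the zero-noise DDIM form: estimate the clean sample, then re-noise it to the previous noise level with the same predicted noise instead of fresh randomness. `ddim_step` is a name chosen for this example.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic update from noise level alpha_bar_t to alpha_bar_prev.

    x_prev = sqrt(alpha_bar_prev) * x0_hat + sqrt(1 - alpha_bar_prev) * eps_pred
    """
    a_t, b_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    a_p, b_p = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    out = []
    for x, e in zip(x_t, eps_pred):
        x0_hat = (x - b_t * e) / a_t          # implied clean estimate
        out.append(a_p * x0_hat + b_p * e)    # move to the lower noise level
    return out

# If eps_pred equals the true noise, the step lands exactly on the forward-process
# sample at the lower noise level: x0 = 2, eps = -0.5, from alpha_bar 0.36 to 0.81.
x_t = [math.sqrt(0.36) * 2.0 + math.sqrt(0.64) * (-0.5)]
x_prev = ddim_step(x_t, [-0.5], alpha_bar_t=0.36, alpha_bar_prev=0.81)
```

Because `alpha_bar_prev` need not belong to the adjacent timestep, the same function can jump across many timesteps at once, which is where the speed-up comes from.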

You can think of DDPM vs DDIM as trading:

Continuous-time perspective (SDE view) #

In the score-based modeling framework, the diffusion process is described by an SDE:

dx = f(x, t) dt + g(t) dw

where w is Brownian motion.

The reverse-time SDE has drift that involves the score:

dx = [ f(x, t) − g(t)² ∇_{x} log p_t(x) ] dt + g(t) dw̄

If you approximate the score with s_θ(x, t), you can numerically solve the reverse SDE to sample.
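To see the reverse SDE in action without training anything, the toy sketch below uses a case where the score is known exactly: 1-D Gaussian data x₀ ∼ 𝒩(M, S²) under the forward SDE dx = −½β x dt + √β dw with constant β. Everything here (the constants, the function names, the Euler-Maruyama discretization, the step count) is an assumption for illustration; in a real model `score` would be replaced by s_θ.

```python
import math
import random

BETA = 8.0          # hypothetical constant noise rate, so f(x, t) = -0.5*BETA*x, g(t)^2 = BETA
M, S = 2.0, 0.5     # toy data distribution: x0 ~ N(M, S^2)

def marginal(t):
    """Mean and variance of p_t under the forward SDE with Gaussian data."""
    decay = math.exp(-BETA * t)
    return M * math.sqrt(decay), S * S * decay + (1.0 - decay)

def score(x, t):
    """Exact score of the Gaussian marginal p_t (stands in for a learned s_theta)."""
    mean, var = marginal(t)
    return -(x - mean) / var

def sample_reverse_sde(rng, n_steps=400):
    """Euler-Maruyama on the reverse-time SDE, integrating from t = 1 down to t = 0."""
    dt = 1.0 / n_steps
    x = rng.gauss(0.0, 1.0)                      # p_1 is approximately N(0, 1)
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = -0.5 * BETA * x - BETA * score(x, t)   # f - g^2 * score
        # Time runs backward, so the drift enters with a minus sign.
        x = x - drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [sample_reverse_sde(rng) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)
```

The sample mean and variance should come out close to M = 2 and S² = 0.25, up to Monte Carlo and discretization error.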

You do not need to memorize this SDE form to use diffusion models, but it explains:

Starting point and endpoint #

If T is large enough and the schedule is designed properly, q(x_T) is close to a standard normal regardless of p_data (information destroyed). That’s why the model can start from pure noise.

Classifier-free guidance (briefly, since it’s common) #

In conditional generation (text-to-image, class-conditional), you train ε_θ(xₜ, t, c) with conditioning c, and randomly drop c during training (replacing it with a null condition ∅) so the same network also learns an unconditional prediction.

At sampling time, combine:

ε_guided = (1 + w) ε_θ(xₜ,t,c) − w ε_θ(xₜ,t, ∅)

where w ≥ 0 is guidance scale.

Intuition: push samples toward regions that satisfy the condition more strongly, at the cost of reduced diversity if w is too high.
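The combination rule is a single line of arithmetic. In the sketch below, `guided_eps` is a hypothetical helper, and the two input lists stand in for the network's conditional and unconditional outputs at the same (xₜ, t).

```python
def guided_eps(eps_cond, eps_uncond, w):
    """eps_guided = (1 + w) * eps_cond - w * eps_uncond, per dimension."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [0.5, -1.0]      # stand-in for eps_theta(x_t, t, c)
eps_uncond = [0.2, -0.4]    # stand-in for eps_theta(x_t, t, null)

no_guidance = guided_eps(eps_cond, eps_uncond, w=0.0)   # equals eps_cond
strong = guided_eps(eps_cond, eps_uncond, w=2.0)        # extrapolates past eps_cond
```

Rewriting the formula as ε_θ(xₜ,t,∅) + (1 + w)(ε_θ(xₜ,t,c) − ε_θ(xₜ,t,∅)) makes the extrapolation explicit: w > 0 moves the prediction further along the direction from the unconditional output to the conditional one.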

This is not strictly part of “diffusion basics,” but it’s a major reason diffusion works well in practice.

Application/Connection: How Diffusion Models Fit into the Generative Modeling Toolkit #

Diffusion models became dominant for high-fidelity generation because they combine stable training with flexible conditioning and strong likelihood-related foundations.

Comparing diffusion to VAEs and GANs #

| Model family | Core idea | Strengths | Weaknesses |
| --- | --- | --- | --- |
| VAE | Latent variable model trained with ELBO | Stable training, explicit encoder | Samples can be blurry; trade-off via KL |
| GAN | Adversarial game | Sharp samples, fast inference | Unstable training, mode collapse, hard likelihood |
| Diffusion | Learn to reverse noising | Very high quality, stable objective, flexible conditioning | Slow sampling (mitigated by fast samplers), compute-heavy |

Since you know VAEs: notice the philosophical similarity:

In both cases, “start from something Gaussian” is the trick. Diffusion differs in that the latent is not low-dimensional; it’s the same dimension as x.

Where score matching shows up operationally #

Even if you never explicitly compute ∇ log p, the score view guides:

Practical considerations in real systems #

  1. Architecture
  2. Data scaling and variance
  3. Speed
  4. Evaluation

Mental model to keep #

Diffusion models are iterative refinement. Each step is a small denoise move that, when composed many times, produces a complex global transformation from noise to data.

If you want one unifying picture:

Worked Examples (3) #

Compute q(**x**ₜ|**x**₀) and sample **x**ₜ in one step #

Let a 1D “data point” be x₀ = 2. Suppose a diffusion schedule gives ᾱₜ = 0.81 at some timestep t. Sample ε ∼ 𝒩(0,1); take ε = −0.5 for this worked example. Compute xₜ.

  1. Use the closed form:

    xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε

  2. Compute √(ᾱₜ):

    √(0.81) = 0.9

  3. Compute √(1 − ᾱₜ):

    1 − ᾱₜ = 1 − 0.81 = 0.19

    √(0.19) ≈ 0.43589

  4. Plug in values:

    xₜ = 0.9 · 2 + 0.43589 · (−0.5)

    = 1.8 − 0.217945

    = 1.582055

  5. Interpretation:

    • The clean signal contribution is 1.8.
    • The noise contribution is about −0.218.
    • At ᾱₜ = 0.81, the sample is still mostly signal (high SNR).

Insight: The ability to sample xₜ directly from x₀ (without simulating t steps) is what makes diffusion training efficient: you can train on arbitrary noise levels with one formula.

Recover \hat{**x**}₀ from ε_θ(**x**ₜ,t) (noise-prediction parameterization) #

Assume ᾱₜ = 0.36 and you observe a 2D noisy sample xₜ = (1.2, −0.3). A trained network predicts ε_θ(xₜ,t) = (0.5, −1.0). Compute \hat{x}₀.

  1. Use the reconstruction formula:

    \hat{x}₀ = ( xₜ − √(1 − ᾱₜ) ε_θ(xₜ,t) ) / √(ᾱₜ)

  2. Compute √(ᾱₜ):

    √(0.36) = 0.6

  3. Compute √(1 − ᾱₜ):

    1 − ᾱₜ = 0.64

    √(0.64) = 0.8

  4. Compute the noise term:

    √(1 − ᾱₜ) ε_θ = 0.8 · (0.5, −1.0)

    = (0.4, −0.8)

  5. Subtract from xₜ:

    xₜ − √(1 − ᾱₜ) ε_θ = (1.2, −0.3) − (0.4, −0.8)

    = (0.8, 0.5)

  6. Divide by √(ᾱₜ):

    \hat{x}₀ = (0.8, 0.5) / 0.6

    = (1.333…, 0.833…)

Insight: Noise prediction is not just a training trick: it provides a direct map from a noisy point back to an estimate of the clean data, which then defines the reverse-process mean updates.

One reverse DDPM-style update (conceptual numeric step) #

Consider 1D for simplicity. Suppose at timestep t you have xₜ = 0.7, αₜ = 0.9 (so βₜ = 0.1), ᾱₜ = 0.5, and the model predicts ε_θ(xₜ,t) = 0.2. Compute the reverse mean μ_θ(xₜ,t).

  1. Use the standard mean (noise-prediction form):

    μ_θ(xₜ,t) = 1/√(αₜ) · ( xₜ − (βₜ/√(1 − ᾱₜ)) ε_θ(xₜ,t) )

  2. Compute √(αₜ):

    √(0.9) ≈ 0.948683

    So 1/√(αₜ) ≈ 1.054093

  3. Compute √(1 − ᾱₜ):

    1 − ᾱₜ = 0.5

    √(0.5) ≈ 0.707107

  4. Compute the scaled noise subtraction:

    (βₜ/√(1 − ᾱₜ)) ε_θ = (0.1 / 0.707107) · 0.2

    ≈ 0.141421 · 0.2

    ≈ 0.028284

  5. Compute inside parentheses:

    xₜ − (...) = 0.7 − 0.028284 = 0.671716

  6. Multiply by 1/√(αₜ):

    μ_θ ≈ 1.054093 · 0.671716 ≈ 0.7081

  7. Interpretation:

    • The reverse mean (≈ 0.708) is slightly larger than xₜ = 0.7: here the 1/√(αₜ) rescaling outweighs the small noise subtraction.
    • Over many steps, these small corrections accumulate into a large denoising trajectory.

Insight: Reverse diffusion is many small, structured moves. Each move uses ε_θ to decide how to shift the sample toward higher-probability regions at that noise level.

Key Takeaways #

Common Mistakes #

Practice #

easy

You have ᾱₜ = 0.64 and a clean vector x₀ = (3, 0). You sample ε = (−1, 2). Compute xₜ using xₜ = √(ᾱₜ)x₀ + √(1 − ᾱₜ)ε.

Hint: Compute √(0.64) and √(0.36) first, then scale and add componentwise.

Solution:

√(ᾱₜ) = √(0.64) = 0.8 and √(1 − ᾱₜ) = √(0.36) = 0.6.

xₜ = 0.8(3,0) + 0.6(−1,2)

= (2.4, 0) + (−0.6, 1.2)

= (1.8, 1.2).

medium

Derive the \hat{x}₀ reconstruction formula from xₜ = √(ᾱₜ)x₀ + √(1 − ᾱₜ)ε when you replace ε by ε_θ(xₜ,t).

Hint: Rearrange the equation to isolate x₀; treat √(ᾱₜ) as a scalar.

Solution:

Start from:

xₜ = √(ᾱₜ)x₀ + √(1 − ᾱₜ)ε

Subtract the noise term:

xₜ − √(1 − ᾱₜ)ε = √(ᾱₜ)x₀

Divide by √(ᾱₜ):

x₀ = ( xₜ − √(1 − ᾱₜ)ε ) / √(ᾱₜ)

Replace ε with the model prediction ε_θ(xₜ,t):

\hat{x}₀(xₜ,t) = ( xₜ − √(1 − ᾱₜ) ε_θ(xₜ,t) ) / √(ᾱₜ).

hard

Show that if ᾱₜ → 0, then q(xₜ|x₀) approaches 𝒩(0, I) regardless of x₀. Use the mean and covariance of q(xₜ|x₀).

Hint: Look at the closed form q(xₜ|x₀) = 𝒩(√(ᾱₜ)x₀, (1 − ᾱₜ)I) and take limits.

Solution:

We have:

q(xₜ|x₀) = 𝒩( μₜ, Σₜ )

with

μₜ = √(ᾱₜ)x₀

Σₜ = (1 − ᾱₜ) I

Take ᾱₜ → 0:

μₜ → √0 · x₀ = 0

Σₜ → (1 − 0) I = I

Therefore q(xₜ|x₀) → 𝒩(0, I), independent of x₀. This formalizes the idea that the forward process eventually destroys all information about the original data.

Connections #

Variational Autoencoders

Score Matching

Stochastic Differential Equations for ML

Markov Chains

U-Net Architectures

Classifier-Free Guidance

Quality: A (4.2/5)
