Variational Autoencoders


Variational Autoencoders #

Machine Learning · Difficulty: ★★★★★ · Depth: 11 · Unlocks: 1

Generative models with latent variables. ELBO, reparameterization.


Core Concepts #

Key Symbols & Notation #

q_phi(z|x) - the variational/encoder distribution (parameters phi)

Essential Relationships #

Prerequisites (3) #

Bayesian Inference (5 atoms) · Neural Networks (6 atoms) · KL Divergence (6 atoms)

Unlocks (1) #

Diffusion Models (lvl 5)

Advanced Learning Details

Graph Position #

Depth Cost: 169

Fan-Out (ROI): 1

Bottleneck Score: 1

Chain Length: 11

Cognitive Load #

Atomic Elements: 6

Total Elements: 37

Percentile Level: L2

Atomic Level: L4

All Concepts (12) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Variational Autoencoders (VAEs) are the bridge between probabilistic latent-variable modeling (Bayes, priors, posteriors) and deep learning (powerful function approximation). They give you a principled way to learn both a generator and an inference procedure—by optimizing a single tractable objective: the ELBO.

TL;DR:

A VAE posits a latent variable z that generates data x via a decoder p_θ(x|z) and a prior p(z). Because the true posterior p_θ(z|x) is intractable, we approximate it with an encoder q_φ(z|x). Training maximizes the ELBO: 𝔼_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)). The reparameterization trick (z = μ_φ(x) + σ_φ(x) ⊙ ε, ε ∼ 𝒩(0, I)) allows backpropagation through sampling.

What Is a Variational Autoencoder? #

Why VAEs exist (motivation) #

In many problems we want a model that can generate realistic data and also explain data in terms of hidden factors.

A standard (deterministic) autoencoder learns an encoder f(x) → z and decoder g(z) → x̂, but it does not define a probability distribution over data. You can reconstruct, but “sampling” z and decoding often produces arbitrary garbage because the latent space has no probabilistic structure.

A VAE fixes this by making the model explicitly probabilistic. It’s an instance of a latent-variable generative model: first draw a latent z ∼ p(z) from the prior, then draw data x ∼ p_θ(x|z) from the decoder.

This defines a joint distribution:

p_θ(x, z) = p_θ(x|z) p(z)

If we can learn θ well, then we can generate new data by sampling z ∼ p(z) and then x ∼ p_θ(x|z).
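As a minimal sketch of this two-step generative process, the snippet below uses a hypothetical 1-D linear-Gaussian decoder p(x|z) = 𝒩(a·z + b, s²). The constants `A`, `B`, `S2` and the helper names are illustrative assumptions, not part of any trained VAE.

```python
import math
import random

def log_normal(x, mean, var):
    """Log density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical toy decoder p(x|z) = N(A*z + B, S2); values are illustrative only.
A, B, S2 = 2.0, 0.5, 0.1

def sample_joint(rng):
    """Ancestral sampling: z ~ p(z) = N(0, 1), then x ~ p(x|z)."""
    z = rng.gauss(0.0, 1.0)
    x = rng.gauss(A * z + B, math.sqrt(S2))
    return x, z

def log_joint(x, z):
    """log p(x, z) = log p(x|z) + log p(z)."""
    return log_normal(x, A * z + B, S2) + log_normal(z, 0.0, 1.0)

rng = random.Random(0)
samples = [sample_joint(rng) for _ in range(5)]
for x, z in samples:
    print(f"z={z:+.3f}  x={x:+.3f}  log p(x,z)={log_joint(x, z):.3f}")
```

In a real VAE the line computing the decoder mean would be a neural network; the factorization of the joint stays the same.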

The core obstacle: posterior inference #

Given an observed x, the Bayesian posterior over latents is

p_θ(z|x) = p_θ(x|z) p(z) / p_θ(x)

where the marginal likelihood (evidence) is

p_θ(x) = ∫ p_θ(x|z) p(z) dz

In deep models, that integral is typically intractable.

But to learn the model, we’d like to maximize log p_θ(x) over θ for the dataset. And to do inference (encode x), we want p_θ(z|x). Both are blocked by the same intractable evidence integral.
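To make the evidence integral concrete, here is a rough sketch that estimates p(x) = 𝔼_{p(z)}[p(x|z)] by sampling from the prior. It assumes a toy linear-Gaussian model (z ∼ 𝒩(0, 1), x|z ∼ 𝒩(a·z, s²)) chosen only because its evidence has a closed form to compare against; with a neural decoder no such closed form exists, and prior sampling becomes very inefficient in high dimensions.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

A, S2 = 2.0, 0.5  # hypothetical decoder slope and noise variance

def evidence_mc(x, num_samples, rng):
    """Naive estimate of p(x) = E_{p(z)}[p(x|z)] using samples from the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, A * z, S2)
    return total / num_samples

def evidence_exact(x):
    """Closed form, available only because this toy model is linear-Gaussian."""
    return normal_pdf(x, 0.0, A * A + S2)

rng = random.Random(0)
x = 1.5
estimate = evidence_mc(x, 20000, rng)
exact = evidence_exact(x)
print(f"Monte Carlo: {estimate:.4f}   exact: {exact:.4f}")
```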

The VAE idea in one sentence #

Introduce a tractable approximation q_φ(z|x) (the variational posterior / encoder) and optimize a lower bound on log p_θ(x) that is differentiable and scalable.

What makes it an “autoencoder”? #

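The VAE has two neural networks: an encoder that maps x to the parameters of q_φ(z|x), and a decoder that maps z to the parameters of p_θ(x|z).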

Unlike a deterministic autoencoder, the encoder outputs a distribution (often Gaussian) and the decoder defines a likelihood (often Gaussian for real-valued data, Bernoulli for binary pixels, categorical for discrete tokens, etc.).

Typical choice of distributions #

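A common (and very useful) baseline is a standard normal prior p(z) = 𝒩(0, I), a diagonal Gaussian encoder q_φ(z|x) = 𝒩(μ_φ(x), diag(σ²_φ(x))), and a Bernoulli or Gaussian decoder p_θ(x|z).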

This is not required, but it’s a common starting point because (1) sampling is easy, (2) KL terms often have closed form, and (3) reparameterization is straightforward.

Mental model #

Think of training a VAE as simultaneously:

  1. Learning a generator that can map simple noise z into data space.
  2. Learning an inference network that can map data x back to a distribution over plausible z.
  3. Ensuring these two agree via a variational objective.

Core Mechanic 1: The ELBO (Evidence Lower Bound) #

Why we need a bound at all #

The quantity we would like to maximize for each datapoint x is log p_θ(x). But:

log p_θ(x) = log ∫ p_θ(x|z) p(z) dz

The log of an integral of a neural-network-defined density is generally not tractable.

Variational inference gives a workaround: introduce a distribution q_φ(z|x) that we can sample from and evaluate.

Deriving the ELBO (showing the work) #

Start with the log evidence and multiply inside by q_φ(z|x) / q_φ(z|x):

log p_θ(x)

= log ∫ p_θ(x, z) dz

= log ∫ q_φ(z|x) · [p_θ(x, z) / q_φ(z|x)] dz

Now apply Jensen’s inequality to log 𝔼[·] (log is concave):

log ∫ q_φ(z|x) · [p_θ(x, z) / q_φ(z|x)] dz

= log 𝔼_{q_φ(z|x)} [ p_θ(x, z) / q_φ(z|x) ]

≥ 𝔼_{q_φ(z|x)} [ log p_θ(x, z) − log q_φ(z|x) ]

Define the ELBO:

ELBO(θ, φ; x) = 𝔼_{q_φ(z|x)} [ log p_θ(x, z) − log q_φ(z|x) ]

So we have the bound:

log p_θ(x) ≥ ELBO(θ, φ; x)
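A quick numerical sanity check of the bound, sketched on an assumed toy model (z ∼ 𝒩(0, 1), x|z ∼ 𝒩(a·z, s²)) where log p(x) is known exactly. The two choices of q below are arbitrary Gaussians picked for illustration; the ELBO is estimated by Monte Carlo directly from its definition.

```python
import math
import random

def log_normal(x, mean, var):
    """Log density of a 1-D Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

A, S2 = 2.0, 0.5  # hypothetical decoder p(x|z) = N(A*z, S2)

def elbo_mc(x, q_mean, q_var, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z|x)] with q = N(q_mean, q_var)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal(x, A * z, S2) + log_normal(z, 0.0, 1.0)
        total += log_joint - log_normal(z, q_mean, q_var)
    return total / num_samples

x = 1.5
log_evidence = log_normal(x, 0.0, A * A + S2)  # exact for this toy model
rng = random.Random(0)
elbo_rough = elbo_mc(x, q_mean=0.0, q_var=1.0, num_samples=20000, rng=rng)
elbo_better = elbo_mc(x, q_mean=0.6, q_var=0.15, num_samples=20000, rng=rng)
print(f"log p(x)      = {log_evidence:.3f}")
print(f"ELBO (rough q)  = {elbo_rough:.3f}")
print(f"ELBO (better q) = {elbo_better:.3f}")
```

Both estimates should sit below log p(x), with the second much closer because its q is nearer to the true posterior of this toy model.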

Unpacking the ELBO into “reconstruction − regularization” #

Use p_θ(x, z) = p_θ(x|z) p(z):

ELBO

= 𝔼_{q_φ(z|x)}[ log p_θ(x|z) + log p(z) − log q_φ(z|x) ]

Group the last two terms as a KL divergence:

KL(q_φ(z|x) ‖ p(z))

= 𝔼_{q_φ(z|x)}[ log q_φ(z|x) − log p(z) ]

So:

ELBO

= 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) ‖ p(z))

This is the form you implement.

Term 1: expected log-likelihood (reconstruction) #

𝔼_{q_φ(z|x)}[ log p_θ(x|z) ]

Term 2: KL to the prior (regularization) #

KL(q_φ(z|x) ‖ p(z))

The tightness of the bound #

A key identity connects ELBO and the true posterior:

log p_θ(x) = ELBO(θ, φ; x) + KL(q_φ(z|x) ‖ p_θ(z|x))

Derivation sketch (showing the work):

KL(q ‖ p_θ(z|x))

= 𝔼_q[ log q(z|x) − log p_θ(z|x) ]

= 𝔼_q[ log q(z|x) − log (p_θ(x, z) / p_θ(x)) ]

= 𝔼_q[ log q(z|x) − log p_θ(x, z) + log p_θ(x) ]

= log p_θ(x) − 𝔼_q[ log p_θ(x, z) − log q(z|x) ]

= log p_θ(x) − ELBO

Rearrange:

log p_θ(x) = ELBO + KL(q ‖ p_θ(z|x))

Because KL ≥ 0, ELBO is a lower bound. It becomes tight when q_φ(z|x) matches the true posterior.
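The identity can be checked exactly on a toy model where everything is Gaussian. The sketch below assumes z ∼ 𝒩(0, 1) and x|z ∼ 𝒩(a·z, s²), for which the true posterior and the ELBO both have closed forms; the constants and function names are illustrative assumptions.

```python
import math

def log_normal(x, mean, var):
    """Log density of a 1-D Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for 1-D Gaussians."""
    return 0.5 * (v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0 + math.log(v2 / v1))

A, S2 = 2.0, 0.5  # hypothetical decoder p(x|z) = N(A*z, S2)
x = 1.5

def elbo_exact(x, q_mean, q_var):
    """Closed-form ELBO for q = N(q_mean, q_var); possible because A*z is linear in z."""
    rec = (-0.5 * math.log(2 * math.pi * S2)
           - ((x - A * q_mean) ** 2 + A * A * q_var) / (2 * S2))
    return rec - kl_gauss(q_mean, q_var, 0.0, 1.0)

log_evidence = log_normal(x, 0.0, A * A + S2)
post_mean = A * x / (A * A + S2)   # true posterior mean for this toy model
post_var = S2 / (A * A + S2)       # true posterior variance

for q_mean, q_var in [(0.0, 1.0), (0.6, 0.15), (post_mean, post_var)]:
    elbo = elbo_exact(x, q_mean, q_var)
    gap = kl_gauss(q_mean, q_var, post_mean, post_var)
    print(f"q=N({q_mean:.3f}, {q_var:.3f})  ELBO={elbo:.4f}  "
          f"KL(q||post)={gap:.4f}  sum={elbo + gap:.4f}  log p(x)={log_evidence:.4f}")
```

The last row uses the exact posterior as q, so the gap is zero and the ELBO equals log p(x).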

Dataset objective #

For a dataset {xᵢ}ᵢ₌₁ᴺ, maximize:

∑ᵢ ELBO(θ, φ; xᵢ)

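This trains both sets of parameters jointly: θ (the decoder / generative model) and φ (the encoder / inference network).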

A practical view: what gradients do we need? #

We need gradients of

𝔼_{q_φ(z|x)}[ log p_θ(x|z) ]

with respect to both θ and φ, plus gradients of the KL term.

That’s exactly why the reparameterization trick matters.

Core Mechanic 2: The Reparameterization Trick #

Why reparameterization is needed #

Suppose we approximate the expectation with Monte Carlo:

𝔼_{q_φ(z|x)}[ f(z) ] ≈ (1/L) ∑_{ℓ=1}^L f(z^{(ℓ)}), where z^{(ℓ)} ∼ q_φ(z|x)

If z^{(ℓ)} is produced by a sampling step that depends on φ, naive backprop gets stuck: the computational graph has a stochastic node.

One option is the score-function (REINFORCE) estimator:

∇_φ 𝔼_{q_φ}[f(z)] = 𝔼_{q_φ}[ f(z) ∇_φ log q_φ(z) ]

It’s unbiased but typically high-variance.
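To get a feel for that variance, here is a small sketch of the score-function estimator on a stand-in problem: f(z) = z² with q = 𝒩(μ, σ²), where the true gradient with respect to μ is 2μ. The choice of f and the sample count are assumptions made purely for illustration.

```python
import random
import statistics

def score_function_terms(mu, sigma, num_samples, rng):
    """Per-sample terms f(z) * d/dmu log q(z), with f(z) = z**2 and q = N(mu, sigma**2)."""
    terms = []
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)
        score = (z - mu) / sigma ** 2   # d/dmu log N(z; mu, sigma^2)
        terms.append(z ** 2 * score)
    return terms

rng = random.Random(0)
mu, sigma = 1.0, 1.0
terms = score_function_terms(mu, sigma, 50000, rng)
estimate = statistics.fmean(terms)
spread = statistics.pstdev(terms)
print(f"estimate = {estimate:.3f} (true gradient = {2 * mu:.1f})")
print(f"per-sample standard deviation = {spread:.2f}")
```

The average lands near the true gradient, but the per-sample spread is several times larger than the gradient itself, which is why many samples are needed.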

Reparameterization gives a lower-variance, pathwise gradient by rewriting the random variable as a deterministic function of φ and external noise.

The key idea #

If you can write

z = g_φ(ε, x), ε ∼ p(ε)

where p(ε) does not depend on φ, then

𝔼_{q_φ(z|x)}[ f(z) ] = 𝔼_{p(ε)}[ f(g_φ(ε, x)) ]

Now the randomness is in ε, not in the parameters. Gradients can flow through g_φ.

Gaussian case (most common) #

Let

q_φ(z|x) = 𝒩(μ_φ(x), diag(σ²_φ(x)))

Sample ε ∼ 𝒩(0, I) and define:

z = μ_φ(x) + σ_φ(x) ⊙ ε

Here ⊙ is elementwise multiplication.

This produces z distributed exactly as q_φ(z|x). And μ_φ, σ_φ are outputs of a neural net.
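A short empirical check that the transform produces the intended distribution. The values of μ and σ below are arbitrary stand-ins for encoder outputs, and `reparameterize` is a hypothetical helper name.

```python
import random
import statistics

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps (elementwise), with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 2.0]
draws = [reparameterize(mu, sigma, rng) for _ in range(20000)]

for j in range(len(mu)):
    column = [z[j] for z in draws]
    print(f"dim {j}: sample mean = {statistics.fmean(column):+.3f}, "
          f"sample std = {statistics.stdev(column):.3f}")
```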

Backprop through the expectation #

Consider the reconstruction term for a single x:

ℒ_rec(θ, φ; x) = 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ]

Using reparameterization:

ℒ_rec = 𝔼_{ε∼𝒩(0,I)}[ log p_θ(x| μ_φ(x) + σ_φ(x) ⊙ ε ) ]

Approximate with L samples:

ℒ_rec ≈ (1/L) ∑_{ℓ=1}^L log p_θ(x| μ_φ(x) + σ_φ(x) ⊙ ε^{(ℓ)})

Now ∇_φ is just ordinary backprop through μ_φ and σ_φ.
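The sketch below writes out that backprop step by hand for a scalar case. It assumes f(z) = z² as a stand-in for log p_θ(x|z) and treats μ and σ directly as the parameters; since 𝔼[z²] = μ² + σ², the true gradients are 2μ and 2σ.

```python
import random

def pathwise_grads(mu, sigma, num_samples, rng):
    """Monte Carlo pathwise gradients of E[f(z)], f(z) = z**2, z = mu + sigma * eps."""
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        df_dz = 2.0 * z          # derivative of the stand-in f
        g_mu += df_dz * 1.0      # dz/dmu = 1
        g_sigma += df_dz * eps   # dz/dsigma = eps
    return g_mu / num_samples, g_sigma / num_samples

rng = random.Random(0)
mu, sigma = 1.0, 0.5
g_mu, g_sigma = pathwise_grads(mu, sigma, 20000, rng)
print(f"d/dmu    estimate = {g_mu:.3f}  (true {2 * mu:.1f})")
print(f"d/dsigma estimate = {g_sigma:.3f}  (true {2 * sigma:.1f})")
```

In a framework with automatic differentiation, the two hand-written chain-rule lines are what the backward pass computes for you.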

The KL term and closed form #

With a standard normal prior p(z) = 𝒩(0, I) and diagonal Gaussian q_φ, the KL has a closed form.

Let q = 𝒩(μ, diag(σ²)) and p = 𝒩(0, I). Then:

KL(q ‖ p) = (1/2) ∑ⱼ ( μⱼ² + σⱼ² − log σⱼ² − 1 )

This is extremely useful: it’s exact, differentiable, and cheap.
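As a minimal sketch, the formula translates into a few lines of code. The function name and the example inputs are illustrative choices.

```python
import math

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) = 0.5 * sum_j (mu_j^2 + var_j - log var_j - 1)."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

kl_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])    # q equals the prior
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])  # one mean moved away from 0
kl_narrow = kl_to_standard_normal([0.0, 0.0], [0.1, 1.0])   # one over-confident dimension
print(kl_prior, kl_shifted, kl_narrow)
```

The KL is zero only when q matches the prior; moving a mean or shrinking a variance both cost something.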

Show the structure (intuition) #

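Each latent dimension j pays a penalty of (1/2)(μⱼ² + σⱼ² − log σⱼ² − 1): the μⱼ² term grows as the mean moves away from 0, and σⱼ² − log σⱼ² − 1 is zero at σⱼ² = 1 and positive otherwise.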

So the encoder is encouraged to produce a distribution “not too far” from 𝒩(0,1).

Practical parameterization: log-variance #

To keep σ positive, we usually output log σ² (call it s) and compute:

σ² = exp(s), σ = exp(0.5 s)

This avoids invalid (negative) variances and tends to be numerically stable.
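A small sketch of this parameterization, assuming the encoder's raw outputs are interpreted as s = log σ². The helper names are hypothetical; the KL is the same closed form as before, rewritten in terms of s.

```python
import math

def sigma_from_logvar(logvar):
    """sigma = exp(0.5 * s) is positive for any real s = log sigma^2."""
    return [math.exp(0.5 * s) for s in logvar]

def kl_from_logvar(mu, logvar):
    """KL( N(mu, diag(exp(s))) || N(0, I) ) = 0.5 * sum_j (mu_j^2 + exp(s_j) - s_j - 1)."""
    return 0.5 * sum(m * m + math.exp(s) - s - 1.0 for m, s in zip(mu, logvar))

raw_outputs = [-3.0, 0.0, 2.5]   # any real numbers are valid log-variances
sigmas = sigma_from_logvar(raw_outputs)
print(sigmas)
print(kl_from_logvar([0.0, 0.0, 0.0], raw_outputs))
```

Note that the log σ² term in the KL becomes just s, so no logarithm of a possibly tiny variance is ever taken.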

Summary table: what you compute in a basic VAE #

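| Piece | Object | Typical choice | Role |
|---|---|---|---|
| Prior | p(z) | 𝒩(0, I) | Defines “sampling space” |
| Encoder | q_φ(z\|x) | 𝒩(μ_φ(x), diag(σ²_φ(x))) | Approximates the posterior p_θ(z\|x) |
| Decoder | p_θ(x\|z) | Bernoulli or Gaussian | Likelihood of x given z |
| Objective | ELBO | 𝔼_q[log p_θ(x\|z)] − KL(q ‖ p) | Lower bound on log p_θ(x), maximized in training |
| Trick | z = μ + σ ⊙ ε | ε ∼ 𝒩(0,I) | Low-variance gradients |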
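Putting the pieces of the table together, the following is a rough, dependency-free sketch of a single-sample negative-ELBO computation for one binary input. It assumes single-layer linear encoder and decoder maps with small random weights and a Bernoulli decoder; the `params` layout and all helper names are hypothetical, and a real implementation would use a tensor library with automatic differentiation.

```python
import math
import random

def linear(weights, bias, inputs):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, inputs)) + b for row, b in zip(weights, bias)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def negative_elbo(x, params, rng):
    """Single-sample estimate of -ELBO for one binary vector x (Bernoulli decoder)."""
    # Encoder: x -> (mu, log sigma^2) of the diagonal Gaussian q(z|x).
    mu = linear(params["enc_mu_w"], params["enc_mu_b"], x)
    logvar = linear(params["enc_lv_w"], params["enc_lv_b"], x)
    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I).
    z = [m + math.exp(0.5 * s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, logvar)]
    # Decoder: z -> Bernoulli mean for each input dimension.
    probs = [sigmoid(a) for a in linear(params["dec_w"], params["dec_b"], z)]
    # Reconstruction term: log p(x|z) for a factorized Bernoulli likelihood.
    log_lik = sum(xi * math.log(p) + (1 - xi) * math.log(1 - p) for xi, p in zip(x, probs))
    # Closed-form KL( q(z|x) || N(0, I) ).
    kl = 0.5 * sum(m * m + math.exp(s) - s - 1.0 for m, s in zip(mu, logvar))
    return -(log_lik - kl), log_lik, kl

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
x_dim, z_dim = 4, 2
params = {
    "enc_mu_w": random_matrix(z_dim, x_dim, rng), "enc_mu_b": [0.0] * z_dim,
    "enc_lv_w": random_matrix(z_dim, x_dim, rng), "enc_lv_b": [0.0] * z_dim,
    "dec_w": random_matrix(x_dim, z_dim, rng), "dec_b": [0.0] * x_dim,
}
x = [1.0, 0.0, 1.0, 1.0]
loss, log_lik, kl = negative_elbo(x, params, rng)
print(f"loss = {loss:.4f}  (log-likelihood = {log_lik:.4f}, KL = {kl:.4f})")
```

Training would minimize this loss averaged over a dataset, with gradients flowing to both the encoder and decoder weights.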

A note on discrete latents #

Reparameterization is straightforward for continuous distributions like Gaussians. For discrete latents, you need alternatives (Gumbel-Softmax / Concrete distributions, score-function estimators, or other variational relaxations). Many VAE lessons stop at Gaussians because they cover the most common and useful case.

Application/Connection: How VAEs Are Used (and What to Watch For) #

Generation #

After training, generation is simple:

  1. Sample z ∼ p(z) = 𝒩(0, I)
  2. Sample x ∼ p_θ(x|z) or take the decoder mean as a “typical” sample

Because the KL term kept q_φ(z|x) near p(z), the decoder has been trained on z values that look like prior samples.

Representation learning #

The latent z can be used as a learned feature representation.

Caution: VAEs trade off reconstruction fidelity vs. latent regularity. If the KL term dominates, representations can become less informative.

Conditional VAEs (cVAEs) #

If you want generation conditioned on labels or attributes y, condition both networks on y: the encoder becomes q_φ(z|x, y) and the decoder becomes p_θ(x|z, y).

This lets you generate samples with a chosen condition.

Anomaly detection #

A common heuristic: points with low ELBO (or high reconstruction error) are considered anomalous. This works best when the model class fits normal data well.
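One way this heuristic might look in code is sketched below. It assumes a toy 1-D "trained" model with a closed-form ELBO and a hypothetical `encoder`, and it picks the threshold as the 99th percentile of scores on normal data; both the model and the percentile are illustrative assumptions, not a recommended setting.

```python
import math
import random

A, S2 = 2.0, 0.5  # hypothetical "trained" decoder p(x|z) = N(A*z, S2)

def encoder(x):
    """Hypothetical encoder returning the mean and variance of q(z|x)."""
    return A * x / (A * A + S2), S2 / (A * A + S2)

def elbo(x):
    """Closed-form ELBO for this toy model: reconstruction term minus KL to N(0, 1)."""
    m, v = encoder(x)
    rec = -0.5 * math.log(2 * math.pi * S2) - ((x - A * m) ** 2 + A * A * v) / (2 * S2)
    kl = 0.5 * (m * m + v - math.log(v) - 1.0)
    return rec - kl

rng = random.Random(0)
# "Normal" data drawn from the model itself: z ~ N(0, 1), then x ~ N(A*z, S2).
normal_data = [rng.gauss(A * rng.gauss(0.0, 1.0), math.sqrt(S2)) for _ in range(2000)]
scores = sorted(-elbo(x) for x in normal_data)   # anomaly score = negative ELBO
threshold = scores[int(0.99 * len(scores))]      # flag roughly the top 1% of scores

def is_anomaly(x):
    return -elbo(x) > threshold

print(is_anomaly(0.3), is_anomaly(12.0))
```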

The “posterior collapse” problem (important failure mode) #

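In powerful decoders (e.g., autoregressive text decoders), the model may learn to ignore z entirely: the decoder models x on its own, q_φ(z|x) collapses to the prior p(z), and the KL term goes to ≈ 0.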

Then z carries little information about x.

Common mitigations #

| Technique | What it changes | Why it helps |
|---|---|---|
| KL annealing | Slowly increase KL weight from 0 → 1 | Gives encoder time to learn informative latents |
| β-VAE | Use β · KL with β ≠ 1 | β < 1 encourages information; β > 1 encourages disentanglement |
| Free bits | KL term has a per-dim minimum | Prevents KL from collapsing too aggressively |
| Weaker decoder | Reduce decoder capacity | Forces use of z |
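The first three rows of the table can be sketched as small modifications to the loss. The linear warmup schedule, the function names, and the default values below are assumptions for illustration; other schedules and free-bits variants exist.

```python
def kl_weight(step, warmup_steps):
    """Linear KL annealing: weight rises from 0 to 1 over warmup_steps, then stays at 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, min_nats):
    """Free bits: each latent dimension contributes at least min_nats to the penalty."""
    return sum(max(kl_j, min_nats) for kl_j in kl_per_dim)

def annealed_loss(log_lik, kl_per_dim, step, warmup_steps, beta=1.0, min_nats=0.0):
    """Negative ELBO with annealing, beta weighting, and free bits combined."""
    weight = beta * kl_weight(step, warmup_steps)
    return -log_lik + weight * free_bits_kl(kl_per_dim, min_nats)

for step in (0, 50, 100, 500):
    print(step, kl_weight(step, warmup_steps=100))
print(free_bits_kl([0.01, 0.8], min_nats=0.1))
print(annealed_loss(-10.0, [0.5, 0.5], step=50, warmup_steps=100))
```

With free bits, a dimension whose KL is already below the floor gets no gradient pushing it further toward the prior, which is the intended effect.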

β-VAE: a small change with big effects #

Objective:

ELBO_β = 𝔼_q[log p_θ(x|z)] − β KL(q ‖ p)

Importance-weighted autoencoders (IWAE) #

In practice the ELBO is estimated with one (or a few) samples, and it is only a lower bound. IWAE tightens the bound by averaging K importance weights inside the log:

log p_θ(x) ≥ 𝔼_{z₁:K ∼ q}[ log ( (1/K) ∑ₖ p_θ(x, zₖ) / q(zₖ|x) ) ]

This can improve generative modeling but changes optimization dynamics.
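Here is a rough numerical illustration of the tightening, again on an assumed toy model (z ∼ 𝒩(0, 1), x|z ∼ 𝒩(a·z, s²)) where log p(x) is known. The proposal q is deliberately poor (it equals the prior), and the sample counts are arbitrary.

```python
import math
import random

def log_normal(x, mean, var):
    """Log density of a 1-D Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

A, S2 = 2.0, 0.5  # hypothetical decoder p(x|z) = N(A*z, S2)

def log_weight(x, z, q_mean, q_var):
    """log [ p(x, z) / q(z|x) ] for one sample z."""
    return (log_normal(x, A * z, S2) + log_normal(z, 0.0, 1.0)
            - log_normal(z, q_mean, q_var))

def log_mean_exp(values):
    """Numerically stable log( mean( exp(values) ) )."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def iwae_bound(x, q_mean, q_var, k, num_outer, rng):
    """Average of log-mean-exp of importance weights over num_outer groups of k samples."""
    total = 0.0
    for _ in range(num_outer):
        zs = [rng.gauss(q_mean, math.sqrt(q_var)) for _ in range(k)]
        total += log_mean_exp([log_weight(x, z, q_mean, q_var) for z in zs])
    return total / num_outer

x = 1.5
log_evidence = log_normal(x, 0.0, A * A + S2)
rng = random.Random(0)
bound_k1 = iwae_bound(x, 0.0, 1.0, k=1, num_outer=4000, rng=rng)    # ordinary ELBO
bound_k50 = iwae_bound(x, 0.0, 1.0, k=50, num_outer=4000, rng=rng)
print(f"K=1: {bound_k1:.3f}   K=50: {bound_k50:.3f}   log p(x): {log_evidence:.3f}")
```

With K = 1 the estimate reduces to the usual ELBO; with larger K it moves up toward log p(x) even though q is unchanged.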

Connection to diffusion models (why this node unlocks them) #

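Diffusion models also involve latent variables, a variational bound on the likelihood, and Gaussian noise that is sampled in a reparameterized way.

Conceptually, VAEs train a generator with latent variables via a variational bound; diffusion trains a generator via denoising/score objectives. Understanding the ELBO and the reparameterization trick makes diffusion objectives feel much less mysterious.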

Worked Examples (3) #

Derive the ELBO decomposition into reconstruction − KL #

Given a latent-variable model p_θ(x, z) = p_θ(x|z) p(z) and an approximate posterior q_φ(z|x), show that ELBO = 𝔼_q[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)).

  1. Start from the ELBO definition:

    ELBO = 𝔼_{q_φ(z|x)}[ log p_θ(x, z) − log q_φ(z|x) ].

  2. Substitute the joint factorization:

    log p_θ(x, z) = log p_θ(x|z) + log p(z).

  3. Plug in and separate expectations:

    ELBO = 𝔼_q[ log p_θ(x|z) + log p(z) − log q(z|x) ]

    = 𝔼_q[ log p_θ(x|z) ] + 𝔼_q[ log p(z) − log q(z|x) ].

  4. Recognize the KL divergence:

    KL(q(z|x) ‖ p(z)) = 𝔼_q[ log q(z|x) − log p(z) ].

    Therefore:

    𝔼_q[ log p(z) − log q(z|x) ] = − KL(q(z|x) ‖ p(z)).

  5. Conclude:

    ELBO = 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) ‖ p(z)).

Insight: The ELBO cleanly separates “fit the data” (expected log-likelihood) from “keep latents well-behaved for sampling” (KL to the prior). This is the central tradeoff in VAEs.

Compute KL(q ‖ p) for diagonal Gaussians (standard VAE case) #

Let q(z) = 𝒩(μ, diag(σ²)) and p(z) = 𝒩(0, I). Derive KL(q ‖ p) = (1/2) ∑ⱼ ( μⱼ² + σⱼ² − log σⱼ² − 1 ).

  1. Write the log densities for d dimensions.

    For p:

    log p(z) = −(1/2) ∑ⱼ ( zⱼ² + log 2π ).

    For q:

    log q(z) = −(1/2) ∑ⱼ ( (zⱼ−μⱼ)²/σⱼ² + log σⱼ² + log 2π ).

  2. Start from KL definition:

    KL(q ‖ p) = 𝔼_q[ log q(z) − log p(z) ].

    The log 2π constants cancel.

  3. Compute the difference inside the expectation:

    log q − log p

    = −(1/2) ∑ⱼ [ (zⱼ−μⱼ)²/σⱼ² + log σⱼ² − zⱼ² ].

  4. Take expectation under q. Use facts:

    If zⱼ ∼ 𝒩(μⱼ, σⱼ²), then

    𝔼[(zⱼ−μⱼ)²] = σⱼ²,

    𝔼[zⱼ²] = Var(zⱼ) + (𝔼[zⱼ])² = σⱼ² + μⱼ².

  5. Substitute expectations:

    𝔼_q[(zⱼ−μⱼ)²/σⱼ²] = σⱼ²/σⱼ² = 1,

    𝔼_q[zⱼ²] = σⱼ² + μⱼ².

  6. So for each j:

    𝔼_q[ (zⱼ−μⱼ)²/σⱼ² + log σⱼ² − zⱼ² ]

    = 1 + log σⱼ² − (σⱼ² + μⱼ²).

  7. Therefore:

    KL(q ‖ p)

    = −(1/2) ∑ⱼ [ 1 + log σⱼ² − σⱼ² − μⱼ² ]

    = (1/2) ∑ⱼ [ μⱼ² + σⱼ² − log σⱼ² − 1 ].

Insight: This closed-form KL is why the Gaussian VAE is so popular: you get exact regularization without needing Monte Carlo estimates, and gradients are stable.
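The derivation can be cross-checked numerically by estimating 𝔼_q[log q(z) − log p(z)] with samples and comparing it to the closed form. The values of μ and σ² below are arbitrary test inputs.

```python
import math
import random

def kl_closed_form(mu, var):
    """0.5 * sum_j (mu_j^2 + var_j - log var_j - 1)."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

def kl_monte_carlo(mu, var, num_samples, rng):
    """Estimate E_q[log q(z) - log p(z)] by sampling z ~ q; the log(2*pi) terms cancel."""
    total = 0.0
    for _ in range(num_samples):
        for m, v in zip(mu, var):
            z = rng.gauss(m, math.sqrt(v))
            log_q = -0.5 * ((z - m) ** 2 / v + math.log(v))
            log_p = -0.5 * z * z
            total += log_q - log_p
    return total / num_samples

rng = random.Random(0)
mu, var = [1.0, -0.5], [0.5, 2.0]
exact = kl_closed_form(mu, var)
approx = kl_monte_carlo(mu, var, 50000, rng)
print(f"closed form = {exact:.4f}   Monte Carlo = {approx:.4f}")
```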

Reparameterization in practice: differentiating through a sample #

Let q_φ(z|x) = 𝒩(μ_φ(x), diag(σ²_φ(x))). Show how to rewrite an expectation 𝔼_{q_φ}[f(z)] so ∇_φ can be computed by backprop.

  1. Define external noise ε ∼ 𝒩(0, I), independent of φ.

  2. Construct a deterministic transform:

    z = g_φ(ε, x) = μ_φ(x) + σ_φ(x) ⊙ ε.

  3. Rewrite the expectation:

    𝔼_{q_φ(z|x)}[ f(z) ] = 𝔼_{ε∼𝒩(0,I)}[ f( μ_φ(x) + σ_φ(x) ⊙ ε ) ].

  4. Approximate with a single Monte Carlo sample ε^{(1)}:

    𝔼 ≈ f( μ_φ(x) + σ_φ(x) ⊙ ε^{(1)} ).

  5. Differentiate:

    ∇_φ f( μ_φ(x) + σ_φ(x) ⊙ ε^{(1)} )

    flows through μ_φ and σ_φ via the chain rule because ε^{(1)} is treated as a constant during backprop.

Insight: Reparameterization moves the randomness to an input node (ε). Once you do that, the sampled z is just another differentiable layer in the network.
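To see step 5 concretely, the sketch below holds one noise draw fixed and compares the chain-rule gradient with a finite-difference estimate. It assumes a scalar z, the stand-in f(z) = z², and treats μ and σ as the parameters themselves rather than network outputs.

```python
def f(z):
    """Stand-in for any differentiable function of z."""
    return z ** 2

def sample_z(mu, sigma, eps):
    """Deterministic transform z = mu + sigma * eps for a given noise value."""
    return mu + sigma * eps

def pathwise_grad(mu, sigma, eps):
    """Chain rule through z = mu + sigma * eps with eps held constant."""
    z = sample_z(mu, sigma, eps)
    df_dz = 2.0 * z
    return df_dz * 1.0, df_dz * eps   # dz/dmu = 1, dz/dsigma = eps

def finite_difference(mu, sigma, eps, h=1e-6):
    """Central differences in mu and sigma, reusing the same eps."""
    d_mu = (f(sample_z(mu + h, sigma, eps)) - f(sample_z(mu - h, sigma, eps))) / (2 * h)
    d_sigma = (f(sample_z(mu, sigma + h, eps)) - f(sample_z(mu, sigma - h, eps))) / (2 * h)
    return d_mu, d_sigma

eps = 0.7  # one fixed noise draw
analytic = pathwise_grad(1.0, 0.5, eps)
numeric = finite_difference(1.0, 0.5, eps)
print(analytic, numeric)
```

Because ε is fixed, the sample behaves like any other deterministic function of the parameters, and the two gradients agree.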

Key Takeaways #

Common Mistakes #

Practice #

medium

Show that log p_θ(x) = ELBO(θ, φ; x) + KL(q_φ(z|x) ‖ p_θ(z|x)).

Hint: Start from KL(q ‖ p_θ(z|x)) and substitute p_θ(z|x) = p_θ(x, z) / p_θ(x). Rearrange to isolate log p_θ(x).

Solution:

Let q = q_φ(z|x).

KL(q ‖ p_θ(z|x))

= 𝔼_q[ log q(z|x) − log p_θ(z|x) ]

= 𝔼_q[ log q(z|x) − log (p_θ(x, z) / p_θ(x)) ]

= 𝔼_q[ log q(z|x) − log p_θ(x, z) + log p_θ(x) ]

= log p_θ(x) − 𝔼_q[ log p_θ(x, z) − log q(z|x) ]

= log p_θ(x) − ELBO.

Therefore log p_θ(x) = ELBO + KL(q_φ(z|x) ‖ p_θ(z|x)).

easy

Assume p(z) = 𝒩(0, I) and q_φ(z|x) = 𝒩(μ, diag(σ²)) for a single datapoint. If μ = (2, 0) and σ² = (0.25, 4), compute KL(q ‖ p).

Hint: Use KL = (1/2)∑ⱼ(μⱼ² + σⱼ² − log σⱼ² − 1). Be careful: the formula uses σⱼ², not σⱼ.

Solution:

Compute per dimension.

j=1: μ₁² = 4, σ₁² = 0.25, log σ₁² = log 0.25 = −1.386294...

Term₁ = 4 + 0.25 − (−1.386294) − 1 = 4.636294...

j=2: μ₂² = 0, σ₂² = 4, log σ₂² = log 4 = 1.386294...

Term₂ = 0 + 4 − 1.386294 − 1 = 1.613706...

Sum = 6.25

KL = (1/2) · 6.25 = 3.125.
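The arithmetic can be double-checked in a few lines; this simply evaluates the formula from the hint on the given μ and σ².

```python
import math

mu = [2.0, 0.0]
var = [0.25, 4.0]   # these are variances (sigma^2), not standard deviations

terms = [m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var)]
kl = 0.5 * sum(terms)
print(terms, kl)
```

The two log terms cancel exactly because log 0.25 = −log 4, which is why the sum comes out to a round 6.25.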

hard

A VAE uses Gaussian likelihood p_θ(x|z) = 𝒩(μ_θ(z), σ_x² I) with fixed σ_x². Show that maximizing 𝔼_q[log p_θ(x|z)] is equivalent (up to a constant) to minimizing 𝔼_q[‖x − μ_θ(z)‖²].

Hint: Write out the log density of a Gaussian with fixed variance and drop terms that do not depend on θ.

Solution:

For d-dimensional x,

log p_θ(x|z) = −(d/2) log(2πσ_x²) − (1/(2σ_x²)) ‖x − μ_θ(z)‖².

Take expectation over q_φ(z|x):

𝔼_q[log p_θ(x|z)]

= −(d/2) log(2πσ_x²) − (1/(2σ_x²)) 𝔼_q[ ‖x − μ_θ(z)‖² ].

The first term is constant w.r.t. θ. Therefore maximizing 𝔼_q[log p_θ(x|z)] is equivalent to minimizing 𝔼_q[ ‖x − μ_θ(z)‖² ]. (The scaling 1/(2σ_x²) does not change the optimizer when σ_x² is fixed.)
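A quick numerical illustration of the equivalence: for two candidate reconstructions, the difference in Gaussian log-likelihood equals the difference in squared error scaled by 1/(2σ_x²). The vectors and the variance below are arbitrary example values.

```python
import math

def squared_error(x, mean):
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mean))

def gaussian_log_lik(x, mean, sigma2):
    """log N(x; mean, sigma2 * I) for equal-length lists x and mean."""
    d = len(x)
    return -0.5 * d * math.log(2 * math.pi * sigma2) - squared_error(x, mean) / (2 * sigma2)

x = [0.2, -1.0, 0.5]
recon_a = [0.1, -0.8, 0.4]   # closer reconstruction
recon_b = [0.9, 0.0, -0.3]   # worse reconstruction
sigma2 = 0.5

diff_loglik = gaussian_log_lik(x, recon_a, sigma2) - gaussian_log_lik(x, recon_b, sigma2)
diff_sqerr = squared_error(x, recon_b) - squared_error(x, recon_a)
print(diff_loglik, diff_sqerr / (2 * sigma2))
```

The normalizing constant cancels in the difference, so ranking reconstructions by log-likelihood is the same as ranking them by squared error.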

Connections #

Quality: A (4.4/5)
