←Back to Tech Tree
inventorycoverage
Variational Autoencoders #
Machine LearningDifficulty: ★★★★★Depth: 11Unlocks: 1
Generative models with latent variables. ELBO, reparameterization.
Interactive Visualization #
⏮◀◀▶▶STEP0.25x1xZOOM
t=0s
Core Concepts #
- -Latent-variable generative model: data x is generated from latent z via a decoder p_theta(x|z) with prior p(z) (joint p_theta(x,z)=p_theta(x|z)p(z))
- -Variational posterior (encoder) q_phi(z|x): a tractable parametric approximation to the true posterior p_theta(z|x)
- -Evidence Lower Bound (ELBO): the objective E_{q_phi(z|x)}[log p_theta(x,z) - log q_phi(z|x)] used to fit model and inference parameters
- -Reparameterization trick: express z = g_phi(epsilon,x) with epsilon drawn from a fixed noise distribution so gradients w.r.t. phi can backpropagate through sampling
Key Symbols & Notation #
q_phi(z|x) - the variational/encoder distribution (parameters phi)
Essential Relationships #
- -ELBO decomposition: log p_theta(x) = ELBO + KL[q_phi(z|x) || p_theta(z|x)], equivalently ELBO = E_{q_phi}[log p_theta(x|z)] - KL[q_phi(z|x) || p(z)]
Prerequisites (3) #
Bayesian Inference5 atomsNeural Networks6 atomsKL Divergence6 atoms
Unlocks (1) #
Diffusion Modelslvl 5
Advanced Learning Details
Graph Position #
169
Depth Cost
1
Fan-Out (ROI)
1
Bottleneck Score
11
Chain Length
Cognitive Load #
6
Atomic Elements
37
Total Elements
L2
Percentile Level
L4
Atomic Level
All Concepts (12) #
- Latent-variable generative model: a decoder p_θ(x|z) with a prior p(z) and marginal likelihood p_θ(x)=∫ p_θ(x|z)p(z) dz (usually intractable)
- Inference network / encoder q_φ(z|x): a neural-network parameterized approximate posterior that maps x to a distribution over latent z
- Evidence Lower Bound (ELBO): the objective used in VAEs that lower-bounds log p_θ(x)
- ELBO decomposition: ELBO as the sum of a reconstruction (expected log-likelihood) term and a KL regularizer
- Amortized variational inference: learning a single parametric mapping q_φ(z|x) shared across data points instead of per-datapoint variational parameters
- Reparameterization trick: expressing stochastic sampling from q_φ(z|x) as a deterministic, differentiable transform z = g_φ(ε, x) of noise ε to enable backpropagation through samples
- Stochastic gradient variational Bayes / SGVB estimator: using Monte Carlo samples (with reparameterization) to get low-variance unbiased gradient estimates of ELBO
- Monte Carlo estimation of expectations in the ELBO (using a small number of samples per datapoint during training)
- Gaussian encoder parameterization: common choice q_φ(z|x)=N(μ_φ(x), diag(σ^2_φ(x))) with the encoder outputting μ and σ
- Closed-form KL for common pairs (e.g., Gaussian q to standard normal prior) used to avoid sampling for the KL term
- Trade-off perspective: ELBO optimization trades reconstruction accuracy against closeness of q_φ to the prior (regularization of latent space)
- Posterior collapse (degenerate solution): phenomenon where q_φ(z|x) collapses to the prior p(z) and the decoder ignores z
Teaching Strategy #
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
Variational Autoencoders (VAEs) are the bridge between probabilistic latent-variable modeling (Bayes, priors, posteriors) and deep learning (powerful function approximation). They give you a principled way to learn both a generator and an inference procedure—by optimizing a single tractable objective: the ELBO.
TL;DR:
A VAE posits a latent variable z that generates data x via a decoder p_θ(x|z) and a prior p(z). Because the true posterior p_θ(z|x) is intractable, we approximate it with an encoder q_φ(z|x). Training maximizes the ELBO: 𝔼_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)). The reparameterization trick (z = μ_φ(x) + σ_φ(x) ⊙ ε, ε ∼ 𝒩(0, I)) allows backpropagation through sampling.
What Is a Variational Autoencoder? #
Why VAEs exist (motivation) #
In many problems we want a model that can generate realistic data and also explain data in terms of hidden factors. Think:
- •Images explained by pose, lighting, identity
- •Audio explained by speaker, phoneme content
- •Text explained by topic, style
A standard (deterministic) autoencoder learns an encoder f(x) → z and decoder g(z) → x̂, but it does not define a probability distribution over data. You can reconstruct, but “sampling” z and decoding often produces arbitrary garbage because the latent space has no probabilistic structure.
A VAE fixes this by making the model explicitly probabilistic. It’s an instance of a latent-variable generative model:
- •Sample latent z from a prior p(z)
- •Sample data x from a likelihood/decoder p_θ(x|z)
This defines a joint distribution:
p_θ(x, z) = p_θ(x|z) p(z)
If we can learn θ well, then we can generate new data by sampling z ∼ p(z) and then x ∼ p_θ(x|z).
The core obstacle: posterior inference #
Given an observed x, the Bayesian posterior over latents is
p_θ(z|x) = p_θ(x|z) p(z) / p_θ(x)
where the marginal likelihood (evidence) is
p_θ(x) = ∫ p_θ(x|z) p(z) dz
In deep models, that integral is typically intractable.
But to learn the model, we’d like to maximize log p_θ(x) over θ for the dataset. And to do inference (encode x), we want p_θ(z|x). Both are blocked by the same intractable evidence integral.
The VAE idea in one sentence #
Introduce a tractable approximation q_φ(z|x) (the variational posterior / encoder) and optimize a lower bound on log p_θ(x) that is differentiable and scalable.
What makes it an “autoencoder”? #
The VAE has two neural networks:
- •Encoder q_φ(z|x): maps x to parameters of a distribution over z
- •Decoder p_θ(x|z): maps z to parameters of a distribution over x
Unlike a deterministic autoencoder, the encoder outputs a distribution (often Gaussian) and the decoder defines a likelihood (often Gaussian for real-valued data, Bernoulli for binary pixels, categorical for discrete tokens, etc.).
Typical choice of distributions #
A common (and very useful) baseline is:
- •Prior: p(z) = 𝒩(0, I)
- •Encoder: q_φ(z|x) = 𝒩(μ_φ(x), diag(σ²_φ(x)))
- •Decoder:
- •For real-valued x: p_θ(x|z) = 𝒩(μ_θ(z), σ² I) (often fixed σ)
- •For binary pixels: p_θ(x|z) = Bernoulli(π_θ(z))
This is not required, but it’s a common starting point because (1) sampling is easy, (2) KL terms often have closed form, and (3) reparameterization is straightforward.
Mental model #
Think of training a VAE as simultaneously:
- 1)Learning a generator that can map simple noise z into data space.
- 2)Learning an inference network that can map data x back to a distribution over plausible z.
- 3)Ensuring these two agree via a variational objective.
Core Mechanic 1: The ELBO (Evidence Lower Bound) #
Why we need a bound at all #
The quantity we would like to maximize for each datapoint x is log p_θ(x). But:
log p_θ(x) = log ∫ p_θ(x|z) p(z) dz
The log of an integral of a neural-network-defined density is generally not tractable.
Variational inference gives a workaround: introduce a distribution q_φ(z|x) that we can sample from and evaluate.
Deriving the ELBO (showing the work) #
Start with the log evidence and multiply inside by q_φ(z|x) / q_φ(z|x):
log p_θ(x)
= log ∫ p_θ(x, z) dz
= log ∫ q_φ(z|x) · [p_θ(x, z) / q_φ(z|x)] dz
Now apply Jensen’s inequality to log 𝔼[·] (log is concave):
log ∫ q_φ(z|x) · [p_θ(x, z) / q_φ(z|x)] dz
= log 𝔼_{q_φ(z|x)} [ p_θ(x, z) / q_φ(z|x) ]
≥ 𝔼_{q_φ(z|x)} [ log p_θ(x, z) − log q_φ(z|x) ]
Define the ELBO:
ELBO(θ, φ; x) = 𝔼_{q_φ(z|x)} [ log p_θ(x, z) − log q_φ(z|x) ]
So we have the bound:
log p_θ(x) ≥ ELBO(θ, φ; x)
Unpacking the ELBO into “reconstruction − regularization” #
Use p_θ(x, z) = p_θ(x|z) p(z):
ELBO
= 𝔼_{q_φ(z|x)}[ log p_θ(x|z) + log p(z) − log q_φ(z|x) ]
Group the last two terms as a KL divergence:
KL(q_φ(z|x) ‖ p(z))
= 𝔼_{q_φ(z|x)}[ log q_φ(z|x) − log p(z) ]
So:
ELBO
= 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) ‖ p(z))
This is the form you implement.
Term 1: expected log-likelihood (reconstruction) #
𝔼_{q_φ(z|x)}[ log p_θ(x|z) ]
- •Encourages z sampled from the encoder to decode into something that assigns high probability to the observed x.
- •With Gaussian likelihood and fixed variance, this becomes (up to constants) a negative squared error.
- •With Bernoulli likelihood, it becomes cross-entropy.
Term 2: KL to the prior (regularization) #
KL(q_φ(z|x) ‖ p(z))
- •Encourages the encoded distribution to stay close to the prior.
- •This makes sampling from p(z) produce meaningful decodes.
- •Prevents the encoder from using arbitrarily “spiky” posteriors just to reconstruct perfectly.
The tightness of the bound #
A key identity connects ELBO and the true posterior:
log p_θ(x) = ELBO(θ, φ; x) + KL(q_φ(z|x) ‖ p_θ(z|x))
Derivation sketch (showing the work):
KL(q ‖ p_θ(z|x))
= 𝔼_q[ log q(z|x) − log p_θ(z|x) ]
= 𝔼_q[ log q(z|x) − log (p_θ(x, z) / p_θ(x)) ]
= 𝔼_q[ log q(z|x) − log p_θ(x, z) + log p_θ(x) ]
= log p_θ(x) − 𝔼_q[ log p_θ(x, z) − log q(z|x) ]
= log p_θ(x) − ELBO
Rearrange:
log p_θ(x) = ELBO + KL(q ‖ p_θ(z|x))
Because KL ≥ 0, ELBO is a lower bound. It becomes tight when q_φ(z|x) matches the true posterior.
Dataset objective #
For a dataset {xᵢ}ᵢ₌₁ᴺ, maximize:
∑ᵢ ELBO(θ, φ; xᵢ)
This trains:
- •θ to make the decoder a good likelihood model
- •φ to make the encoder approximate the posterior under the current decoder
A practical view: what gradients do we need? #
We need gradients of
𝔼_{q_φ(z|x)}[ log p_θ(x|z) ]
with respect to both θ and φ, plus gradients of the KL term.
- •Gradients w.r.t. θ are usually straightforward: sample z and backprop through the decoder.
- •Gradients w.r.t. φ are subtle because z is sampled from q_φ, so the sampling operation depends on φ.
That’s exactly why the reparameterization trick matters.
Core Mechanic 2: The Reparameterization Trick #
Why reparameterization is needed #
Suppose we approximate the expectation with Monte Carlo:
𝔼_{q_φ(z|x)}[ f(z) ] ≈ (1/L) ∑_{ℓ=1}^L f(z^{(ℓ)}), where z^{(ℓ)} ∼ q_φ(z|x)
If z^{(ℓ)} is produced by a sampling step that depends on φ, naive backprop gets stuck: the computational graph has a stochastic node.
One option is the score-function (REINFORCE) estimator:
∇_φ 𝔼_{q_φ}[f(z)] = 𝔼_{q_φ}[ f(z) ∇_φ log q_φ(z) ]
It’s unbiased but typically high-variance.
Reparameterization gives a lower-variance, pathwise gradient by rewriting the random variable as a deterministic function of φ and external noise.
The key idea #
If you can write
z = g_φ(ε, x), ε ∼ p(ε)
where p(ε) does not depend on φ, then
𝔼_{q_φ(z|x)}[ f(z) ] = 𝔼_{p(ε)}[ f(g_φ(ε, x)) ]
Now the randomness is in ε, not in the parameters. Gradients can flow through g_φ.
Gaussian case (most common) #
Let
q_φ(z|x) = 𝒩(μ_φ(x), diag(σ²_φ(x)))
Sample ε ∼ 𝒩(0, I) and define:
z = μ_φ(x) + σ_φ(x) ⊙ ε
Here ⊙ is elementwise multiplication.
This produces z distributed exactly as q_φ(z|x). And μ_φ, σ_φ are outputs of a neural net.
Backprop through the expectation #
Consider the reconstruction term for a single x:
ℒ_rec(θ, φ; x) = 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ]
Using reparameterization:
ℒ_rec = 𝔼_{ε∼𝒩(0,I)}[ log p_θ(x| μ_φ(x) + σ_φ(x) ⊙ ε ) ]
Approximate with L samples:
ℒ_rec ≈ (1/L) ∑_{ℓ=1}^L log p_θ(x| μ_φ(x) + σ_φ(x) ⊙ ε^{(ℓ)})
Now ∇_φ is just ordinary backprop through μ_φ and σ_φ.
With a standard normal prior p(z) = 𝒩(0, I) and diagonal Gaussian q_φ, the KL has a closed form.
Let q = 𝒩(μ, diag(σ²)) and p = 𝒩(0, I). Then:
KL(q ‖ p) = (1/2) ∑ⱼ ( μⱼ² + σⱼ² − log σⱼ² − 1 )
This is extremely useful: it’s exact, differentiable, and cheap.
Show the structure (intuition) #
Each latent dimension j pays a penalty:
- •μⱼ²: pushes means toward 0
- •σⱼ²: pushes variances toward 1 (too large is penalized)
- •−log σⱼ²: penalizes tiny variances (too confident)
So the encoder is encouraged to produce a distribution “not too far” from 𝒩(0,1).
Practical parameterization: log-variance #
To keep σ positive, we usually output log σ² (call it s) and compute:
σ² = exp(s), σ = exp(0.5 s)
This avoids invalid (negative) variances and tends to be numerically stable.
Summary table: what you compute in a basic VAE #
| Piece | Object | Typical choice | Role |
|---|
| Prior | p(z) | 𝒩(0, I) | Defines “sampling space” |
| Encoder | q_φ(z | x) | 𝒩(μ_φ(x), diag(σ²_φ(x))) |
| Decoder | p_θ(x | z) | Bernoulli or Gaussian |
| Objective | ELBO | 𝔼_q[log p_θ(x | z)] − KL(q ‖ p) |
| Trick | z = μ + σ ⊙ ε | ε ∼ 𝒩(0,I) | Low-variance gradients |
A note on discrete latents #
Reparameterization is straightforward for continuous distributions like Gaussians. For discrete latents, you need alternatives (Gumbel-Softmax / Concrete distributions, score-function estimators, or other variational relaxations). Many VAE lessons stop at Gaussians because they cover the most common and useful case.
Application/Connection: How VAEs Are Used (and What to Watch For) #
Generation #
After training, generation is simple:
- 1)Sample z ∼ p(z) = 𝒩(0, I)
- 2)Sample x ∼ p_θ(x|z) or take the decoder mean as a “typical” sample
Because the KL term kept q_φ(z|x) near p(z), the decoder has been trained on z values that look like prior samples.
Representation learning #
The latent z can be used as a learned feature representation. Common uses:
- •Clustering in latent space
- •Interpolation: decode points along (1−t)z₁ + tz₂
- •Downstream supervised tasks using z as input
Caution: VAEs trade off reconstruction fidelity vs. latent regularity. If the KL term dominates, representations can become less informative.
Conditional VAEs (cVAEs) #
If you want generation conditioned on labels or attributes y:
- •Prior p(z|y)
- •Decoder p_θ(x|z, y)
- •Encoder q_φ(z|x, y)
This lets you generate samples with a chosen condition.
Anomaly detection #
A common heuristic: points with low ELBO (or high reconstruction error) are considered anomalous. This works best when the model class fits normal data well.
The “posterior collapse” problem (important failure mode) #
In powerful decoders (e.g., autoregressive text decoders), the model may learn to ignore z entirely:
- •q_φ(z|x) ≈ p(z) (KL goes to 0)
- •Decoder models p_θ(x) well without needing z
Then z carries little information about x.
Common mitigations #
| Technique | What it changes | Why it helps |
|---|
| KL annealing | Slowly increase KL weight from 0 → 1 | Gives encoder time to learn informative latents |
| β-VAE | Use β · KL with β ≠ 1 | β < 1 encourages information; β > 1 encourages disentanglement |
| Free bits | KL term has a per-dim minimum | Prevents KL from collapsing too aggressively |
| Weaker decoder | Reduce decoder capacity | Forces use of z |
β-VAE: a small change with big effects #
Objective:
ELBO_β = 𝔼_q[log p_θ(x|z)] − β KL(q ‖ p)
- •β > 1: stronger pressure to match the prior → often more “disentangled” latents but blurrier samples
- •β < 1: weaker pressure → better reconstructions but latent space may be less smooth for sampling
Importance-weighted autoencoders (IWAE) #
ELBO uses one sample (or few) and is a lower bound. IWAE tightens the bound using multiple samples:
log p_θ(x) ≥ 𝔼_{z₁:K ∼ q}[ log ( (1/K) ∑ₖ p_θ(x, zₖ) / q(zₖ|x) ) ]
This can improve generative modeling but changes optimization dynamics.
Connection to diffusion models (why this node unlocks them) #
Diffusion models also involve:
- •Latent/noise variables (a sequence of noisy states)
- •Learning to reverse a corruption process
- •Using tractable training objectives that avoid directly optimizing log p(x) in closed form
Conceptually, VAEs train a generator with latent variables via a variational bound; diffusion trains a generator via denoising/score objectives. Understanding:
- •latent-variable modeling
- •KL terms and approximate inference
- •reparameterization and sampling-based gradients
makes diffusion objectives feel much less mysterious.
Worked Examples (3) #
Derive the ELBO decomposition into reconstruction − KL #
Given a latent-variable model p_θ(x, z) = p_θ(x|z) p(z) and an approximate posterior q_φ(z|x), show that ELBO = 𝔼_q[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)).
Start from the ELBO definition:
ELBO = 𝔼_{q_φ(z|x)}[ log p_θ(x, z) − log q_φ(z|x) ].
Substitute the joint factorization:
log p_θ(x, z) = log p_θ(x|z) + log p(z).
Plug in and separate expectations:
ELBO = 𝔼_q[ log p_θ(x|z) + log p(z) − log q(z|x) ]
= 𝔼_q[ log p_θ(x|z) ] + 𝔼_q[ log p(z) − log q(z|x) ].
Recognize the KL divergence:
KL(q(z|x) ‖ p(z)) = 𝔼_q[ log q(z|x) − log p(z) ].
Therefore:
𝔼_q[ log p(z) − log q(z|x) ] = − KL(q(z|x) ‖ p(z)).
Conclude:
ELBO = 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) ‖ p(z)).
Insight: The ELBO cleanly separates “fit the data” (expected log-likelihood) from “keep latents well-behaved for sampling” (KL to the prior). This is the central tradeoff in VAEs.
Compute KL(q ‖ p) for diagonal Gaussians (standard VAE case) #
Let q(z) = 𝒩(μ, diag(σ²)) and p(z) = 𝒩(0, I). Derive KL(q ‖ p) = (1/2) ∑ⱼ ( μⱼ² + σⱼ² − log σⱼ² − 1 ).
Write log densities (up to constants) for d dimensions.
For p:
log p(z) = −(1/2) ∑ⱼ ( zⱼ² + log 2π ).
For q:
log q(z) = −(1/2) ∑ⱼ ( (zⱼ−μⱼ)²/σⱼ² + log σⱼ² + log 2π ).
Start from KL definition:
KL(q ‖ p) = 𝔼_q[ log q(z) − log p(z) ].
The log 2π constants cancel.
Compute the difference inside the expectation:
log q − log p
= −(1/2) ∑ⱼ [ (zⱼ−μⱼ)²/σⱼ² + log σⱼ² − zⱼ² ].
Take expectation under q. Use facts:
If zⱼ ∼ 𝒩(μⱼ, σⱼ²), then
𝔼[(zⱼ−μⱼ)²] = σⱼ²,
𝔼[zⱼ²] = Var(zⱼ) + (𝔼[zⱼ])² = σⱼ² + μⱼ².
Substitute expectations:
𝔼_q[(zⱼ−μⱼ)²/σⱼ²] = σⱼ²/σⱼ² = 1,
𝔼_q[zⱼ²] = σⱼ² + μⱼ².
So for each j:
𝔼_q[ (zⱼ−μⱼ)²/σⱼ² + log σⱼ² − zⱼ² ]
= 1 + log σⱼ² − (σⱼ² + μⱼ²).
Therefore:
KL(q ‖ p)
= −(1/2) ∑ⱼ [ 1 + log σⱼ² − σⱼ² − μⱼ² ]
= (1/2) ∑ⱼ [ μⱼ² + σⱼ² − log σⱼ² − 1 ].
Insight: This closed-form KL is why the Gaussian VAE is so popular: you get exact regularization without needing Monte Carlo estimates, and gradients are stable.
Reparameterization in practice: differentiating through a sample #
Let q_φ(z|x) = 𝒩(μ_φ(x), diag(σ²_φ(x))). Show how to rewrite an expectation 𝔼_{q_φ}[f(z)] so ∇_φ can be computed by backprop.
Define external noise ε ∼ 𝒩(0, I), independent of φ.
Construct a deterministic transform:
z = g_φ(ε, x) = μ_φ(x) + σ_φ(x) ⊙ ε.
Rewrite the expectation:
𝔼_{q_φ(z|x)}[ f(z) ] = 𝔼_{ε∼𝒩(0,I)}[ f( μ_φ(x) + σ_φ(x) ⊙ ε ) ].
Approximate with a Monte Carlo sample ε¹:
𝔼 ≈ f( μ_φ(x) + σ_φ(x) ⊙ ε¹ ).
Differentiate:
∇_φ f( μ_φ(x) + σ_φ(x) ⊙ ε¹ )
flows through μ_φ and σ_φ via the chain rule because ε¹ is treated as a constant during backprop.
Insight: Reparameterization moves the randomness to an input node (ε). Once you do that, the sampled z is just another differentiable layer in the network.
Key Takeaways #
✓
A VAE is a probabilistic latent-variable model: p_θ(x, z) = p_θ(x|z) p(z), enabling true sampling/generation.
✓
Because p_θ(z|x) is usually intractable, we introduce an encoder q_φ(z|x) to approximate the posterior.
✓
The ELBO is the tractable training objective: ELBO = 𝔼_{q_φ}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)).
✓
ELBO is a lower bound on log p_θ(x) and becomes tight when q_φ(z|x) = p_θ(z|x).
✓
The reparameterization trick (Gaussian case: z = μ + σ ⊙ ε, ε ∼ 𝒩(0,I)) enables low-variance gradients through sampling.
✓
With Gaussian prior and diagonal Gaussian encoder, KL has a closed form: (1/2)∑ⱼ(μⱼ² + σⱼ² − log σⱼ² − 1).
✓
VAEs can fail via posterior collapse (KL→0, latents ignored), especially with very strong decoders; KL annealing, β-VAE, and capacity control help.
✓
Understanding latent-variable objectives, KL structure, and reparameterization provides conceptual groundwork for later generative models, including diffusion.
Common Mistakes #
✗
Treating the VAE decoder output as a deterministic reconstruction x̂ without defining a likelihood p_θ(x|z); you need a distribution to make the ELBO meaningful.
✗
Forgetting that the reconstruction term is an expectation over z ∼ q_φ(z|x); using only μ_φ(x) can work as a heuristic but changes the objective.
✗
Implementing σ directly (which can go negative) instead of parameterizing log σ² and exponentiating; this often causes numerical instability.
✗
Assuming the KL term is just a generic regularizer; it specifically matches q_φ(z|x) to the chosen prior p(z), which determines sampling behavior.
Practice #
medium
Show that log p_θ(x) = ELBO(θ, φ; x) + KL(q_φ(z|x) ‖ p_θ(z|x)).
Hint: Start from KL(q ‖ p_θ(z|x)) and substitute p_θ(z|x) = p_θ(x, z) / p_θ(x). Rearrange to isolate log p_θ(x).
Show solution
Let q = q_φ(z|x).
KL(q ‖ p_θ(z|x))
= 𝔼_q[ log q(z|x) − log p_θ(z|x) ]
= 𝔼_q[ log q(z|x) − log (p_θ(x, z) / p_θ(x)) ]
= 𝔼_q[ log q(z|x) − log p_θ(x, z) + log p_θ(x) ]
= log p_θ(x) − 𝔼_q[ log p_θ(x, z) − log q(z|x) ]
= log p_θ(x) − ELBO.
Therefore log p_θ(x) = ELBO + KL(q_φ(z|x) ‖ p_θ(z|x)).
easy
Assume p(z) = 𝒩(0, I) and q_φ(z|x) = 𝒩(μ, diag(σ²)) for a single datapoint. If μ = (2, 0) and σ² = (0.25, 4), compute KL(q ‖ p).
Hint: Use KL = (1/2)∑ⱼ(μⱼ² + σⱼ² − log σⱼ² − 1). Be careful: the formula uses σⱼ², not σⱼ.
Show solution
Compute per dimension.
j=1: μ₁² = 4, σ₁² = 0.25, log σ₁² = log 0.25 = −1.386294...
Term₁ = 4 + 0.25 − (−1.386294) − 1 = 4.636294...
j=2: μ₂² = 0, σ₂² = 4, log σ₂² = log 4 = 1.386294...
Term₂ = 0 + 4 − 1.386294 − 1 = 1.613706...
Sum = 6.25
KL = (1/2) · 6.25 = 3.125.
hard
A VAE uses Gaussian likelihood p_θ(x|z) = 𝒩(μ_θ(z), σ_x² I) with fixed σ_x². Show that maximizing 𝔼_q[log p_θ(x|z)] is equivalent (up to a constant) to minimizing 𝔼_q[‖x − μ_θ(z)‖²].
Hint: Write out the log density of a Gaussian with fixed variance and drop terms that do not depend on θ.
Show solution
For d-dimensional x,
log p_θ(x|z) = −(d/2) log(2πσ_x²) − (1/(2σ_x²)) ‖x − μ_θ(z)‖².
Take expectation over q_φ(z|x):
𝔼_q[log p_θ(x|z)]
= −(d/2) log(2πσ_x²) − (1/(2σ_x²)) 𝔼_q[ ‖x − μ_θ(z)‖² ].
The first term is constant w.r.t. θ. Therefore maximizing 𝔼_q[log p_θ(x|z)] is equivalent to minimizing 𝔼_q[ ‖x − μ_θ(z)‖² ]. (The scaling 1/(2σ_x²) does not change the optimizer when σ_x² is fixed.)
Connections #
Quality: A (4.4/5)
← back to treebrowse all →