MCMC

←Back to Tech Tree

inventorycoverage

MCMC #

Probability & StatisticsDifficulty: ★★★★☆Depth: 8Unlocks: 3

Markov Chain Monte Carlo. Sampling from complex distributions.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

pi(x) - target (possibly unnormalized) density to sample from.K(x -> x') - Markov transition kernel giving the probability or density of moving from state x to state x'.

Essential Relationships #

Prerequisites (3) #

Markov Chains6 atomsMonte Carlo Methods6 atomsBayesian Inference5 atoms

Unlocks (1) #

Hierarchical Bayesian Modelslvl 5

Advanced Learning Details

Graph Position #

142

Depth Cost

3

Fan-Out (ROI)

1

Bottleneck Score

8

Chain Length

Cognitive Load #

6

Atomic Elements

42

Total Elements

L3

Percentile Level

L4

Atomic Level

All Concepts (20) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

In Bayesian inference you often know your posterior only up to a constant: π(x) ∝ prior(x)·likelihood(x). That “missing constant” makes direct sampling and normalization hard—exactly where Markov Chain Monte Carlo (MCMC) shines: it builds a Markov chain that spends time in regions proportional to π(x), so you can estimate expectations with samples that come from a single long run.

TL;DR:

MCMC turns sampling into a dynamical process: design a Markov transition kernel K(x → x′) whose stationary distribution is the target π(x). If the chain is ergodic, averages along the chain converge to expectations under π. Metropolis–Hastings and Gibbs sampling are the two core “proposal-and-correction” mechanisms that guarantee π is invariant while remaining practical for complex, high-dimensional targets.

What Is MCMC? #

The problem MCMC solves #

In many real problems you can evaluate a target density but can’t easily:

A canonical case is Bayesian inference. You want the posterior

π(x) = p(x | data) ∝ p(data | x) p(x)

where the proportionality constant is the evidence Z = p(data).

If your goal is an expectation

Eπ[f(X)] = ∫ f(x) π(x) dx,

then plain Monte Carlo would ask you to draw X₁, …, Xₙ i.i.d. from π. But drawing i.i.d. is often the hard part.

The key MCMC idea #

MCMC replaces i.i.d. samples with a dependent sequence

X₀, X₁, X₂, …

generated by a Markov chain with transition kernel K(x → x′). You design K so that:

  1. 1)π is invariant (stationary): if Xₜ ∼ π, then Xₜ₊₁ ∼ π.
  2. 2)The chain is ergodic: from almost any start, it eventually forgets its initial state and explores the whole relevant support.

Then you estimate expectations by time averages:

f̄ₙ = (1/n) ∑ₜ₌₁ⁿ f(Xₜ) ≈ Eπ[f(X)].

The tradeoff you accept is dependence: samples are correlated, which reduces effective sample size. The payoff is you can sample from extremely complex π—high-dimensional, constrained, or only known up to a constant.

Vocabulary: target, kernel, invariance #

π(x′) = ∫ π(x) K(x → x′) dx (continuous)

or π = πK (discrete).

In practice, you rarely verify invariance by solving the stationarity equation directly. Instead, you design K to satisfy a stronger and easier-to-check condition: detailed balance.

Detailed balance (reversibility) #

A kernel K satisfies detailed balance with π if for all x, x′:

π(x) K(x → x′) = π(x′) K(x′ → x).

If detailed balance holds and K is a valid Markov kernel, then π is invariant:

∫ π(x) K(x → x′) dx

= ∫ π(x′) K(x′ → x) dx

= π(x′) ∫ K(x′ → x) dx

= π(x′).

That last step uses that K(x′ → ·) integrates to 1.

What “sampling” means in MCMC #

Unlike i.i.d. sampling, MCMC has two phases:

  1. 1)Transient / burn-in: the chain moves from its initial distribution toward π.
  2. 2)Stationary sampling: once near stationarity, you treat subsequent states as draws (correlated) from π.

When MCMC works well, you get practical access to:

But MCMC is not “press a button.” You must reason about kernel design, ergodicity, and diagnostics. The rest of this lesson builds those pieces carefully.

Core Mechanic 1: Invariant Distributions via Proposal-and-Correction (Metropolis–Hastings) #

Why proposal-and-correction exists #

Suppose you have a target π(x) that is complicated, but you can:

If you naïvely propose X′ ∼ q(· | x) and always accept it, the chain’s stationary distribution will be whatever q induces—not your π.

Metropolis–Hastings (MH) fixes this by adding an accept/reject correction that adjusts for the mismatch between proposal dynamics and target probabilities.

Metropolis–Hastings algorithm #

Given current state x:

  1. 1)Propose x′ ∼ q(x′ | x)
  2. 2)Accept with probability

α(x, x′) = min{1, [π(x′) q(x | x′)] / [π(x) q(x′ | x)]}

  1. 3)If accepted, set Xₜ₊₁ = x′; else set Xₜ₊₁ = x.

Important: you can replace π with π̃ because the normalizing constant cancels in the ratio.

The MH transition kernel and detailed balance #

Define the proposal kernel q(x′ | x) and acceptance α(x, x′). The MH transition has two parts:

Now the crucial property: MH is constructed so that detailed balance holds.

Take x ≠ x′. Then:

π(x) K(x → x′)

= π(x) q(x′ | x) α(x, x′)

and with α defined as the min of 1 and a ratio, you can check that

π(x) q(x′ | x) α(x, x′)

= π(x′) q(x | x′) α(x′, x)

= π(x′) K(x′ → x).

Therefore π is invariant.

Intuition for the acceptance ratio #

The MH ratio compares how much the proposed move would distort the target:

r(x, x′) = [π(x′) q(x | x′)] / [π(x) q(x′ | x)].

The q terms matter because if your proposal tends to jump into some region too often, MH corrects by rejecting more there.

Special case: Random-walk Metropolis #

A very common choice is a symmetric proposal:

q(x′ | x) = q(x | x′).

For example, in ℝᵈ:

x′ = x + ε, ε ∼ 𝒩(0, σ²I).

Then the acceptance simplifies:

α(x, x′) = min{1, π(x′)/π(x)}.

This is simple and broadly applicable, but it can mix slowly in high dimensions or when π has strong correlations.

Designing proposals: local vs global moves #

Proposal choice is the main engineering degree of freedom in MH.

Proposal styleTypical q(x′x)ProsCons
Random-walkx′ = x + εSimple; needs only π̃(x)Can mix slowly; tuning σ is critical
Independenceq(x′) independent of xCan make big jumpsNeeds good global approximation to π
Langevin / gradient-informedx′ ≈ x + (step)∇log π(x) + noiseFaster mixing on smooth targetsRequires gradients; careful step-size

Even without going deep into gradient methods, the lesson is: MH correctness is easy; MH efficiency is hard.

Tuning acceptance rate (practical intuition) #

If σ is too small in a random-walk proposal, you accept almost everything but explore slowly (high autocorrelation).

If σ is too large, you propose far-away points with tiny π and reject often (stuck chain).

A good proposal balances movement and acceptance. In many problems, you tune σ so that acceptance is neither near 0 nor near 1.

What MH guarantees—and what it doesn’t #

MH guarantees π is invariant if the chain is irreducible and aperiodic on the relevant support.

But MH does not guarantee:

Those depend on proposal quality and the geometry of π.

Core Mechanic 2: Ergodicity and Why Time Averages Work #

Why ergodicity matters #

MCMC’s promise is:

(1/n) ∑ₜ₌₁ⁿ f(Xₜ) → Eπ[f(X)]

even though Xₜ are dependent.

This is not magic; it’s an ergodic theorem for Markov chains. You already know (from prerequisite Markov chains) that a stationary distribution π satisfies π = πK. Ergodicity strengthens this: the chain not only has π, it converges to π from broad initial conditions.

The big picture conditions #

Different textbooks phrase conditions differently, but the core ideas are:

  1. 1)Irreducibility: you can reach any region of the support (at least in principle).
  2. 2)Aperiodicity: you don’t get trapped in deterministic cycles.
  3. 3)Positive recurrence / stability: the chain doesn’t wander off to infinity when π is proper.

Under these conditions (plus some technical regularity), the chain has a unique stationary distribution π and converges to it.

What convergence means (distribution vs averages) #

There are two related but distinct statements:

  1. 1)Distributional convergence (mixing): as t → ∞,

Law(Xₜ) → π.

  1. 2)Ergodic averages: as n → ∞,

(1/n) ∑ₜ₌₁ⁿ f(Xₜ) → Eπ[f(X)] (almost surely, under conditions).

The second is what you use for Monte Carlo estimates.

Autocorrelation and effective sample size #

Because samples are correlated, variance decays slower than 1/n.

Let μ = Eπ[f(X)] and consider the estimator f̄ₙ.

Under standard conditions, a Markov chain CLT gives:

√n (f̄ₙ − μ) ⇒ 𝒩(0, σ²_f),

where

σ²_f = Varπ(f(X₀)) + 2 ∑_{k=1}^∞ Covπ(f(X₀), f(X_k)).

Define the integrated autocorrelation time:

τ_int = 1 + 2 ∑_{k=1}^∞ ρ_k,

where ρ_k are autocorrelations. Then roughly:

Var(f̄ₙ) ≈ (Varπ(f)/n) · τ_int.

This motivates an effective sample size:

n_eff ≈ n / τ_int.

So MCMC accuracy depends not just on n, but on how quickly the chain forgets (how fast ρ_k decays).

Burn-in: when and why #

If you start at X₀ far from typical regions of π, early samples are biased toward the initial state. The usual operational fix is to discard the first B iterations.

Conceptually:

If the chain moves slowly, discarding more samples doesn’t create independence; it just wastes computation.

Thinning: usually not necessary #

Thinning keeps every m-th sample. People do it to “reduce correlation,” but correlation is not removed; you just throw away information.

Thinning can be useful when:

Otherwise, keeping all samples and estimating τ_int is generally better.

Diagnostics (the practical side of ergodicity) #

Ergodicity is a theorem, but you still need to check behavior.

Common checks:

None of these is a proof of convergence, but they catch many failures.

A warning about multimodality #

If π has widely separated modes, a random-walk MH chain may be ergodic in theory but practically non-mixing on feasible timescales.

In such cases you may need:

The lesson here: ergodicity is about infinite time; efficiency is about the time you can afford.

Core Mechanic 3: Gibbs Sampling (Conditionals as Perfect Local Moves) #

Why Gibbs exists #

Metropolis–Hastings is general but needs a good proposal q(x′ | x). Sometimes you don’t have a good global proposal, but you do know and can sample from conditional distributions.

Gibbs sampling exploits a simple idea:

If you can sample exactly from the conditional of one variable given the others, you can construct a Markov chain that leaves the joint target π invariant.

This is especially common in Bayesian models with conjugacy.

Setup: a multivariate target #

Let x = (x₁, …, x_d) and target π(x₁, …, x_d). Suppose you can sample from each full conditional:

π(x_i | x_{−i})

where x_{−i} denotes all coordinates except i.

Gibbs update rule #

A single-site Gibbs step updates one coordinate at a time:

For i = 1, …, d:

x_i ← draw from π(x_i | x_{−i}).

After sweeping through all coordinates, you have one full Gibbs iteration.

Why Gibbs leaves π invariant (intuition + structure) #

Each conditional update is a Markov kernel K_i that replaces x_i while leaving the others fixed.

If you start with X ∼ π, and then resample X_i from π(x_i | X_{−i}), the resulting joint distribution is still π.

One way to see this is that π factors as:

π(x) = π(x_i | x_{−i}) π(x_{−i}).

So if x_{−i} is distributed as π(x_{−i}) and you sample x_i from the correct conditional, you reconstruct the correct joint.

Composing kernels K = K_d ∘ … ∘ K_1 preserves invariance.

Gibbs as MH with acceptance = 1 #

A useful connection: Gibbs is a special case of MH.

If you propose x′ that differs from x only in coordinate i, with proposal density

q(x′ | x) = π(x′_i | x_{−i})

then the MH acceptance ratio becomes 1, because the proposal already matches the target’s conditional structure.

So Gibbs is “perfectly corrected” by construction.

Block Gibbs #

Sometimes coordinates are strongly coupled; single-site updates mix slowly.

Block Gibbs updates a subset (a block) b at once:

x_b ← draw from π(x_b | x_{−b}).

When you can sample the block conditional, this can greatly reduce autocorrelation.

When Gibbs is hard #

Gibbs requires the ability to sample from conditionals. When conditionals are not standard distributions, you may combine:

This hybrid is extremely common in applied Bayesian computation.

Practical comparison: MH vs Gibbs #

AspectMetropolis–HastingsGibbs
Requiresunnormalized π̃(x) and a proposal qability to sample from conditionals
Accept/rejectyesno (acceptance = 1)
Tuningproposal scale/shapeblock structure/order
Weaknesscan reject oftencan mix slowly with strong correlations

The important mental model: MH is about proposing then correcting; Gibbs is about choosing proposals that need no correction.

Application/Connection: MCMC for Bayesian Posterior Estimation (and How to Use Samples) #

Why MCMC is central in Bayesian inference #

Bayesian workflows often reduce to computing expectations under the posterior:

If you have samples θ¹, …, θⁿ ∼ (approximately) p(θ | data), then:

MCMC turns integrals into averages.

A standard workflow #

  1. 1)Model specification
  1. 2)Choose an MCMC kernel
  1. 3)Run multiple chains
  1. 4)Check diagnostics
  1. 5)Compute posterior summaries

How to estimate uncertainty of estimates #

Because samples are correlated, you should not use i.i.d. standard errors blindly.

Two common approaches:

Constraints, discrete variables, and complex supports #

MCMC is flexible about support constraints.

The key is always the same: define a kernel K that doesn’t leave the support and keeps π invariant.

A geometric intuition (why reparameterization matters) #

If π(θ) has strong correlations, a random-walk proposal in the raw coordinates can be inefficient.

Example: suppose (θ₁, θ₂) posterior mass lies near a tilted ellipse. Proposing independent Gaussian noise in each coordinate moves you orthogonally to the ridge often, causing rejections or slow diffusion.

A reparameterization that “whitens” the posterior (makes it closer to spherical) can drastically reduce τ_int.

What this node unlocks #

Once you internalize (1) invariance via detailed balance and (2) ergodicity for time averages, you’re ready for:

But even at this level, you can already implement correct samplers for many Bayesian models and reason about when they will (or won’t) work.

Worked Examples (3) #

Metropolis–Hastings on a 1D Unnormalized Target (acceptance uses only ratios) #

Target on ℝ: π(x) ∝ π̃(x) where π̃(x) = exp(−x⁴ + 2x²). This is not trivial to normalize, but we can evaluate π̃(x) pointwise.

Use a symmetric random-walk proposal: x′ = x + ε, ε ∼ 𝒩(0, σ²).

  1. Because the proposal is symmetric, q(x′|x) = q(x|x′), so the MH acceptance probability simplifies:

    α(x, x′) = min{1, π(x′)/π(x)} = min{1, π̃(x′)/π̃(x)}.

  2. Compute the unnormalized log density (for numerical stability):

    log π̃(x) = −x⁴ + 2x².

  3. Given current x and proposal x′, form the log acceptance ratio:

    log r = log π̃(x′) − log π̃(x)

    = [−(x′)⁴ + 2(x′)²] − [−x⁴ + 2x²].

  4. Convert back to an acceptance probability:

    r = exp(log r),

    α = min{1, r}.

  5. Decision rule:

    • •Draw u ∼ Uniform(0, 1).
    • •If u ≤ α, accept (next state is x′).
    • •Else reject (next state stays x).
  6. Key observation: the normalizing constant Z = ∫ π̃(x) dx never appears. Any π(x) = π̃(x)/Z gives the same acceptance ratio because Z cancels:

    π(x′)/π(x) = [π̃(x′)/Z] / [π̃(x)/Z] = π̃(x′)/π̃(x).

Insight: Metropolis–Hastings converts “I can evaluate π̃(x) up to a constant” into “I can sample from π.” The entire correction happens through local ratios, so unknown normalization is not a blocker.

Deriving the MH Acceptance Ratio from Detailed Balance #

You have a proposal kernel q(x′|x). You want to construct a corrected kernel K(x→x′) = q(x′|x)α(x,x′) (for x′≠x) such that detailed balance holds with π.

  1. Start from the desired detailed balance condition for x′ ≠ x:

    π(x) q(x′|x) α(x,x′) = π(x′) q(x|x′) α(x′,x).

  2. Rearrange to relate α(x,x′) and α(x′,x):

    α(x,x′) / α(x′,x) = [π(x′) q(x|x′)] / [π(x) q(x′|x)] = r(x,x′).

  3. We need a choice of α that:

    • •stays in [0,1]
    • •satisfies the ratio constraint above.
  4. A standard symmetric construction is:

    α(x,x′) = min{1, r(x,x′)}

    α(x′,x) = min{1, 1/r(x,x′)}.

  5. Check the ratio in two cases.

    Case 1: r ≥ 1.

    Then α(x,x′) = 1 and α(x′,x) = 1/r.

    So α(x,x′)/α(x′,x) = 1 / (1/r) = r.

    Case 2: r < 1.

    Then α(x,x′) = r and α(x′,x) = 1.

    So α(x,x′)/α(x′,x) = r / 1 = r.

  6. Thus detailed balance holds for x′ ≠ x. Then π is invariant for the full kernel once we include the self-loop probability (the probability of rejecting and staying at x).

Insight: The MH acceptance rule is not arbitrary; it is engineered to satisfy detailed balance while keeping α inside [0,1]. This is the simplest general-purpose way to preserve π as invariant.

Gibbs Sampling in a 2D Gaussian: Conditionals are Gaussian #

Let x = (x₁, x₂) have a bivariate normal target:

x ∼ 𝒩(0, Σ), where Σ = [[1, ρ],[ρ, 1]] with |ρ| < 1.

We will derive the full conditionals π(x₁|x₂) and π(x₂|x₁) and describe a Gibbs sampler.

  1. For a bivariate normal, the conditional distribution is normal. Specifically:

    x₁ | x₂ ∼ 𝒩( E[x₁|x₂], Var(x₁|x₂) ).

  2. Compute conditional mean and variance using standard Gaussian conditioning:

    E[x₁|x₂] = ρ x₂,

    Var(x₁|x₂) = 1 − ρ².

    Similarly:

    E[x₂|x₁] = ρ x₁,

    Var(x₂|x₁) = 1 − ρ².

  3. A single Gibbs iteration (a sweep) is:

    1. Sample x₁^{new} ∼ 𝒩(ρ x₂^{old}, 1 − ρ²)

    2. Sample x₂^{new} ∼ 𝒩(ρ x₁^{new}, 1 − ρ²)

  4. No accept/reject step is needed: each update draws exactly from the correct conditional.

  5. Mixing intuition:

    If |ρ| is close to 1, then 1 − ρ² is small, so each conditional draw is tightly concentrated around ρ times the other variable.

    That makes successive samples highly correlated (slow exploration along the thin ellipse), even though every step is accepted.

Insight: Gibbs can be “perfectly accepted” yet still mix slowly when variables are strongly coupled. Correctness (invariance) is separate from efficiency (autocorrelation).

Key Takeaways #

Common Mistakes #

Practice #

easy

You use Metropolis–Hastings with an independence proposal q(x′) (does not depend on x). Write the acceptance probability α(x,x′). Then simplify it as much as possible.

Hint: Start from α(x,x′) = min{1, [π(x′) q(x|x′)]/[π(x) q(x′|x)]}. For independence proposals, q(x′|x)=q(x′) and q(x|x′)=q(x).

Show solution

With independence proposal:

q(x′|x) = q(x′), and q(x|x′) = q(x).

So

α(x,x′) = min{1, [π(x′) q(x)] / [π(x) q(x′)]}.

This form highlights that you accept moves toward regions where π/q is larger.

medium

Suppose you run an MCMC chain and want to estimate μ = Eπ[f(X)]. You compute the empirical mean f̄ₙ. Explain (in 2–4 sentences) why Var(f̄ₙ) is larger than Varπ(f)/n, and name one method to estimate uncertainty correctly.

Hint: Think about autocorrelation: Cov(f(X₀), f(X_k)) terms appear in the variance of the sample mean.

Show solution

Because MCMC samples are correlated, the variance of the average includes lagged covariances:

Var(f̄ₙ) ≈ (Varπ(f)/n) · τ_int where τ_int = 1 + 2∑_{k≥1} ρ_k.

So the estimator behaves like it had only n_eff ≈ n/τ_int independent samples.

One correct uncertainty method is batch means (split the chain into batches and use variability of batch averages), or estimating τ_int directly and scaling Varπ(f)/n.

hard

Consider a two-variable target π(x₁, x₂). You can sample from π(x₁ | x₂) exactly, but π(x₂ | x₁) is not available in closed form. Propose a valid MCMC scheme that still leaves π invariant.

Hint: Mix Gibbs where you can with MH where you can’t: MH-within-Gibbs.

Show solution

Use a hybrid MH-within-Gibbs sampler:

  1. Update x₁ by Gibbs: sample x₁′ ∼ π(x₁ | x₂) and set x₁ ← x₁′.

  2. Update x₂ by an MH step targeting the conditional π(x₂ | x₁):

α = min{1, [π(x₁, x₂′) q(x₂ | x₂′, x₁)] / [π(x₁, x₂) q(x₂′ | x₂, x₁)] }

(equivalently using the conditional since x₁ is fixed).

Composing a Gibbs kernel (invariant) with an MH kernel (invariant) yields an overall kernel that preserves π.

Connections #

Markov Chains

Monte Carlo Methods

Bayesian Inference

Detailed Balance & Reversibility

Gibbs Sampling

Hamiltonian Monte Carlo (HMC)

Variational Inference

Quality: A (4.3/5)

← back to treebrowse all →