Law of Large Numbers


Law of Large Numbers #

Probability & Statistics · Difficulty: ★★★☆☆ · Depth: 4 · Unlocks: 9

Sample mean converges to expected value as sample size grows.


Core Concepts #

Key Symbols & Notation #

X_bar (sample mean)

Essential Relationships #

Prerequisites (2) #

Expected Value (5 atoms), Limits (5 atoms)

Unlocks (2) #

Central Limit Theorem (lvl 3), Monte Carlo Methods (lvl 4)

Referenced by (2) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (2) #

[Insurance (Business)](/business/insurance/): LLN is why insurance works as a business. With enough independent policyholders, the average claim per policyholder converges to its expected value, so a premium set above that expectation is predictably sufficient.

[Quality Control (Business)](/business/quality-control/): The Law of Large Numbers explains why quality-control sampling works: sample means converge to the population mean as n grows, which is also why polls, batch sampling, and casino house edges are reliable.

Advanced Learning Details

Graph Position #

Depth Cost: 45

Fan-Out (ROI): 9

Bottleneck Score: 4

Chain Length: 4

Cognitive Load #

Atomic Elements: 5

Total Elements: 25

Percentile Level: L0

Atomic Level: L3

All Concepts (9) #

Teaching Strategy #

Self-serve tutorial - low prerequisites, straightforward concepts.

You flip a coin 10 times and get 8 heads. That doesn’t mean the coin is biased—small samples are noisy. But if you flip it 10,000 times, the fraction of heads becomes very hard to “keep away” from 0.5. The Law of Large Numbers (LLN) is the formal statement of that stabilizing effect: averages of many independent draws tend to settle near the expected value.

TL;DR:

For iid random variables X₁, X₂, … with finite expected value μ = E[X], the sample mean X̄ₙ = (1/n)∑ᵢ₌₁ⁿ Xᵢ converges in probability to μ as n → ∞. Practically: with enough independent data, averages become predictable (even though individual outcomes stay random).

What Is the Law of Large Numbers? #

The problem it solves (why we need it) #

Probability models talk about expected value μ = E[X]. But expectation is not something you directly observe in one experiment—you observe outcomes.

So there’s a natural question:

If I repeatedly sample from a random process, does the average of what I see approach the theoretical expectation?

LLN is the bridge between theory (μ) and data (sample averages).

The main object: the sample mean #

Let X₁, X₂, … be random variables representing repeated measurements (e.g., repeated coin flips coded as 1=heads, 0=tails; or repeated customer waiting times; or repeated sensor readings).

Define the sample mean after n samples:

X̄ₙ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ

This is the quantity you compute from data.
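As a minimal sketch of how this quantity can be tracked in practice, the snippet below updates X̄ₙ incrementally as each observation arrives. The function name `running_means` and the four-flip data set are illustrative assumptions, not part of the original text.

```python
def running_means(xs):
    """Return [X̄_1, X̄_2, ..., X̄_n] for the observations in xs.

    Uses the incremental update X̄_n = X̄_{n-1} + (x_n - X̄_{n-1}) / n,
    which is algebraically the same as (1/n) * (sum of the first n values).
    """
    means = []
    mean = 0.0
    for n, x in enumerate(xs, start=1):
        mean += (x - mean) / n
        means.append(mean)
    return means


# Hypothetical data: four coin flips coded as 1 = heads, 0 = tails.
flips = [1, 0, 1, 1]
print(running_means(flips))
```

The incremental form avoids storing or re-summing all past samples, which is convenient when data arrives as a stream.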

Informal statement #

If X₁, X₂, … are iid (independent and identically distributed) and have a finite mean μ, then the sample mean X̄ₙ gets close to μ as n grows.

This is not because randomness “goes away,” but because positive and negative deviations tend to cancel when you average many independent samples.

Formal statement (Weak Law of Large Numbers) #

The most common version used in statistics and ML is the Weak Law of Large Numbers (WLLN):

X̄ₙ → μ in probability as n → ∞.

That sentence uses a key convergence idea. Unpacking it:

For every ε > 0,

P(|X̄ₙ − μ| > ε) → 0 as n → ∞.

Interpretation: the probability that the sample mean differs from μ by more than ε becomes tiny for large n.
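One way to see this definition at work is to estimate P(|X̄ₙ − μ| > ε) by simulation for a few values of n. The sketch below assumes a fair coin (μ = 0.5), a tolerance ε = 0.1, 2000 repetitions per n, and an arbitrary fixed seed; all of these are illustrative choices.

```python
import random


def deviation_probability(n, eps, trials, rng):
    """Estimate P(|X̄_n - 0.5| > eps) for n fair-coin flips by simulation."""
    bad = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if abs(heads / n - 0.5) > eps:
            bad += 1
    return bad / trials


rng = random.Random(0)  # arbitrary fixed seed so the run is reproducible
estimates = {
    n: deviation_probability(n, eps=0.1, trials=2000, rng=rng)
    for n in (10, 100, 1000)
}
for n, p in estimates.items():
    print(f"n={n:5d}  estimated P(|mean - 0.5| > 0.1) = {p:.4f}")
```

The estimated probability should fall sharply as n increases, which is the behavior the definition describes.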

What LLN is not #

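LLN does not say that individual outcomes become predictable, that a streak will be “balanced out” by later draws (the gambler’s fallacy), or that X̄ₙ equals μ exactly at any finite n.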

A helpful intuition: averaging reduces relative noise #

If each Xᵢ has typical fluctuation scale σ (standard deviation), the average X̄ₙ has fluctuation scale roughly σ/√n. That √n in the denominator is the core intuition for why averages stabilize. (This also hints at the Central Limit Theorem you’ll unlock later.)

Two versions you might hear about #

There are multiple LLNs. Two common ones:

| Name | Statement (informal) | Mode of convergence | Typical assumptions |
|---|---|---|---|
| Weak LLN | X̄ₙ gets close to μ with high probability | In probability | iid, finite variance (one sufficient condition) |
| Strong LLN | X̄ₙ → μ “almost surely” | Almost sure | iid, E[\|X\|] < ∞ (finite mean) |

In this node we emphasize weak LLN because it connects directly to statistical guarantees and concentration-style reasoning.

Core Mechanic 1: iid Samples and Why Independence Matters #

Why “identically distributed” matters #

“Identically distributed” means every Xᵢ comes from the same distribution, so all of them share the same mean μ (and the same variance, when it exists).

This ensures that when you average, you’re averaging comparable quantities. If the distribution changes over time (non-stationarity), the “target” expectation might drift, and the average may not settle.

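Example of failure without identical distribution: if Xᵢ = i + noiseᵢ with zero-mean iid noise, then X̄ₙ = (n + 1)/2 + (average noise), which grows without bound instead of settling near a fixed μ.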

Why independence matters #

Independence is the cancellation engine.

If Xᵢ are independent, deviations above μ in one sample don’t systematically force deviations above μ in others. Over many draws, positive and negative deviations tend to balance.

If they’re correlated, then errors can reinforce instead of cancel.

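A quick illustration: if every Xᵢ is a copy of a single draw X (so X₁ = X₂ = … = X), then X̄ₙ = X for every n, and averaging removes no noise at all.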

The variance calculation that explains stabilization #

Assume X₁, …, Xₙ are iid with mean μ and variance Var(Xᵢ) = σ² < ∞.

Compute Var(X̄ₙ).

Start with:

X̄ₙ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ

Then:

Var(X̄ₙ)

= Var((1/n) ∑ᵢ₌₁ⁿ Xᵢ)

= (1/n²) Var(∑ᵢ₌₁ⁿ Xᵢ)

Now use independence:

Var(∑ᵢ₌₁ⁿ Xᵢ) = ∑ᵢ₌₁ⁿ Var(Xᵢ) = ∑ᵢ₌₁ⁿ σ² = nσ²

So:

Var(X̄ₙ) = (1/n²)(nσ²) = σ² / n

That is the quantitative form of “averaging reduces noise.” The standard deviation of X̄ₙ is:

SD(X̄ₙ) = √Var(X̄ₙ) = σ/√n

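This is why doubling your sample size does not halve your error: the standard deviation shrinks like 1/√n, so doubling n cuts it by a factor of √2 ≈ 1.41, and halving it takes four times as many samples.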
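The identity Var(X̄ₙ) = σ²/n can be checked numerically. The following sketch assumes Uniform(0, 1) samples (so σ² = 1/12), 4000 repetitions per sample size, and an arbitrary seed; the helper name `variance_of_sample_mean` is hypothetical.

```python
import random
import statistics

SIGMA2 = 1 / 12  # variance of a single Uniform(0, 1) draw


def variance_of_sample_mean(n, trials, rng):
    """Empirical variance of X̄_n across many independent repetitions."""
    means = []
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.pvariance(means)


rng = random.Random(1)  # arbitrary fixed seed
results = {n: variance_of_sample_mean(n, trials=4000, rng=rng) for n in (4, 16, 64)}
for n, v in results.items():
    print(f"n={n:3d}  empirical Var={v:.5f}  sigma^2/n={SIGMA2 / n:.5f}")
```

Each fourfold increase in n should cut the empirical variance by roughly a factor of four, and therefore the standard deviation by roughly a factor of two.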

Turning variance shrinkage into a probability guarantee (Chebyshev) #

To connect to “converges in probability,” we need a bound on:

P(|X̄ₙ − μ| > ε)

Chebyshev’s inequality says for any random variable Y with finite variance:

P(|Y − E[Y]| ≥ ε) ≤ Var(Y)/ε²

Apply it to Y = X̄ₙ. Note E[X̄ₙ] = μ (linearity of expectation):

P(|X̄ₙ − μ| ≥ ε)

≤ Var(X̄ₙ)/ε²

= (σ²/n)/ε²

= σ²/(nε²)

Now observe what happens as n → ∞:

σ²/(nε²) → 0

Therefore:

P(|X̄ₙ − μ| ≥ ε) → 0

This is exactly:

X̄ₙ → μ in probability.

So one clean path to LLN is:

  1. independence ⇒ Var(X̄ₙ) = σ²/n

  2. Chebyshev ⇒ probability of large deviation ≤ σ²/(nε²)

  3. RHS → 0 ⇒ convergence in probability

What conditions are actually required? #

The derivation above assumes finite variance, which is what makes the Chebyshev argument work. LLN also holds under weaker conditions: for iid samples, a finite mean alone is enough for both the weak and the strong law, although the proofs are harder and the σ²/(nε²) rate is no longer available. For many practical ML/statistics contexts, “finite variance + iid” is the standard mental model.

A practical checklist:

| Assumption | What it buys you | What breaks without it |
|---|---|---|
| Independence | variances add; cancellation of noise | correlations can prevent shrinkage |
| Identical distribution | stable target μ | drift makes X̄ chase a moving target |
| Finite mean E[X] | defines μ | no meaningful target to converge to |
| Finite variance Var(X) | easy Chebyshev proof + rates | heavy tails can slow/complicate convergence |

Core Mechanic 2: Convergence in Probability (What the Limit Actually Means) #

Why we need a new kind of “limit” #

You already know limits for numbers and functions: aₙ → a means the sequence of numbers gets arbitrarily close to a.

But X̄ₙ is a random variable. For each n, X̄ₙ is not a single number—it’s a distribution over possible sample means.

So “X̄ₙ approaches μ” must mean something like: for large n, X̄ₙ is very likely to land within any small tolerance ε of μ.

That is exactly what convergence in probability captures.

Definition (slow and explicit) #

We say Xₙ → c in probability if:

∀ ε > 0, P(|Xₙ − c| > ε) → 0 as n → ∞.

Key points:

Connecting to LLN #

LLN states:

X̄ₙ → μ in probability.

Meaning:

∀ ε > 0, P(|X̄ₙ − μ| > ε) → 0.

This is the statement you use when you want to justify:

Visual intuition: shrinking spread #

Imagine the sampling distribution of X̄ₙ as n increases: it stays centered at μ while its spread shrinks like σ/√n.

So the probability mass near μ grows, and the mass far away shrinks.

Even without drawing the picture, you can think: the region (μ − ε, μ + ε) captures more and more probability.

Convergence in probability vs almost sure (high-level) #

Sometimes learners hear “converges almost surely” and assume it’s the same. It’s stronger.

In this node, the weak LLN is enough to support most statistical reasoning and to motivate Monte Carlo.

A practical reading: sample size as a knob #

Chebyshev gave:

P(|X̄ₙ − μ| ≥ ε) ≤ σ²/(nε²)

Treat it like a design inequality. Want the failure probability ≤ δ?

Require:

σ²/(nε²) ≤ δ

Solve for n:

n ≥ σ²/(δ ε²)

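This is not always tight, but it teaches a key scaling law: the required n grows like 1/ε² and like 1/δ, so halving the tolerance ε costs four times as many samples.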

Even if you later use sharper inequalities (Hoeffding, Bernstein), LLN is the conceptual foundation: more samples ⇒ more reliable averages.
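The design inequality n ≥ σ²/(δ ε²) is easy to turn into a small calculator. This is a sketch under the Chebyshev bound only; the function name `chebyshev_sample_size` and the example values (σ² = 0.25, ε = 0.05, δ = 0.05) are assumptions chosen for illustration. Exact rational arithmetic is used so the rounding-up step is not affected by floating-point error.

```python
from fractions import Fraction
import math


def chebyshev_sample_size(sigma2, eps, delta):
    """Smallest integer n with sigma2 / (n * eps**2) <= delta.

    Arguments may be ints or decimal strings such as "0.05"; they are
    converted to Fraction so the boundary case is computed exactly.
    """
    sigma2, eps, delta = Fraction(sigma2), Fraction(eps), Fraction(delta)
    if sigma2 < 0 or eps <= 0 or delta <= 0:
        raise ValueError("need sigma2 >= 0, eps > 0 and delta > 0")
    return max(1, math.ceil(sigma2 / (delta * eps ** 2)))


n_base = chebyshev_sample_size("0.25", "0.05", "0.05")    # 2000
n_tight = chebyshev_sample_size("0.25", "0.025", "0.05")  # 8000: half the eps, 4x the n
print(n_base, n_tight)
```

Because Chebyshev is a worst-case bound, the sample sizes it returns are typically conservative.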

A note about vectors (for ML context) #

Often in ML you average vectors (e.g., average gradient estimates). If Xᵢ are iid random vectors in ℝᵈ with mean μ = E[X], then a vector-valued LLN holds under similar finite-moment conditions:

X̄ₙ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ → μ (in probability, componentwise).

To measure error you might use a norm like ‖X̄ₙ − μ‖.

You don’t need the full vector LLN here, but it’s worth seeing how naturally the scalar idea generalizes.
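To make the vector case concrete, here is a small sketch that averages iid random vectors in ℝ³ and reports the Euclidean error ‖X̄ₙ − μ‖. The mean vector `MU`, the Uniform(−1, 1) noise model, and the seed are all hypothetical choices for illustration.

```python
import math
import random

MU = (1.0, -2.0, 0.5)  # hypothetical true mean vector in R^3


def sample_vector(rng):
    """One random vector: MU plus independent Uniform(-1, 1) noise per coordinate."""
    return tuple(m + rng.uniform(-1.0, 1.0) for m in MU)


def mean_vector(n, rng):
    """Componentwise sample mean of n iid random vectors."""
    totals = [0.0] * len(MU)
    for _ in range(n):
        for j, value in enumerate(sample_vector(rng)):
            totals[j] += value
    return tuple(t / n for t in totals)


rng = random.Random(2)  # arbitrary fixed seed
errors = {n: math.dist(mean_vector(n, rng), MU) for n in (100, 20_000)}
for n, err in errors.items():
    print(f"n={n:6d}  ||mean - MU|| = {err:.4f}")
```

The same averaging code works for gradient estimates: replace `sample_vector` with a function that returns one per-example gradient.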

Application/Connection: Why LLN Powers Statistics, Monte Carlo, and the CLT #

1) Statistics: why sample averages estimate expectations #

Many estimators are averages: a sample proportion, for instance, is the sample mean of 0/1 indicator variables.

If you define a quantity of interest as an expectation:

μ = E[g(X)]

and you can sample X₁, …, Xₙ iid, then you can estimate μ by:

μ̂ₙ = (1/n) ∑ᵢ₌₁ⁿ g(Xᵢ)

LLN says μ̂ₙ → μ in probability (under the same style of assumptions). This is the justification for “plug in the average.”

2) Monte Carlo: turning expectations into computation #

Monte Carlo methods rely on the identity:

μ = E[g(X)]

When μ is hard to compute analytically (e.g., high-dimensional integrals), you estimate it with samples. LLN is the guarantee that:

(1/n) ∑ g(Xᵢ)

stabilizes as n grows.

Without LLN, Monte Carlo would be a gamble with no convergence story.
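As a hedged illustration of the pattern, the sketch below wraps “average g over iid samples” in a generic helper and uses it to estimate π. The example relies on the fact that a uniform point in the unit square lands inside the quarter circle with probability π/4; the helper names and the seed are assumptions made for this example.

```python
import math
import random


def monte_carlo_mean(g, sampler, n, rng):
    """Estimate E[g(X)] by averaging g over n iid draws from sampler(rng)."""
    return sum(g(sampler(rng)) for _ in range(n)) / n


def uniform_point(rng):
    """One point drawn uniformly from the unit square [0, 1) x [0, 1)."""
    return (rng.random(), rng.random())


def inside_quarter_circle(point):
    """Indicator g(X): 1.0 if the point lies within distance 1 of the origin."""
    x, y = point
    return 1.0 if x * x + y * y <= 1.0 else 0.0


rng = random.Random(3)  # arbitrary fixed seed
pi_estimate = 4 * monte_carlo_mean(inside_quarter_circle, uniform_point, 100_000, rng)
print(f"estimate = {pi_estimate:.4f}   math.pi = {math.pi:.4f}")
```

Swapping in a different `g` and `sampler` gives an estimator for any other expectation you can sample from.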

3) Central Limit Theorem: what LLN doesn’t tell you #

LLN tells you convergence happens, but it doesn’t describe the shape of fluctuations.

The CLT (which you’ll unlock next) refines the picture: roughly,

√n (X̄ₙ − μ)

approaches a normal distribution with mean 0 and variance σ².

So you can build approximate confidence intervals, p-values, and error bars.

Relationship summary:

| Concept | What it answers | Typical output |
|---|---|---|
| LLN | Does X̄ₙ get close to μ? | Convergence guarantee |
| CLT | How does X̄ₙ fluctuate around μ for large n? | Approximate normal distribution |

4) ML connection: empirical risk minimization (ERM) #

In supervised learning, you often want to minimize expected loss:

R(θ) = E[L(θ; Z)]

But you only have data Z₁, …, Zₙ. You minimize empirical risk:

R̂ₙ(θ) = (1/n) ∑ᵢ₌₁ⁿ L(θ; Zᵢ)

LLN suggests that for a fixed θ, R̂ₙ(θ) → R(θ) in probability as n increases.

Caution: learning theory needs uniform convergence over θ, which is a deeper topic (VC dimension, Rademacher complexity). But LLN is the first stepping stone: averages over iid data approximate expectations.
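The fixed-θ statement can be illustrated with a toy setup. This sketch assumes a squared loss L(θ; z) = (z − θ)² and Z ~ Uniform(0, 1), for which the true risk has the closed form Var(Z) + (E[Z] − θ)² = 1/12 + (0.5 − θ)². The loss, the distribution, and θ = 0.2 are illustrative assumptions, not taken from the text.

```python
import random


def squared_loss(theta, z):
    """Toy loss L(theta; z) = (z - theta)^2 (an assumption for this example)."""
    return (z - theta) ** 2


def empirical_risk(theta, data):
    """R̂_n(theta): the average loss over the observed data."""
    return sum(squared_loss(theta, z) for z in data) / len(data)


def true_risk(theta):
    """R(theta) for Z ~ Uniform(0, 1): Var(Z) + (E[Z] - theta)^2."""
    return 1 / 12 + (0.5 - theta) ** 2


rng = random.Random(4)  # arbitrary fixed seed
theta = 0.2
data = [rng.random() for _ in range(50_000)]
for n in (100, 5_000, 50_000):
    print(f"n={n:6d}  empirical={empirical_risk(theta, data[:n]):.4f}  "
          f"true={true_risk(theta):.4f}")
```

Note that θ is held fixed here. Checking convergence for a θ chosen after looking at the data is exactly where uniform convergence becomes necessary.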

5) Why the law is so widely applicable #

Because expectations are linear and averaging is simple, LLN becomes a universal tool.

Whenever you see “estimate an expectation by sampling,” LLN is the hidden backbone.

Worked Examples (3) #

Coin flips as Bernoulli variables (LLN in the simplest case) #

Let Xᵢ = 1 if the i-th flip is heads, 0 otherwise. Assume a fair coin: P(Xᵢ = 1) = 0.5. Then μ = E[Xᵢ] = 0.5. The sample mean X̄ₙ is the fraction of heads in n flips.

  1. Compute the expectation:

    E[Xᵢ] = 1·P(Xᵢ=1) + 0·P(Xᵢ=0)

    = 1·0.5 + 0·0.5

    = 0.5

    So μ = 0.5.

  2. Compute the variance of one flip:

    Var(Xᵢ) = E[Xᵢ²] − (E[Xᵢ])².

    But Xᵢ ∈ {0,1} so Xᵢ² = Xᵢ.

    Therefore E[Xᵢ²] = E[Xᵢ] = 0.5.

    So Var(Xᵢ) = 0.5 − (0.5)² = 0.5 − 0.25 = 0.25.

  3. Compute the variance of the sample mean using iid independence:

    Var(X̄ₙ) = Var((1/n)∑ᵢ₌₁ⁿ Xᵢ)

    = (1/n²) Var(∑ᵢ₌₁ⁿ Xᵢ)

    = (1/n²) ∑ᵢ₌₁ⁿ Var(Xᵢ)

    = (1/n²) · n · 0.25

    = 0.25/n.

  4. Use Chebyshev to bound the probability of deviating from 0.5:

    For any ε > 0,

    P(|X̄ₙ − 0.5| ≥ ε) ≤ Var(X̄ₙ)/ε² = (0.25/n)/ε² = 0.25/(nε²).

  5. See the LLN limit directly:

    As n → ∞, 0.25/(nε²) → 0.

    So P(|X̄ₙ − 0.5| ≥ ε) → 0.

    Hence X̄ₙ → 0.5 in probability.

Insight: Even though each flip stays maximally random, the average of many flips becomes predictable because its variance shrinks like 1/n. LLN formalizes the everyday belief that “more trials smooth out randomness.”
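For the fair coin, the deviation probability can be computed exactly from binomial counts and compared with the Chebyshev bound from step 4. The sketch below uses exact rational arithmetic; the function names and the choice ε = 1/10 are assumptions for illustration, and the bound is capped at 1 because a probability cannot exceed 1 (step 4 itself does not apply a cap).

```python
from fractions import Fraction
from math import comb


def exact_deviation_probability(n, eps):
    """Exact P(|X̄_n - 1/2| >= eps) for n fair-coin flips (eps as a Fraction)."""
    # |k/n - 1/2| >= eps  <=>  |2k - n| >= 2 * eps * n, checked in exact arithmetic.
    favorable = sum(
        comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= 2 * eps * n
    )
    return Fraction(favorable, 2 ** n)


def chebyshev_bound(n, eps):
    """Chebyshev upper bound 0.25 / (n * eps^2), capped at 1."""
    return min(Fraction(1), Fraction(1, 4) / (n * eps ** 2))


eps = Fraction(1, 10)
for n in (10, 100, 1000):
    exact = exact_deviation_probability(n, eps)
    bound = chebyshev_bound(n, eps)
    print(f"n={n:5d}  exact={float(exact):.6f}  chebyshev={float(bound):.6f}")
```

Both columns shrink toward 0 as n grows, but the exact probability does so much faster, which shows how loose Chebyshev can be.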

Monte Carlo estimation of an expectation (why simulation converges) #

Suppose you want μ = E[X²] where X ~ Uniform(0, 1). You can compute it analytically, but pretend you can’t. You sample X₁, …, Xₙ iid ~ Uniform(0,1) and estimate μ by μ̂ₙ = (1/n)∑ Xᵢ².

  1. Define the estimator as a sample mean of a transformed variable:

    Let Yᵢ = Xᵢ².

    Then μ̂ₙ = (1/n)∑ᵢ₌₁ⁿ Yᵢ.

    If the Xᵢ are iid, then the Yᵢ are also iid (same transformation applied independently).

  2. Compute the true expectation (for reference):

    μ = E[X²] = ∫₀¹ x² dx

    = [x³/3]₀¹

    = 1/3.

  3. Check the LLN conditions:

    E[Yᵢ] = E[Xᵢ²] = 1/3 is finite.

    Also Var(Yᵢ) is finite because 0 ≤ Yᵢ ≤ 1.

    So the weak LLN applies.

  4. State the LLN conclusion:

    μ̂ₙ = (1/n)∑ Yᵢ → E[Y] = 1/3 in probability as n → ∞.

    Equivalently, for every ε > 0:

    P(|μ̂ₙ − 1/3| > ε) → 0.

  5. Optional rate intuition via variance shrinkage:

    Var(μ̂ₙ) = Var(Y)/n.

    So typical error scale is SD(μ̂ₙ) = √Var(Y)/√n.

    This explains why Monte Carlo accuracy improves like 1/√n.

Insight: Monte Carlo is not magic—it’s “just” LLN applied to a clever choice of Y = g(X). Once you can sample X, you can estimate E[g(X)] by averaging g(Xᵢ), and LLN guarantees stabilization.
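The example above can be run directly. This sketch estimates E[X²] for X ~ Uniform(0, 1) at several sample sizes and prints the standard error √Var(Y)/√n from step 5 alongside each estimate. The value Var(Y) = E[X⁴] − (E[X²])² = 1/5 − 1/9 = 4/45 is computed here for reference; the sample sizes and seed are arbitrary choices.

```python
import math
import random

TRUE_MEAN = 1 / 3             # E[X^2] for X ~ Uniform(0, 1)
VAR_Y = 1 / 5 - (1 / 3) ** 2  # Var(X^2) = E[X^4] - (E[X^2])^2 = 4/45

rng = random.Random(5)  # arbitrary fixed seed
report = []
for n in (100, 10_000, 40_000):
    estimate = sum(rng.random() ** 2 for _ in range(n)) / n
    std_error = math.sqrt(VAR_Y / n)  # typical error scale SD(mu_hat_n)
    report.append((n, estimate, std_error))
    print(f"n={n:6d}  estimate={estimate:.5f}  "
          f"|error|={abs(estimate - TRUE_MEAN):.5f}  std_error={std_error:.5f}")
```

The observed error should usually be within a small multiple of the standard error, and the standard error itself drops by a factor of 10 when n grows by a factor of 100.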

A dependence counterexample: repeating the same measurement doesn’t average out #

Let X be a random variable with E[X] = μ and Var(X) = σ² > 0. Define X₁ = X₂ = … = Xₙ = X (perfect dependence). Consider X̄ₙ.

  1. Compute the sample mean:

    X̄ₙ = (1/n)∑ᵢ₌₁ⁿ Xᵢ = (1/n) · nX = X.

  2. Compute deviation probability:

    For any ε > 0,

    P(|X̄ₙ − μ| > ε) = P(|X − μ| > ε).

    This probability does not depend on n.

  3. Conclude no convergence in probability (unless X is degenerate):

    Because σ² > 0, there is some ε > 0 with P(|X − μ| > ε) > 0. For that ε, P(|X̄ₙ − μ| > ε) is the same positive constant for every n, so it does not go to 0.

    Therefore X̄ₙ does not converge to μ in probability.

Insight: LLN is not “about large n” alone. It is about independent information. If your samples are fully dependent, n is an illusion—you did not actually collect more evidence.
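A short simulation makes the counterexample tangible. The sketch assumes X ~ Uniform(0, 1) (so μ = 0.5) and ε = 0.25, for which P(|X − μ| > ε) = 0.5; the function name and seed are hypothetical. The same seed is reused for every n so that each run sees the same underlying draws.

```python
import random


def deviation_probability_dependent(n, eps, trials, rng):
    """Estimate P(|X̄_n - 0.5| > eps) when X_1 = ... = X_n = X, X ~ Uniform(0, 1)."""
    bad = 0
    for _ in range(trials):
        x = rng.random()
        sample = [x] * n        # n copies of the same draw: perfect dependence
        mean = sum(sample) / n  # equals x up to floating-point rounding
        if abs(mean - 0.5) > eps:
            bad += 1
    return bad / trials


probs = {
    n: deviation_probability_dependent(n, eps=0.25, trials=4000, rng=random.Random(6))
    for n in (1, 10, 1000)
}
for n, p in probs.items():
    print(f"n={n:5d}  estimated P(|mean - 0.5| > 0.25) = {p:.3f}")
```

Unlike the iid case, the estimated probability stays near 0.5 no matter how large n gets.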

Key Takeaways #

Common Mistakes #

Practice #

easy

Let X₁, X₂, … be iid with E[Xᵢ] = 10 and Var(Xᵢ) = 9. Use Chebyshev to find an n such that P(|X̄ₙ − 10| ≥ 1) ≤ 0.05.

Hint: Use P(|X̄ₙ − μ| ≥ ε) ≤ σ²/(nε²) with σ² = 9, ε = 1, and set the RHS ≤ 0.05.

Show solution

Chebyshev:

P(|X̄ₙ − 10| ≥ 1) ≤ Var(X̄ₙ)/1².

Var(X̄ₙ) = σ²/n = 9/n.

Require 9/n ≤ 0.05 ⇒ n ≥ 9/0.05 = 180.

So n = 180 suffices (or any larger n).

medium

Let Xᵢ be iid Bernoulli(p): P(Xᵢ=1)=p, P(Xᵢ=0)=1−p. Show that X̄ₙ converges in probability to p using the variance + Chebyshev approach.

Hint: Compute E[Xᵢ] and Var(Xᵢ)=p(1−p). Then compute Var(X̄ₙ) and apply Chebyshev.

Show solution

Compute expectation:

E[Xᵢ] = 1·p + 0·(1−p) = p.

So μ = p.

Compute variance:

Since Xᵢ² = Xᵢ,

Var(Xᵢ) = E[Xᵢ²] − (E[Xᵢ])² = E[Xᵢ] − p² = p − p² = p(1−p).

Let σ² = p(1−p).

Variance of the mean (iid independence):

Var(X̄ₙ) = σ²/n = p(1−p)/n.

Chebyshev:

P(|X̄ₙ − p| ≥ ε) ≤ Var(X̄ₙ)/ε² = p(1−p)/(nε²).

As n → ∞, p(1−p)/(nε²) → 0.

Thus P(|X̄ₙ − p| ≥ ε) → 0 for every ε > 0, i.e., X̄ₙ → p in probability.

hard

You collect data from a process with strong positive correlation (e.g., Xᵢ = Z + noiseᵢ where Z is a shared random offset). Explain qualitatively how this can slow or break the stabilization of X̄ₙ compared to the iid case.

Hint: Think about what happens if many samples share the same random component; averaging cancels independent noise but not shared noise.

Show solution

If Xᵢ share a common random offset Z, then each sample contains the same source of randomness. Averaging reduces only the independent part (the noiseᵢ), but the shared part Z does not cancel because it appears in every term.

For example, if Xᵢ = Z + εᵢ with E[Z]=0 and εᵢ iid with E[εᵢ]=0, then

X̄ₙ = Z + (1/n)∑ εᵢ.

As n → ∞, (1/n)∑ εᵢ → 0 (by LLN for εᵢ), but Z remains. So X̄ₙ converges to Z, not to 0, and the limiting value is still random. This shows why independence (or at least weak dependence) is crucial for LLN-style stabilization.
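The shared-offset model can also be simulated. This sketch assumes Z and the noise terms are independent Uniform(−1, 1) draws (variance 1/3 each), in which case Var(X̄ₙ) = Var(Z) + Var(noise)/n = 1/3 + 1/(3n). The distributions, the repetition count, and the seed are illustrative assumptions.

```python
import random
import statistics


def sample_mean_with_shared_offset(n, rng):
    """X̄_n for X_i = Z + noise_i, with one shared Z and fresh noise per sample."""
    z = rng.uniform(-1.0, 1.0)                           # shared offset, Var(Z) = 1/3
    noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # iid noise, Var = 1/3
    return z + sum(noise) / n


rng = random.Random(7)  # arbitrary fixed seed
variances = {}
for n in (1, 20, 400):
    means = [sample_mean_with_shared_offset(n, rng) for _ in range(3000)]
    variances[n] = statistics.pvariance(means)
    print(f"n={n:4d}  empirical Var={variances[n]:.4f}  "
          f"theory={1 / 3 + 1 / (3 * n):.4f}")
```

The variance falls at first but levels off near Var(Z) = 1/3 instead of going to 0, which is the floor that shared randomness imposes.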

Connections #

Next nodes you can study: Central Limit Theorem, Monte Carlo Methods

Related background nodes to review: Expected Value, Limits
