Bias-Variance Tradeoff



Machine Learning | Difficulty: ★★★★☆ | Depth: 9 | Unlocks: 2

Decomposition of prediction error. Underfitting vs overfitting.


Core Concepts #

Key Symbols & Notation #

f_hat(x) - the learned/predicted function at input x

f(x) - the true target function at input x

Essential Relationships #

Prerequisites (2) #

Expected Value (5 atoms)

Machine Learning Introduction (5 atoms)

Unlocks (2) #

Cross-Validation (lvl 4)

Ensemble Methods (lvl 4)

Referenced by (1) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (1) #

[failure mode (Business)](/business/failure-mode/): The bias-variance decomposition is the canonical ML framework for classifying failure modes. It splits prediction error into two named failure modes (underfitting and overfitting) and prescribes a different intervention for each, which gives a mathematical foundation for identifying distinct failure modes and encoding them into corrective frameworks.


Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Two models can have the same training error and wildly different test error. The bias–variance tradeoff explains why: learning is a tug-of-war between systematic error (bias) and sensitivity to data (variance), with an unavoidable floor set by noise.

TL;DR:

At a fixed input x, the expected squared prediction error decomposes as

E[(f̂(x) − y)²] = (Bias[f̂(x)])² + Var(f̂(x)) + σ².

Bias measures how far the average learned prediction is from the true function f(x). Variance measures how much the learned prediction changes across different training sets. σ² is irreducible noise from the data-generating process.

What Is Bias–Variance Tradeoff? #

Why this concept exists #

In supervised learning, you don’t just want to fit the data you already saw—you want to predict new data. The uncomfortable part is that training data is only one random draw from many possible datasets you could have received. If you trained the same algorithm again on a different sample, you’d typically get a different predictor.

So when you ask “How good is my model?” you really mean something like:

If I retrained on a fresh training set and predicted a new example, what error should I expect on average?

Bias–variance is the language for answering that. It separates error into:

  1. Noise (irreducible): randomness in y that no model can predict perfectly.
  2. Bias (systematic error): the model class (and training procedure) tends to miss the true relationship in a consistent direction.
  3. Variance (instability): the learned model changes a lot when the training set changes.

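This is what people mean by underfitting vs overfitting: underfitting is error dominated by bias (the model is too rigid to capture f), and overfitting is error dominated by variance (the model tracks quirks of the particular training set).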

The setting (what’s random?) #

To make this precise, we assume a data-generating process:

y = f(x) + ε

where f(x) is the true (unknown) target function and ε is random noise with E[ε] = 0 and Var(ε) = σ².

A learning algorithm takes a training set D and produces a predictor:

f̂_D(x)

We will often shorten this to f̂(x), but it’s important to remember that f̂ depends on the training set D, so f̂(x) is itself a random quantity.

What “bias” and “variance” mean at a single x #

Fix an input x. Across many possible training sets, you get many learned predictions:

f̂_D₁(x), f̂_D₂(x), …

Define:

Bias(x) = E_D[f̂(x)] − f(x)

Var(x) = Var_D(f̂(x)) = E_D[(f̂(x) − E_D[f̂(x)])²]

Bias is about the center of the cloud of possible learned predictions. Variance is about the spread of that cloud.
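The two definitions can be approximated by simulation when the true function is known. The sketch below is only an illustration under assumed settings: the target `sin(2πx)`, the noise level, the training-set size, and the choice of a straight-line fit are all hypothetical and not part of the original text. It retrains on many synthetic training sets and records the prediction at one fixed input.

```python
import numpy as np

def estimate_bias_variance(x0, true_f, degree=1, n_train=30, noise_sd=0.5,
                           n_datasets=2000, seed=0):
    """Monte Carlo estimate of Bias(x0) and Var(x0) for a polynomial fit."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        # Draw a fresh training set D from y = f(x) + eps (assumed process).
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = true_f(x) + rng.normal(0.0, noise_sd, size=n_train)
        # Train on D and record the prediction at the fixed input x0.
        coeffs = np.polyfit(x, y, deg=degree)
        preds[i] = np.polyval(coeffs, x0)
    bias = preds.mean() - true_f(x0)   # center of the cloud minus f(x0)
    variance = preds.var()             # spread of the cloud around its center
    return bias, variance

true_f = lambda x: np.sin(2 * np.pi * x)   # hypothetical target function
bias, variance = estimate_bias_variance(0.25, true_f)
print(f"Bias(x0) ~ {bias:.3f}, Var(x0) ~ {variance:.3f}")
```

With these settings a straight line cannot reach the peak of the sine curve at x0 = 0.25, so the estimated bias should come out clearly negative while the variance stays small.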

Big picture: one input x vs overall performance #

The classic decomposition is at a fixed x. To get a single number for a model, you typically average over x as well:

E_x[E_D[(f̂(x) − y)²]]

But the intuition is cleanest at one x first: bias and variance can differ across regions of input space. A model might be stable (low variance) where data is plentiful and unstable (high variance) near the edges.

A quick comparison table #

| Term | Depends on training set D? | Depends on noise ε? | What it measures | Can we reduce it? |
|---|---|---|---|---|
| (Bias(x))² | via E_D[f̂(x)] | no | systematic mismatch from f(x) | yes (richer model, better features, less regularization) |
| Var(x) | yes | no | sensitivity to dataset sampling | yes (more data, regularization, bagging/ensembles) |
| σ² (noise) | no | yes | inherent randomness in y given x | not without changing measurement process |

Bias–variance tradeoff is the practical reality that many interventions that reduce one can increase the other.

Core Mechanic 1: The Error Decomposition at Input x #

Why decompose error? #

When you measure squared error, it’s hard to tell why the error is happening. Is the model too simple? Too wiggly? Is the data noisy? The decomposition tells you how much error comes from each cause—at least conceptually.

We’ll derive the standard result:

E_D,ε[(f̂(x) − y)²] = (Bias(x))² + Var(x) + σ²

where y = f(x) + ε.

Step 0: write the prediction error using the data model #

Start from:

(f̂(x) − y)²

Substitute y = f(x) + ε:

(f̂(x) − f(x) − ε)²

Now take expectation over both sources of randomness: the training set D (which determines f̂) and the test noise ε. So we consider:

E_D,ε[(f̂(x) − f(x) − ε)²]

Step 1: separate the noise term #

Expand the square:

(f̂(x) − f(x) − ε)²

= (f̂(x) − f(x))² − 2ε(f̂(x) − f(x)) + ε²

Take expectation:

E[(f̂(x) − f(x))²] − 2E[ε(f̂(x) − f(x))] + E[ε²]

Now use two key assumptions:

  1. E[ε] = 0
  2. ε is independent of D, hence independent of f̂(x) (since f̂ is computed from D)

So:

E[ε(f̂(x) − f(x))] = E[ε] · E[f̂(x) − f(x)] = 0

And E[ε²] = Var(ε) = σ².

Thus:

E_D,ε[(f̂(x) − y)²] = E_D[(f̂(x) − f(x))²] + σ²

So the “interesting” part is E_D[(f̂(x) − f(x))²].

Step 2: add and subtract the mean predictor #

Let μ(x) = E_D[f̂(x)]. Add and subtract μ(x):

f̂(x) − f(x)

= (f̂(x) − μ(x)) + (μ(x) − f(x))

Square it:

(f̂(x) − f(x))²

= (f̂(x) − μ(x) + μ(x) − f(x))²

Expand:

= (f̂(x) − μ(x))² + 2(f̂(x) − μ(x))(μ(x) − f(x)) + (μ(x) − f(x))²

Now take expectation over D.

The last term is constant with respect to D:

E_D[(μ(x) − f(x))²] = (μ(x) − f(x))²

The first term is the variance definition:

E_D[(f̂(x) − μ(x))²] = Var_D(f̂(x))

The middle term disappears because E_D[f̂(x) − μ(x)] = 0:

E_D[2(f̂(x) − μ(x))(μ(x) − f(x))]

= 2(μ(x) − f(x)) E_D[f̂(x) − μ(x)]

= 2(μ(x) − f(x)) · 0

= 0

So we get:

E_D[(f̂(x) − f(x))²]

= Var_D(f̂(x)) + (μ(x) − f(x))²

But μ(x) − f(x) is exactly Bias(x). Therefore:

E_D[(f̂(x) − f(x))²] = Var(x) + (Bias(x))²

Step 3: combine with noise #

Recall:

E_D,ε[(f̂(x) − y)²] = E_D[(f̂(x) − f(x))²] + σ²

So:

E_D,ε[(f̂(x) − y)²]

= (Bias(x))² + Var(x) + σ²
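A quick numerical check can make the identity feel less abstract. The sketch below uses a deliberately simple, hypothetical learner that is not from the original text: the training set is n noisy observations of f(x) at the same x, and the predictor is a shrunken sample mean `c * mean`. All constants are assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f_x, sigma, n, c = 2.0, 1.0, 5, 0.8   # hypothetical values for illustration
trials = 200_000

# Each row is one training set: n noisy observations of f(x) at the same x.
train = f_x + rng.normal(0.0, sigma, size=(trials, n))
f_hat = c * train.mean(axis=1)                       # shrunken-mean predictor
y_test = f_x + rng.normal(0.0, sigma, size=trials)   # independent test labels

mse = np.mean((f_hat - y_test) ** 2)      # left-hand side, estimated directly
bias_sq = (f_hat.mean() - f_x) ** 2       # (Bias(x))^2
variance = f_hat.var()                    # Var(x)
print(f"MSE ~ {mse:.3f}")
print(f"bias^2 + variance + sigma^2 ~ {bias_sq + variance + sigma**2:.3f}")
```

For this learner the terms are also available in closed form: Bias = (c − 1)·f(x) = −0.4, Var = c²σ²/n = 0.128, so the expected squared error is 0.16 + 0.128 + 1 = 1.288, and both printed numbers should land close to that value.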

What this decomposition does (and does not) say #

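This is a conceptual decomposition of expected test error at x. It does not mean you can directly look at your dataset and perfectly measure bias and variance without extra assumptions. But it does tell you that expected error has three distinct sources (systematic mismatch, sensitivity to the training sample, and label noise), and that only the first two depend on your modeling choices.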

Why squared error matters #

This exact decomposition is for squared loss (and closely related losses). The nice algebra comes from expanding squares. For other losses (like 0–1 classification error), there are analogs but not such a clean three-term identity.

A brief geometric intuition #

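At fixed x, imagine f̂(x) as a point on the number line that changes with D. Over many training sets these points form a cloud: bias is the distance from the cloud's center μ(x) to the target f(x), variance is the spread of the cloud around its center, and noise is the extra scatter of the test label y around f(x).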

Core Mechanic 2: How Model Complexity and Data Control Bias and Variance #

Why a “tradeoff” appears #

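You might hope to reduce bias and variance simultaneously. Sometimes you can (e.g., better features, more data). But often you face a real tension: making a model more flexible lets it track f more closely (lower bias) but also lets it react more to the particular training sample (higher variance), while constraining it does the reverse.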

This is not a law of nature for every method, but it is a persistent pattern in many learning setups.

A mental experiment: retraining on different datasets #

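Fix x. Suppose you repeatedly sample datasets D₁, D₂, … and train the same algorithm. If the algorithm is flexible, f̂(x) can move a lot from one dataset to the next → higher variance. If it is rigid, f̂(x) barely moves → lower variance.

But rigidity often means the model cannot represent f well → higher bias.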

Underfitting vs overfitting through the decomposition #

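It helps to connect the decomposition to the typical learning-curve story. Underfitting (bias dominates) shows up as high training error and high validation error. Overfitting (variance dominates) shows up as low training error with a noticeably higher validation error.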

A common misconception is that “overfitting means the model is too complex.” Complexity is one route to variance, but variance is about sensitivity, not about complexity alone.

What knobs change bias and variance? #

Below is a practical map. Effects can depend on the algorithm and data regime, but these are reliable first approximations.

| Intervention | Typical effect on Bias | Typical effect on Variance | Notes |
|---|---|---|---|
| Increase model capacity (more parameters, higher-degree polynomial, deeper tree) | ↓ | ↑ | Lower bias, higher variance risk |
| Increase regularization (L2/L1, pruning, dropout) | ↑ | ↓ | Makes solutions more stable |
| More training data | ↔ or ↓ | ↓ | Usually reduces variance strongly |
| Better features / representation | ↓ | ↔ or ↓ | Can reduce bias without increasing variance much |
| Early stopping (iterative learners) | ↑ | ↓ | Acts like regularization |
| Bagging / averaging multiple models | ↔ | ↓ | Reduces variance by averaging |
| Boosting (often) | ↓ | can ↑ or ↓ | Often reduces bias; variance behavior depends |

A concrete anchor: k-nearest neighbors (KNN) #

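KNN is a classic bias–variance illustration. With small k (e.g., k = 1), the prediction follows the nearest training points closely: low bias, high variance. With large k, the prediction averages over a wide neighborhood: lower variance, but higher bias because real local structure is smoothed away.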

Here the “complexity knob” is k (smaller k → more flexible).
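The effect of k can be observed in a small simulation. The following is a minimal sketch, assuming a one-dimensional input, a hypothetical target `sin(2πx)`, and made-up noise and sample-size settings; `knn_predict` and `knn_bias_variance` are illustrative helper names, not functions from any library.

```python
import numpy as np

def knn_predict(x_train, y_train, x0, k):
    """Average the labels of the k training points nearest to x0 (1-D inputs)."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

def knn_bias_variance(k, x0=0.25, n_train=50, noise_sd=0.3,
                      n_datasets=2000, seed=0):
    """Monte Carlo estimate of (bias, variance) of KNN regression at x0."""
    rng = np.random.default_rng(seed)
    true_f = lambda x: np.sin(2 * np.pi * x)   # assumed target, for illustration
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = true_f(x) + rng.normal(0.0, noise_sd, size=n_train)
        preds[i] = knn_predict(x, y, x0, k)
    return preds.mean() - true_f(x0), preds.var()

for k in (1, 5, 25):
    b, v = knn_bias_variance(k)
    print(f"k={k:2d}  bias^2={b**2:.4f}  variance={v:.4f}")
```

Under these assumptions, k = 1 should show the largest variance and almost no bias, while k = 25 averages over half the training set, which shrinks the variance but pulls the prediction well below the peak of the curve.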

Another anchor: polynomial regression #

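Suppose you fit y as a polynomial of degree d. A small d (e.g., a straight line) cannot bend to follow a curved f: high bias, low variance. A large d can follow f closely but also chases noise in the training set: low bias, high variance.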

If you also add regularization, you can increase d (potentially reduce bias) while controlling variance.
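A degree sweep shows the same pattern numerically. This is a rough sketch under assumed conditions (target `sin(πx)` on [−1, 1], 30 training points, noise standard deviation 0.5, unregularized least squares via `np.polyfit`); none of these settings come from the original text.

```python
import numpy as np

def poly_bias_variance(degree, x0=0.5, n_train=30, noise_sd=0.5,
                       n_datasets=1000, seed=0):
    """Monte Carlo estimate of (bias^2, variance) at x0 for a degree-d fit."""
    rng = np.random.default_rng(seed)
    true_f = lambda x: np.sin(np.pi * x)   # assumed target, for illustration
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        # Fresh training set for every repetition.
        x = rng.uniform(-1.0, 1.0, size=n_train)
        y = true_f(x) + rng.normal(0.0, noise_sd, size=n_train)
        preds[i] = np.polyval(np.polyfit(x, y, deg=degree), x0)
    return (preds.mean() - true_f(x0)) ** 2, preds.var()

results = {d: poly_bias_variance(d) for d in (1, 3, 9)}
for d, (bias_sq, variance) in results.items():
    print(f"degree={d}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

The straight line (degree 1) should carry most of its error as squared bias, while degree 9 should have little bias but a visibly larger variance at the same input.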

Why variance falls with averaging (ensembles) #

A useful fact: averaging reduces variance when errors are not perfectly correlated.

Let Z₁, …, Z_M be random predictions (at x) from M models with the same mean and variance.

Define the average prediction:

Z̄ = (1/M) ∑ᵢ Zᵢ

If the Zᵢ are independent with Var(Zᵢ) = v, then:

Var(Z̄) = Var((1/M) ∑ᵢ Zᵢ)

= (1/M²) ∑ᵢ Var(Zᵢ)

= (1/M²) · M · v

= v/M

In practice, models are correlated, so you don’t get v/M, but you still often get a meaningful reduction. This is the core reason bagging and random forests reduce variance.
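The v/M result for independent predictions is easy to confirm by simulation. The sketch below assumes normally distributed predictions with a hypothetical per-model variance v = 4; the distribution and the numbers are illustrative choices only.

```python
import numpy as np

def variance_of_average(M, v=4.0, trials=200_000, seed=0):
    """Empirical variance of the mean of M independent predictions."""
    rng = np.random.default_rng(seed)
    # Each row holds M independent predictions with mean 0 and variance v.
    Z = rng.normal(0.0, np.sqrt(v), size=(trials, M))
    return Z.mean(axis=1).var()

for M in (1, 4, 16):
    print(f"M={M:2d}  empirical={variance_of_average(M):.3f}  v/M={4.0 / M:.3f}")
```

The empirical column should track v/M closely. With correlated predictions the reduction is smaller, which is why the independence assumption matters here.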

Bias and variance vary over x #

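A final subtlety: bias and variance are functions of x, not single numbers. Where training data is dense, predictions tend to be stable; where it is sparse (often near the edges), the same model can have much higher variance.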

This explains why some models behave nicely “in the middle” but become unstable near the boundaries of the input domain.

Application/Connection: Diagnosing Generalization and Choosing Evaluation Tools #

Why this matters in practice #

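You rarely get to compute Bias(x), Var(x), or σ² directly. What you can do is compare training error with validation error, and watch how both respond as you change model capacity, regularization, or the amount of training data.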

Bias–variance tradeoff is a diagnostic story that guides which lever to pull next.

From decomposition to workflow #

A common iterative workflow:

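  1. Start simple and get a baseline.
  2. Check training vs validation error.
  3. If both are high → likely high bias.
  4. If training is low but validation is high → likely high variance.
  5. Adjust: for high bias, increase capacity, reduce regularization, or improve features; for high variance, add data, add regularization, simplify the model, or use bagging/ensembles.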

This is not a proof—just a principled heuristic.
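The decision steps in this workflow can be written down as a small helper. The function below is a hypothetical sketch, not an established API: `acceptable_err` (what counts as "high" error) and `gap_tol` (how large a train/validation gap counts as overfitting) are thresholds the user would have to choose for their own problem.

```python
def diagnose(train_err, val_err, acceptable_err, gap_tol=0.1):
    """Rule-of-thumb diagnosis from training and validation error.

    acceptable_err and gap_tol are user-chosen, problem-specific thresholds.
    """
    if train_err > acceptable_err:
        # The model cannot even fit the data it has seen.
        return "likely high bias: increase capacity, improve features, or reduce regularization"
    if val_err - train_err > gap_tol:
        # Good fit on seen data, poor fit on unseen data.
        return "likely high variance: add data, regularize, simplify, or use bagging"
    return "reasonable balance: remaining error may be noise"

print(diagnose(train_err=0.40, val_err=0.45, acceptable_err=0.20))
print(diagnose(train_err=0.05, val_err=0.35, acceptable_err=0.20))
print(diagnose(train_err=0.10, val_err=0.12, acceptable_err=0.20))
```

The three calls correspond to the underfitting, overfitting, and balanced cases in turn. Like the workflow itself, this is a heuristic and not a measurement of bias or variance.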

Where cross-validation fits #

Cross-validation (CV) is a way to approximate the expectation over datasets D.

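Remember: variance is about how much f̂ changes when D changes. CV does something related: it retrains the model on several overlapping subsets of the data and scores each fit on the fold it did not see.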

While CV doesn’t directly compute Var_D(f̂(x)), it approximates expected generalization error, which includes both bias and variance effects.

In other words, CV is your practical handle on the left-hand side:

E_D,ε[(f̂(x) − y)²]

averaged over x.
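A bare-bones k-fold procedure can be sketched in a few lines of NumPy. This is an illustrative implementation under assumed settings (polynomial regression via `np.polyfit`, synthetic data from `sin(πx)` plus noise with standard deviation 0.3); `kfold_mse` is a hypothetical helper name, and a real project would more likely use a library routine.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average held-out squared error of a degree-`degree` polynomial fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # k disjoint index sets
    fold_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        coeffs = np.polyfit(x[train], y[train], deg=degree)   # fit on k-1 folds
        residuals = np.polyval(coeffs, x[held_out]) - y[held_out]
        fold_errors.append(np.mean(residuals ** 2))           # score on held-out fold
    return float(np.mean(fold_errors))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, size=200)   # synthetic data
for degree in (1, 5):
    print(f"degree={degree}  CV MSE ~ {kfold_mse(x, y, degree):.3f}")
```

On this synthetic data the degree-5 fit should score close to the noise variance (0.09), while the straight line should score noticeably worse because of its bias.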

Where ensembles fit #

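Ensembles (bagging, random forests) are practical tools to reduce variance: they train many models on resampled versions of the training set and average their predictions, so the individual models' fluctuations partly cancel.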

This links directly to the variance term in the decomposition.
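To see the variance term shrink, one can bag a deliberately unstable learner. The sketch below uses 1-nearest-neighbor regression as the base model, which is an assumption made for illustration (the text names bagging and random forests, not this specific learner); the target function, noise level, and ensemble size are likewise hypothetical.

```python
import numpy as np

def one_nn(x_train, y_train, x0):
    """1-nearest-neighbor prediction: a low-bias, high-variance learner."""
    return y_train[np.argmin(np.abs(x_train - x0))]

def bagged_one_nn(x_train, y_train, x0, n_models, rng):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    n = len(x_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # sample n points with replacement
        preds.append(one_nn(x_train[idx], y_train[idx], x0))
    return np.mean(preds)

def compare(x0=0.3, n_train=40, noise_sd=0.5, n_datasets=500,
            n_models=25, seed=0):
    """Variance at x0 of a single 1-NN model vs a bagged ensemble."""
    rng = np.random.default_rng(seed)
    single, bagged = [], []
    for _ in range(n_datasets):
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_sd, size=n_train)
        single.append(one_nn(x, y, x0))
        bagged.append(bagged_one_nn(x, y, x0, n_models, rng))
    return np.var(single), np.var(bagged)

var_single, var_bagged = compare()
print(f"single 1-NN variance ~ {var_single:.3f}")
print(f"bagged 1-NN variance ~ {var_bagged:.3f}")
```

The bagged variance should come out well below the single-model variance, though not as low as v/M, because models trained on resamples of the same data are correlated.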

The irreducible noise term and “Bayes error” intuition #

Even with a perfect predictor f(x), your expected squared error is still σ².

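This matters when you’re stuck: if validation error has plateaued and you suspect it is close to σ², further tuning of capacity or regularization will not help much. Progress then requires changing the data itself, for example through a better measurement process.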

In classification, the analogous notion is that if classes overlap intrinsically, there is a minimum achievable error (often called Bayes error). The exact decomposition differs, but the same moral applies: some uncertainty is built into the world.

A note on squared bias #

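The decomposition uses (Bias(x))², not Bias(x). This means the sign of the bias does not matter (over- and under-prediction are penalized equally), and large biases cost disproportionately more than small ones.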

Practical signals (imperfect but useful) #

| Observation | Likely issue | Typical next move |
|---|---|---|
| Training error high; validation error high | high bias | increase capacity, reduce regularization, improve features |
| Training error low; validation error high | high variance | more data, regularization, bagging/ensemble, simplify |
| Training error low; validation error low | good balance | consider whether noise limits further gains |
| Training error high; validation error lower (rare) | training procedure mismatch | check data leakage, metric, preprocessing, optimization |

Closing the loop #

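Bias–variance is not just theory. It’s the bridge between the math of expected error and the day-to-day decisions of model selection and tuning.

And it motivates the next nodes you’ll unlock: Cross-Validation (estimating generalization error) and Ensemble Methods (reducing variance by averaging).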

Worked Examples (3) #

Example 1: Compute bias², variance, and noise at a single x from repeated training #

Suppose the true function value at a particular input is f(x) = 2.

You train the same learning algorithm on many independent training sets, and you observe the learned prediction at this x is a random variable f̂(x) that takes values:

    • 1 with probability 1/2
    • 3 with probability 1/2

Assume the observation noise in the test label is ε with E[ε] = 0 and Var(ε) = σ² = 1, so y = f(x) + ε.

Compute Bias(x), (Bias(x))², Var(x), and the expected test MSE at x.

  1. Compute the expected prediction μ(x) = E_D[f̂(x)]:

    μ(x) = (1)(1/2) + (3)(1/2)

    = (1/2) + (3/2)

    = 2

  2. Compute Bias(x) = μ(x) − f(x):

    Bias(x) = 2 − 2 = 0

    So (Bias(x))² = 0² = 0

  3. Compute Var(x) = E[(f̂(x) − μ(x))²]:

    Possible deviations from μ = 2 are:

    • if f̂ = 1, then (1 − 2)² = 1
    • if f̂ = 3, then (3 − 2)² = 1

    Thus:

    Var(x) = (1)(1/2) + (1)(1/2) = 1

  4. Use the decomposition:

    E[(f̂(x) − y)²] = (Bias(x))² + Var(x) + σ²

    = 0 + 1 + 1

    = 2

Insight: Even though the predictor is unbiased at this x (bias = 0), it is unstable (variance = 1). Averaging multiple such predictors (an ensemble) could reduce the variance term and improve test error, but you can’t beat the noise floor σ² = 1.
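The arithmetic in this example can be reproduced in a few lines. The snippet below simply encodes the numbers from the example (predictions 1 and 3 with probability 1/2 each, f(x) = 2, σ² = 1); the variable names are illustrative.

```python
import numpy as np

values = np.array([1.0, 3.0])   # possible predictions f_hat(x)
probs = np.array([0.5, 0.5])    # their probabilities across training sets
f_x, sigma_sq = 2.0, 1.0        # true value and noise variance

mu = float(np.sum(probs * values))                      # expected prediction
bias = mu - f_x                                         # Bias(x)
variance = float(np.sum(probs * (values - mu) ** 2))    # Var(x)
expected_mse = bias ** 2 + variance + sigma_sq
print(bias, variance, expected_mse)  # 0.0 1.0 2.0
```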

Example 2: Derive the decomposition step-by-step (showing where the cross-term vanishes) #

Assume y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and ε is independent of the training set D.

Let f̂(x) be the learned predictor from D.

Show that:

E_D,ε[(f̂(x) − y)²] = (E_D[f̂(x)] − f(x))² + Var_D(f̂(x)) + σ².

  1. Start with the test squared error:

    E[(f̂(x) − y)²]

    Substitute y = f(x) + ε:

    E[(f̂(x) − f(x) − ε)²]

  2. Expand the square:

    (f̂ − f − ε)²

    = (f̂ − f)² − 2ε(f̂ − f) + ε²

    Take expectation:

    E[(f̂ − f)²] − 2E[ε(f̂ − f)] + E[ε²]

  3. Show the middle term is 0:

    Because ε is independent of D (hence of f̂) and E[ε] = 0,

    E[ε(f̂ − f)] = E[ε]·E[f̂ − f] = 0

  4. So the expression becomes:

    E[(f̂ − f)²] + E[ε²]

    And E[ε²] = Var(ε) = σ²

  5. Now decompose E[(f̂ − f)²] by adding and subtracting μ = E[f̂]:

    Let μ = E_D[f̂(x)]. Then

    f̂ − f = (f̂ − μ) + (μ − f)

    Square:

    (f̂ − f)² = (f̂ − μ)² + 2(f̂ − μ)(μ − f) + (μ − f)²

  6. Take expectation over D:

    E[(f̂ − μ)²] + 2(μ − f)E[f̂ − μ] + (μ − f)²

    But E[f̂ − μ] = 0, so the cross-term vanishes.

  7. Recognize terms:

    E[(f̂ − μ)²] = Var_D(f̂(x))

    (μ − f) = Bias(x)

    So E[(f̂ − f)²] = Var_D(f̂(x)) + (Bias(x))²

  8. Combine with noise:

    E[(f̂ − y)²] = (Bias(x))² + Var_D(f̂(x)) + σ²

Insight: The entire decomposition hinges on two ideas: (1) squared loss lets you expand and regroup terms cleanly, and (2) the cross-term disappears because deviations around the mean have zero expectation.

Example 3: How averaging two correlated models affects variance #

At a fixed x, suppose two learned models produce predictions Z₁ and Z₂ with:

E[Z₁] = E[Z₂] = m,

Var(Z₁) = Var(Z₂) = v,

Corr(Z₁, Z₂) = ρ.

Let the ensemble prediction be Z̄ = (Z₁ + Z₂)/2.

Compute Var(Z̄) in terms of v and ρ.

  1. Use the variance formula:

    Var(Z̄) = Var((Z₁ + Z₂)/2)

    = (1/4) Var(Z₁ + Z₂)

  2. Expand Var(Z₁ + Z₂):

    Var(Z₁ + Z₂) = Var(Z₁) + Var(Z₂) + 2Cov(Z₁, Z₂)

    = v + v + 2Cov(Z₁, Z₂)

    = 2v + 2Cov(Z₁, Z₂)

  3. Convert correlation to covariance:

    Corr(Z₁, Z₂) = ρ = Cov(Z₁, Z₂) / (√v √v) = Cov(Z₁, Z₂) / v

    So Cov(Z₁, Z₂) = ρv

  4. Substitute:

    Var(Z₁ + Z₂) = 2v + 2ρv = 2v(1 + ρ)

  5. Thus:

    Var(Z̄) = (1/4) · 2v(1 + ρ)

    = (v/2)(1 + ρ)

Insight: Averaging helps most when models are less correlated. If ρ = 1 (perfectly correlated), Var(Z̄) = v (no gain). If ρ = 0, Var(Z̄) = v/2. Random forests work partly by reducing correlation between trees, making variance reduction from averaging more effective.

Key Takeaways #

Common Mistakes #

Practice #

easy

At a fixed x, suppose f(x) = 5 and the learned predictor satisfies E_D[f̂(x)] = 4 with Var_D(f̂(x)) = 2. The noise variance is σ² = 3. Compute the expected test MSE at x under squared loss.

Hint: Use E[(f̂ − y)²] = (E[f̂] − f)² + Var(f̂) + σ².

Solution:

Bias(x) = E[f̂(x)] − f(x) = 4 − 5 = −1

(Bias(x))² = 1

Var(x) = 2

σ² = 3

So expected MSE = 1 + 2 + 3 = 6.

hard

You average M predictors at a fixed x: Z̄ = (1/M)∑ᵢ Zᵢ. Assume each has Var(Zᵢ) = v and pairwise correlation Corr(Zᵢ, Zⱼ) = ρ for i ≠ j. Derive Var(Z̄).

Hint: Use Var(∑ Zᵢ) = ∑ Var(Zᵢ) + 2∑_{i<j} Cov(Zᵢ, Zⱼ) and Cov = ρv.

Solution:

Var(Z̄) = Var((1/M)∑ᵢ Zᵢ) = (1/M²)Var(∑ᵢ Zᵢ)

Var(∑ᵢ Zᵢ) = ∑ᵢ Var(Zᵢ) + 2∑_{i<j} Cov(Zᵢ, Zⱼ)

= Mv + 2·(number of pairs)·(ρv)

Number of pairs = M(M−1)/2

So Var(∑ᵢ Zᵢ) = Mv + 2·(M(M−1)/2)·ρv = Mv + M(M−1)ρv

Therefore:

Var(Z̄) = (1/M²)[Mv + M(M−1)ρv]

= (v/M) + ((M−1)/M)ρv

= v( (1−ρ)/M + ρ )

Checks: if ρ = 0 → v/M; if ρ = 1 → v.
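The closed form can also be checked against a direct covariance-matrix computation, since Var(Z̄) = wᵀΣw with equal weights w = (1/M, …, 1/M). The helper names below are hypothetical, and the test values of M, v, and ρ are arbitrary.

```python
import numpy as np

def var_of_mean_exact(M, v, rho):
    """Var of the average of M predictions, computed from the covariance matrix."""
    cov = np.full((M, M), rho * v)   # off-diagonal: Cov(Zi, Zj) = rho * v
    np.fill_diagonal(cov, v)         # diagonal: Var(Zi) = v
    w = np.full(M, 1.0 / M)          # equal averaging weights
    return float(w @ cov @ w)

def var_of_mean_formula(M, v, rho):
    """Closed form derived above: v * ((1 - rho) / M + rho)."""
    return v * ((1 - rho) / M + rho)

for M, v, rho in [(2, 1.0, 0.5), (10, 3.0, 0.2), (50, 2.0, 0.0)]:
    print(M, v, rho, var_of_mean_exact(M, v, rho), var_of_mean_formula(M, v, rho))
```

For M = 2 the formula reduces to (v/2)(1 + ρ), matching Example 3. As M grows the variance approaches ρv rather than zero, so correlation sets a floor on what averaging can achieve.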

medium

A learner compares two models A and B. Model A has higher training error than B but similar validation error; model B has very low training error but a validation error noticeably worse than its training error. Using bias–variance language, diagnose A vs B and propose one concrete change to improve each.

Hint: High bias shows up as high training and validation error; high variance shows up as low training but high validation error.

Solution:

Model A: higher training error suggests it may be underfitting (higher bias). If validation error is similar to B, A may be too simple or too regularized. Improvement: increase capacity (e.g., more features, deeper model) or reduce regularization.

Model B: very low training error but worse validation error indicates overfitting (higher variance). Improvement: add regularization (e.g., L2, dropout, pruning), simplify the model, gather more data, or use bagging/ensembling.

Connections #

Cross-Validation

Ensemble Methods

Regularization (Ridge/Lasso)

Learning Curves

Decision Trees
