Ensemble Methods

←Back to Tech Tree

inventorycoverage

Ensemble Methods #

Machine LearningDifficulty: ★★★★☆Depth: 10Unlocks: 0

Combining multiple models. Bagging, boosting, random forests.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

H(x) - ensemble predictionh_i(x) - prediction from the i-th base learner

Essential Relationships #

Prerequisites (2) #

Decision Trees6 atomsBias-Variance Tradeoff6 atoms

Advanced Learning Details

Graph Position #

129

Depth Cost

0

Fan-Out (ROI)

0

Bottleneck Score

10

Chain Length

Cognitive Load #

6

Atomic Elements

60

Total Elements

L4

Percentile Level

L4

Atomic Level

All Concepts (25) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

A single decision tree can be brilliant—and also wildly unstable. Ensemble methods tame that instability (and sometimes push accuracy beyond what any single model can reach) by combining many “weak” or “noisy” models into one strong predictor.

TL;DR:

Ensemble methods build a final predictor H(x) by combining base learners hᵢ(x). Bagging trains many models in parallel on bootstrap-resampled data and averages/votes to reduce variance (e.g., random forests). Boosting trains models sequentially, each focusing on previous mistakes, and combines them as a weighted sum to reduce bias (e.g., AdaBoost, Gradient Boosting).

What Is an Ensemble Method? #

Why ensembles exist (motivation) #

Many ML models have a frustrating property: small changes in the training data can cause large changes in the learned model. Decision trees are a classic example—if you perturb the dataset slightly, the splits can change, leading to very different trees.

Ensemble methods are a systematic way to convert:

The core idea is surprisingly simple: instead of trusting one model, train multiple base learners and aggregate their predictions.

Definition #

An ensemble method constructs a predictor:

and defines H(x) as an aggregation of the hᵢ(x).

Typical forms:

Intuition: “wisdom of the crowd” needs diversity #

If every base learner makes the same mistakes, the ensemble gains nothing.

Ensembles work when base learners are:

  1. 1)Individually better than random (even slightly)
  2. 2)Diverse (errors not perfectly correlated)

A helpful mental model: if each model has some error “noise,” averaging cancels noise—but only if those noises aren’t identical.

A variance-reduction sketch (why averaging helps) #

Suppose we do regression and each model predicts:

hᵢ(x) = f(x) + εᵢ

where f(x) is the true target function (or the expected prediction) and εᵢ is a zero-mean error term.

The ensemble average is:

H(x) = (1/M) ∑ᵢ hᵢ(x)

= (1/M) ∑ᵢ (f(x) + εᵢ)

= f(x) + (1/M) ∑ᵢ εᵢ

So the ensemble error is (1/M) ∑ᵢ εᵢ.

If the εᵢ are independent with Var(εᵢ) = σ², then:

Var(H(x) − f(x))

= Var((1/M) ∑ᵢ εᵢ)

= (1/M²) ∑ᵢ Var(εᵢ)

= (1/M²) · M · σ²

= σ² / M

That 1/M is the dream scenario.

If errors are correlated, you don’t get the full 1/M; this is why creating diversity is a central design goal.

Two big families #

Ensemble methods mainly fall into two camps:

FamilyTraining styleMain goalCommon examples
BaggingParallel (independent)Reduce varianceBagging, Random Forests
BoostingSequential (dependent)Reduce bias (and sometimes variance)AdaBoost, Gradient Boosting, XGBoost-style methods

This lesson focuses on: aggregation, bagging, random forests, and boosting—what they optimize, why they work, and how to reason about their behavior using the bias–variance lens.

Core Mechanic 1: Ensemble Aggregation (Voting, Averaging, Weighting) #

Why talk about aggregation separately? #

Training many models is only half the story. The other half is: how do we combine them so the result is better than each one?

Aggregation defines how information flows from base learners hᵢ(x) to the final ensemble H(x). Different choices correspond to different assumptions about:

Common aggregation rules #

1) Regression: simple averaging #

If hᵢ(x) outputs a real number (e.g., house price), the default aggregation is:

H(x) = (1/M) ∑ᵢ hᵢ(x)

Why this makes sense:

2) Classification: majority voting #

If each hᵢ(x) outputs a class label in {0,1} (or {1,…,K}), a typical ensemble is:

H(x) = mode{h₁(x), …, h_M(x)}

Why this can help:

3) Probability averaging (soft voting) #

Often, classifiers output estimated probabilities pᵢ(y=k | x). Then you can average probabilities and choose argmax:

p̄(y=k | x) = (1/M) ∑ᵢ pᵢ(y=k | x)

H(x) = argmaxₖ p̄(y=k | x)

Soft voting tends to be more stable than hard voting because it uses confidence information.

4) Weighted combinations #

Sometimes models shouldn’t be equal. Boosting is the archetype:

H(x) = sign(∑ᵢ αᵢ hᵢ(x)) (binary classification with hᵢ(x) ∈ {−1, +1})

or for regression:

H(x) = ∑ᵢ αᵢ hᵢ(x)

Here αᵢ is learned from performance (later models may get different weights).

Diversity: the hidden requirement #

Aggregation works best when base learners disagree in useful ways.

How do we create diversity?

A key point: diversity that increases error is not helpful. We want “decorrelated” errors while keeping each hᵢ(x) reasonably accurate.

Bias–variance lens for aggregation #

You already know the bias–variance tradeoff. Aggregation changes that balance:

A simplified but useful heuristic:

A small derivation: averaging with correlated errors #

Assume M models, each with error εᵢ, Var(εᵢ)=σ², and pairwise Corr(εᵢ, εⱼ)=ρ for i≠j.

Let H be the average. Then:

Var(H − f)

= Var((1/M) ∑ᵢ εᵢ)

= (1/M²) Var(∑ᵢ εᵢ)

Now expand Var of a sum:

Var(∑ᵢ εᵢ)

= ∑ᵢ Var(εᵢ) + 2∑_{i<j} Cov(εᵢ, εⱼ)

= Mσ² + 2 · (M(M−1)/2) · (ρσ²)

= Mσ² + M(M−1)ρσ²

So:

Var(H − f)

= (1/M²)[Mσ² + M(M−1)ρσ²]

= (σ²/M) + ( (M−1)/M ) ρσ²

As M → ∞, this approaches ρσ².

Interpretation:

This is why random forests deliberately reduce correlation (via feature subsampling), not just train many trees.

Core Mechanic 2: Bagging (Bootstrap Aggregation) and Random Forests #

Why bagging works #

Decision trees are typically:

Bagging targets that exact profile.

The plan:

  1. 1)Create many slightly different datasets via bootstrap resampling.
  2. 2)Train the same learner on each dataset (often an unpruned tree).
  3. 3)Aggregate predictions (vote/average).

You get a “committee” of trees that individually overfit in different ways, and the vote cancels much of that overfitting.

Bootstrap resampling (the source of diversity) #

Given training set D of size N, a bootstrap sample Dᵢ is created by sampling N points with replacement from D.

Key facts:

A useful calculation: expected fraction of unique points in a bootstrap sample.

Probability a specific point is not selected in one draw: 1 − 1/N.

Not selected in N draws:

(1 − 1/N)ᴺ ≈ e^{−1} ≈ 0.368

So expected fraction left out is ≈ 36.8%, and included is ≈ 63.2%.

Those “left out” points are the out-of-bag (OOB) set for that tree.

Bagging algorithm (conceptual) #

For m = 1..M:

Return ensemble H(x) = average/vote of h_m(x)

Out-of-bag evaluation (a free validation signal) #

Because each training point is excluded from about 36.8% of bootstrap samples, you can estimate generalization error without a separate validation set:

This yields an OOB error estimate that is often close to cross-validation for bagged trees.

Random forests: bagging + feature randomness #

Bagging alone reduces variance, but trees can still be correlated because strong predictors dominate splits.

Random forests add a second diversity lever:

If there are d features total, you choose m_try features per split:

Why it helps:

Random forest prediction #

Interpreting random forests via bias–variance #

Often the net result is better test performance.

Practical hyperparameters (what they do) #

ParameterEffect on biasEffect on varianceNotes
# trees M~ no change↓ (until plateau)More trees rarely overfit; cost is compute
max depth / min leaf↑ when constrainedDeep trees benefit from bagging; forests often use deep trees
m_trycan ↑ if too small↓ by reducing correlationToo small may underuse signal
bootstrap on/offsmalldiversity changesSome variants use subsampling without replacement

Feature importance (connection, but be cautious) #

Random forests often report feature importance:

This is a great tool, but not a proof of causality.

When bagging/forests shine #

When they struggle #

Core Mechanic 3: Boosting (Sequential Ensembles That Fix Mistakes) #

Why boosting exists (a different failure mode) #

Bagging is great when your base learner is already powerful but unstable.

What if your base learner is too weak and underfits?

Boosting attacks bias by building an additive model step-by-step, where each new learner focuses on what previous learners got wrong.

The central idea: additive modeling #

Boosting constructs:

H_M(x) = ∑_{m=1..M} α_m h_m(x)

The sequence matters: hₘ is chosen based on the errors of H_{m−1}.

Two big boosting views:

  1. 1)Reweighting view (AdaBoost-style): focus more on misclassified points.
  2. 2)Gradient view (Gradient Boosting): each learner fits the negative gradient (residual-like signal) of a loss function.

We’ll sketch both, because they illuminate the “why.”


AdaBoost intuition (binary classification) #

Assume labels y ∈ {−1, +1}, and weak learners output h(x) ∈ {−1, +1}.

AdaBoost maintains a weight distribution over training points:

At step m, weighted error is:

err_m = ∑ᵢ wᵢ [h_m(xᵢ) ≠ yᵢ]

Then compute learner weight:

α_m = (1/2) ln((1 − err_m)/err_m)

Update data weights:

wᵢ ← wᵢ · exp(−α_m yᵢ h_m(xᵢ))

Check what this does:

Finally normalize w so ∑ᵢ wᵢ = 1.

The final classifier:

H(x) = sign(∑ₘ α_m h_m(x))

Why it reduces bias: each stage targets the “hard” points not yet handled.

Risk: if you keep adding learners, you can start fitting noise (variance can rise), especially with very complex base learners.


Gradient Boosting intuition (general loss) #

Gradient boosting generalizes boosting to arbitrary differentiable losses.

We want to minimize empirical risk:

L(H) = ∑ᵢ ℓ(yᵢ, H(xᵢ))

We build H as an additive model:

H_m(x) = H_{m−1}(x) + η · f_m(x)

where f_m is a new base learner (often a small tree), and η ∈ (0,1] is a learning rate.

Why the “gradient” appears #

We want to choose f_m that most reduces L. In function space, the steepest descent direction is the negative gradient.

Define pseudo-residuals:

rᵢ^{(m)} = − ∂ℓ(yᵢ, H(xᵢ)) / ∂H(xᵢ) evaluated at H = H_{m−1}

Then fit the next learner to these residuals:

So gradient boosting is like:

Example: squared error regression #

Let ℓ(y, H) = (1/2)(y − H)².

Compute:

∂ℓ/∂H = ∂(1/2)(y − H)² / ∂H

= (1/2) · 2(y − H) · (−1)

= −(y − H)

So the negative gradient is:

r = −(∂ℓ/∂H) = y − H

That’s exactly the residual.

Thus gradient boosting with squared loss repeatedly fits residuals.

Why small trees (stumps) are common #

Boosting often uses weak learners such as:

They have high bias individually, but the additive combination builds complexity gradually. This helps control overfitting and makes training more stable.

Regularization knobs in boosting #

KnobWhat it doesTypical effect
Learning rate ηscales each stepsmaller η needs more trees but often generalizes better
# estimators Mnumber of stepstoo large can overfit unless η small / trees small
Tree depthbase learner complexitydeeper trees fit interactions faster but risk overfitting
Subsample (stochastic GB)fit each step on data subsamplereduces variance, adds randomness

Bias–variance view #

Boosting is often described as:

This is why boosting tends to be powerful but sensitive to tuning.

Application/Connection: Choosing Between Bagging, Random Forests, and Boosting #

Start from the problem: what’s failing? #

Use the bias–variance diagnosis you already know.

This is not absolute, but it’s a strong default.

A decision table #

SituationSymptomsGood first choiceWhy
Tree is unstabletrain accuracy high, test low; results change with small data changesBaggingaveraging cancels instability
Many features, strong predictors dominatebagged trees still similarRandom Forestfeature subsampling reduces correlation
Need top accuracy on tabular datalinear/logistic underfits; interactions matterGradient Boostingsequential correction builds complex function
Need minimal tuningwant strong baseline fastRandom Forestrobust defaults, less sensitive
Need calibrated probabilitiesraw tree votes not calibratedGB + calibrationboosting can improve ranking; calibrate separately

Interpreting H(x) in practice #

Even though an ensemble is “many models,” you still ask:

For random forests, stability increases with M; for boosting, stability depends on regularization.

Computational tradeoffs #

MethodParallelizable?Typical training costTypical inference cost
BaggingYesM independent fitsM predictions
Random ForestYesM independent fits + split feature samplingM predictions
BoostingNot easily (sequential)M sequential fitsM predictions

Boosting’s sequential nature is the price you pay for its targeted bias reduction.

Connection to “model averaging” beyond trees #

Ensembles are not limited to trees.

However, trees are a perfect match because:

A note on stacking (beyond this node) #

Another ensemble family is stacking, where a meta-model learns how to combine base learners. It’s powerful, but adds complexity and risks leakage. For this node, focus on the foundational trio: bagging, random forests, boosting.

Practical checklist #

Before choosing/tuning ensembles:

  1. 1)Ensure train/test split is correct (time series needs time-aware split).
  2. 2)Confirm metric (accuracy vs AUC vs RMSE) matches the goal.
  3. 3)For boosting: start with small depth, small η, then increase M.
  4. 4)For forests: increase M until performance plateaus; tune m_try and leaf size.

Where this node leads #

Understanding ensembles unlocks:

Worked Examples (3) #

Example 1: Majority Vote vs Probability Averaging (Why “Soft Voting” Can Differ) #

Binary classification with three base learners. Each outputs a probability for class 1 at x. Convert to class labels with threshold 0.5.

Model outputs:

  1. Convert each probability to a hard class prediction with threshold 0.5:

    h₁(x) = 1 (since 0.51 ≥ 0.5)

    h₂(x) = 1 (since 0.99 ≥ 0.5)

    h₃(x) = 0 (since 0.01 < 0.5)

  2. Hard majority vote:

    Votes for class 1: 2

    Votes for class 0: 1

    So H_hard(x) = 1

  3. Soft voting (average probabilities):

    p̄ = (1/3)(0.51 + 0.99 + 0.01)

    = (1/3)(1.51)

    ≈ 0.5033

  4. Apply threshold 0.5 to averaged probability:

    H_soft(x) = 1 (since 0.5033 ≥ 0.5)

  5. Now consider a tiny change: p₁ becomes 0.49 (barely flips).

    Hard predictions become:

    • •h₁(x)=0
    • •h₂(x)=1
    • •h₃(x)=0

    Hard vote ⇒ H_hard(x)=0 (2 votes for 0).

  6. Soft voting with new p₁:

    p̄ = (1/3)(0.49 + 0.99 + 0.01)

    = (1/3)(1.49)

    ≈ 0.4967

    Soft vote ⇒ H_soft(x)=0 (since 0.4967 < 0.5)

Insight: Hard voting is sensitive to thresholding each model first; soft voting uses confidence and can be more stable. In practice, soft voting often performs better when base learners are reasonably calibrated.

Example 2: How Random Forests Reduce Correlation (Using the Correlated-Error Variance Formula) #

Assume each tree’s prediction error (for a fixed x) has variance σ² = 9. Compare two ensembles of M = 100 trees:

A) Bagged trees with high correlation ρ = 0.3

B) Random forest trees with lower correlation ρ = 0.05

Use:

Var(H − f) = (σ²/M) + ((M−1)/M) ρ σ²

  1. Compute common quantities:

    M = 100

    σ² = 9

    (σ²/M) = 9/100 = 0.09

    ((M−1)/M) = 99/100 = 0.99

  2. Case A (ρ = 0.3):

    Var = 0.09 + 0.99 · 0.3 · 9

    = 0.09 + 0.99 · 2.7

    = 0.09 + 2.673

    = 2.763

  3. Case B (ρ = 0.05):

    Var = 0.09 + 0.99 · 0.05 · 9

    = 0.09 + 0.99 · 0.45

    = 0.09 + 0.4455

    = 0.5355

  4. Compare:

    2.763 / 0.5355 ≈ 5.16

    So reducing correlation from 0.3 to 0.05 cuts variance by about 5×, even with the same number of trees and same per-tree variance.

Insight: Adding more trees helps, but reducing correlation ρ can matter even more. Random forests do exactly this via feature subsampling, especially at high-level splits.

Example 3: One Step of Gradient Boosting for Squared Error (Residual Fitting) #

We have three training points with scalar x and y:

(x₁, y₁) = (0, 1)

(x₂, y₂) = (1, 3)

(x₃, y₃) = (2, 2)

Start with constant model H₀(x) = 2 (already fit somehow). Use squared error ℓ = (1/2)(y − H)².

Compute residuals and fit a simple base learner f₁ that predicts the mean residual (a constant). Use learning rate η = 0.5.

  1. Compute residuals rᵢ = yᵢ − H₀(xᵢ):

    r₁ = 1 − 2 = −1

    r₂ = 3 − 2 = +1

    r₃ = 2 − 2 = 0

  2. Fit base learner f₁ as a constant equal to mean residual:

    mean(r) = (1/3)(−1 + 1 + 0) = 0

    So f₁(x) = 0 for all x

  3. Update the model:

    H₁(x) = H₀(x) + η f₁(x)

    = 2 + 0.5 · 0

    = 2

  4. Interpretation:

    The residuals had zero mean, so the best constant step is 0; a constant learner cannot improve H₀.

    If instead f₁ were a stump that split x (e.g., predict different residual means for x < 1.5 vs ≥ 1.5), it could make progress.

Insight: Gradient boosting only improves when the base learner can represent the structure in the residuals. This is why small trees (not just constants) are common: they can capture patterns in residuals and reduce loss step-by-step.

Key Takeaways #

Common Mistakes #

Practice #

easy

You train 200 bagged trees for regression. Each tree has error variance σ² = 4 and pairwise correlation ρ = 0.2. Use Var(H − f) = (σ²/M) + ((M−1)/M) ρ σ² to compute the ensemble error variance. Then compute the limiting variance as M → ∞.

Hint: Compute σ²/M and ((M−1)/M) first. The limit removes the σ²/M term and sets ((M−1)/M) → 1.

Show solution

M = 200, σ² = 4, ρ = 0.2.

σ²/M = 4/200 = 0.02

(M−1)/M = 199/200 = 0.995

Var = 0.02 + 0.995 · 0.2 · 4

= 0.02 + 0.995 · 0.8

= 0.02 + 0.796

= 0.816

As M → ∞:

Var → ρσ² = 0.2 · 4 = 0.8

medium

In a random forest for classification with d = 36 features, compare two settings: m_try = 36 (no feature subsampling) vs m_try = √d. What is √d here, and why might it improve test performance even though each split sees fewer features?

Hint: Compute √36. Then relate feature subsampling to reduced correlation ρ between trees.

Show solution

√d = √36 = 6.

With m_try = 36, every split considers all features, so many trees choose the same strong features near the top. Trees become more similar, increasing correlation ρ of their errors.

With m_try = 6, each split considers only 6 random features. Different trees are forced to explore different top splits, reducing correlation ρ. Even if individual trees become slightly weaker, the ensemble variance can drop enough to improve test performance.

medium

For squared error regression with loss ℓ = (1/2)(y − H)², derive the pseudo-residual r = −∂ℓ/∂H and show it equals the ordinary residual (y − H).

Hint: Differentiate (1/2)(y − H)² with respect to H using the chain rule.

Show solution

ℓ(y, H) = (1/2)(y − H)²

∂ℓ/∂H

= (1/2) · 2(y − H) · ∂(y − H)/∂H

= (y − H) · (−1)

= −(y − H)

Therefore the pseudo-residual is:

r = −∂ℓ/∂H = y − H

So for squared loss, gradient boosting fits ordinary residuals.

Connections #

Decision Trees

Bias-Variance Tradeoff

Cross-Validation

Gradient Descent

Regularization

Quality: A (4.3/5)

← back to treebrowse all →