Decision Trees

←Back to Tech Tree

inventorycoverage

Decision Trees #

Machine LearningDifficulty: ★★★★☆Depth: 9Unlocks: 1

Tree-structured classifiers. Information gain, Gini impurity.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

I(node) : impurity of a nodeDelta I(split) : impurity reduction = I(parent) - weighted average I(children)

Essential Relationships #

Prerequisites (2) #

Entropy5 atomsMachine Learning Introduction5 atoms

Unlocks (1) #

Ensemble Methodslvl 4

Referenced by (2) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (2) #

[TriageBusiness

A 'fever? yes/no' split is literally the first node of a decision tree - recursive binary partitioning on the most informative feature](/business/triage/)[decision treeBusiness

Direct conceptual analog - the personal finance flowchart is a hand-crafted decision tree where each node splits on a financial condition (has emergency fund? employer match available? debt above 7%?) and leaves are actions, paralleling tree-structured classifiers](/business/decision-tree/)

Advanced Learning Details

Graph Position #

117

Depth Cost

1

Fan-Out (ROI)

1

Bottleneck Score

9

Chain Length

Cognitive Load #

6

Atomic Elements

31

Total Elements

L1

Percentile Level

L4

Atomic Level

All Concepts (14) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Decision trees are the “flowcharts” of machine learning: they turn data into a sequence of simple questions (“Is age ≥ 30?”) that ends in an easily explained prediction. The power comes from choosing each question to make the data at the next step as pure (single-class) as possible.

TL;DR:

A decision tree classifier recursively partitions the feature space using feature-based tests. At each node, it chooses a split that maximizes impurity reduction ΔI(split) = I(parent) − ∑ (nᵢ/n) I(childᵢ), where I(·) is an impurity measure (commonly entropy or Gini). Leaves predict using the class distribution at that leaf (often majority class).

What Is a Decision Tree? #

A decision tree is a supervised learning model that predicts a class label by routing an input example through a tree of questions.

The intuition (why trees feel natural) #

Humans often make decisions by asking a sequence of questions:

Each answer narrows possibilities. A decision tree does the same, but it learns which questions to ask from labeled data.

The structure (internal nodes, edges, leaves) #

A classification decision tree consists of:

So the model is a recursive partitioning of the feature space: each split partitions the current region into subregions, and those can be split again.

Notation you’ll see throughout #

Assume K classes. At a node v:

We define a scalar impurity:

For a proposed split s that produces children v₁, …, v_m:

This ΔI(s) is the impurity reduction (also called information gain when using entropy).

What a leaf predicts #

A leaf summarizes the class distribution of the training samples that reached it. Two common outputs:

  1. Hard label (majority vote):
  1. Probabilities:

Trees are attractive because this is both interpretable (you can print the rules) and fast at inference (a handful of comparisons).

A key mental model: partitioning the space #

If your input is a feature vector x ∈ ℝᵈ, each internal node applies a constraint like xⱼ ≤ t. These constraints carve the space into axis-aligned rectangles (for continuous features). Each leaf corresponds to one region, and the tree assigns that region a label.

This is a big deal: decision trees are a way to build piecewise-constant decision boundaries without explicitly writing down complex formulas.

Core Mechanic 1: Node Impurity (Entropy and Gini) #

A tree learns by repeatedly asking: “If I split here, will the child nodes be more pure?” To answer that, we need a number that measures impurity.

Why impurity matters #

At a node where the classes are mixed (say 50/50), prediction is uncertain. At a node that is almost all one class (say 99/1), prediction is easy.

So we want a function I(v) that is:

Two classic choices are entropy and Gini impurity.

Entropy impurity #

Given class probabilities p₁, …, p_K at a node:

H(v) = − ∑ₖ pₖ log₂ pₖ

Properties:

When you use entropy in a tree, the split criterion is often called information gain.

Information gain from a split #

Suppose parent node has entropy H(parent). A split produces children with entropies H(vᵢ). Let nᵢ be child counts.

Weighted child entropy:

H(children) = ∑ᵢ (nᵢ/n) H(vᵢ)

Information gain:

IG(split) = H(parent) − H(children)

Notice this matches the general form:

ΔI(split) = I(parent) − ∑ᵢ (nᵢ/n) I(childᵢ)

with I = H.

Gini impurity #

For class probabilities p₁, …, p_K:

G(v) = 1 − ∑ₖ pₖ²

Interpretation:

For binary classification with p = P(class=1) and (1−p) = P(class=0):

G = 1 − (p² + (1−p)²)

Derive it explicitly:

G = 1 − (p² + (1 − 2p + p²))

= 1 − (1 − 2p + 2p²)

= 2p − 2p²

= 2p(1 − p)

So Gini is a simple parabola peaking at p = 0.5.

Entropy vs Gini (what’s the difference?) #

They behave similarly: both are 0 at purity, both peak at maximum mixing. In practice:

Most libraries let you choose. Either is fine as long as you understand the weighted impurity reduction idea.

CriterionImpurity I(v)Split scoreNotes
EntropyH = −∑ p log₂ pIG = H(parent) − ∑ (nᵢ/n)H(childᵢ)Classic ID3/C4.5 framing
GiniG = 1 − ∑ p²ΔG = G(parent) − ∑ (nᵢ/n)G(childᵢ)Default in CART; fast

Breathing room: what impurity is NOT #

Impurity is not “error rate.” A node could have majority class 60% (so error rate 40%) and still have a certain entropy or Gini. Impurity measures heterogeneity, not misclassification under a fixed rule.

This distinction matters because impurity functions are designed to make greedy splitting work well.

Core Mechanic 2: Choosing Splits via Impurity Reduction (Greedy Recursive Partitioning) #

A decision tree is usually trained with a greedy algorithm: at each node, choose the split that maximizes impurity reduction, then recurse.

Why greedy splitting? #

The global problem—finding the best possible tree of a given size—is combinatorial and generally intractable for large datasets. Greedy splitting is a practical compromise:

  1. At the current node, evaluate many candidate splits.

  2. Pick the best local split (largest ΔI).

  3. Recurse on children until a stopping rule.

This is why impurity functions matter so much: they are the local objective that guides the whole structure.

Candidate splits #

For numeric (continuous) features #

A common split form:

Candidates t are often chosen from sorted unique feature values (or midpoints between them). If the sorted values are a₁ ≤ a₂ ≤ … ≤ a_M, midpoints are:

tᵣ = (aᵣ + aᵣ₊₁)/2

Each candidate threshold produces a partition; compute ΔI and choose max.

For categorical features #

Common choices:

Binary splits over categories can be expensive if there are many categories (many possible subsets S).

Impurity reduction formula (the core computation) #

At node v, with impurity I(v), consider split s producing children v₁,…,v_m.

Let nᵢ be the number of training examples going to child i, and n be parent size.

Weighted child impurity:

I_after(s) = ∑ᵢ (nᵢ/n) I(vᵢ)

Impurity reduction:

ΔI(s) = I(v) − I_after(s)

Choose:

s* = argmax_s ΔI(s)

This is the tree’s “learning signal.”

A useful interpretation: weighted average purity #

Because the child impurities are weighted by nᵢ/n, the split prefers:

Stopping conditions (when do we stop splitting?) #

If you always split until every leaf is pure, you usually overfit. Common stopping rules:

These constraints control complexity.

Leaf prediction (revisited, with probabilities) #

At a leaf ℓ with counts nₖ:

pₖ = nₖ / n

Hard prediction:

ŷ = argmaxₖ pₖ

Probability output:

P(y = k | x in leaf ℓ) ≈ pₖ

Many implementations also apply smoothing (e.g., Laplace) to avoid 0 probabilities:

pₖ = (nₖ + α) / (n + Kα)

Computational note: how training scales #

At a high level:

Practical implementations optimize heavily, but conceptually the cost is tied to:

Breathing room: what the tree is really doing #

It’s easy to get lost in formulas. Keep the picture:

The impurity reduction formula is just the arithmetic that turns “closer to single-class” into a number the algorithm can maximize.

Application/Connection: Strengths, Weaknesses, and Why Trees Lead to Ensembles #

Decision trees are popular because they’re simple to explain and surprisingly powerful. But their weaknesses are exactly why ensemble methods (random forests, boosting) are so important.

Strengths of decision trees #

1) Interpretability #

A tree yields explicit rules:

You can inspect paths, feature usage, and thresholds.

2) Minimal preprocessing #

3) Captures interactions #

A split on x₁ followed by a split on x₂ naturally represents feature interactions:

Linear models require explicit interaction terms; trees get them structurally.

Weaknesses (and the motivations for fixes) #

1) High variance (instability) #

Small changes in the data can change early splits, which changes the entire tree.

This makes a single tree prone to overfitting, especially when allowed to grow deep.

2) Greedy splitting is myopic #

Each split is locally optimal, not globally optimal. The tree can commit early to a split that seems good now but blocks better structure later.

3) Axis-aligned splits can be inefficient #

A single split is xⱼ ≤ t. If the true boundary is diagonal, the tree may need many splits to approximate it.

Regularization strategies (before ensembles) #

To tame overfitting, we control tree complexity:

Cost-complexity pruning (high-level idea) #

You trade off fit and size:

Objective ≈ training_loss(tree) + λ · (#leaves)

Pruning removes subtrees that don’t justify their complexity.

Why trees unlock ensembles #

Trees are strong “base learners” because:

But because they are high variance, averaging many trees (bagging) stabilizes them.

Connection to Random Forests (preview) #

Random forests reduce correlation between trees by:

The end result is a much more robust predictor.

Connection to Boosting (preview) #

Boosting builds trees sequentially, where each tree focuses on correcting the errors of the previous ones. The trees are often shallow (stumps or small depth), and the ensemble forms a powerful model.

When to use a single decision tree #

A single tree can be the right choice when:

If predictive performance is the main goal, ensembles built from trees are often the next step.

Worked Examples (3) #

Compute Gini impurity and impurity reduction for a binary split #

At a parent node v, there are n = 10 samples with class counts: n₀ = 6, n₁ = 4. So p₀ = 0.6 and p₁ = 0.4.

A candidate split s produces two children:

Use Gini impurity I(v) = 1 − ∑ pₖ² and compute ΔI(s).

  1. Compute parent Gini:

    p₀ = 6/10 = 0.6

    p₁ = 4/10 = 0.4

    G(parent) = 1 − (0.6² + 0.4²)

    = 1 − (0.36 + 0.16)

    = 1 − 0.52

    = 0.48

  2. Compute left child Gini:

    Left has (4,0) so p₀ = 1, p₁ = 0

    G(L) = 1 − (1² + 0²)

    = 0

  3. Compute right child Gini:

    Right has (2,4) out of 6:

    p₀ = 2/6 = 1/3

    p₁ = 4/6 = 2/3

    G(R) = 1 − ((1/3)² + (2/3)²)

    = 1 − (1/9 + 4/9)

    = 1 − 5/9

    = 4/9

    ≈ 0.4444

  4. Compute weighted child impurity:

    n_L/n = 4/10 = 0.4

    n_R/n = 6/10 = 0.6

    G_after = 0.4·G(L) + 0.6·G(R)

    = 0.4·0 + 0.6·(4/9)

    = 0.6·4/9

    = 2.4/9

    = 0.2666…

  5. Compute impurity reduction:

    ΔG(s) = G(parent) − G_after

    = 0.48 − 0.2666…

    = 0.2133…

Insight: Even though the right child is still mixed, making the left child perfectly pure and reducing impurity in the larger right group creates a substantial weighted improvement. The weighting by child sizes is crucial: it prevents tiny pure splits from dominating unless they meaningfully improve the overall purity.

Information gain with entropy for a 3-class node and a multiway split #

A node v has n = 12 samples across 3 classes (A,B,C) with counts (6,3,3). Consider a split that produces 3 children (a multiway split) with counts:

Compute entropy H(v) in bits (log₂), the weighted child entropy, and the information gain IG.

  1. Compute parent probabilities:

    p_A = 6/12 = 0.5

    p_B = 3/12 = 0.25

    p_C = 3/12 = 0.25

  2. Compute parent entropy:

    H(parent) = −[0.5 log₂(0.5) + 0.25 log₂(0.25) + 0.25 log₂(0.25)]

    Use log₂(0.5) = −1 and log₂(0.25) = −2:

    H(parent) = −[0.5(−1) + 0.25(−2) + 0.25(−2)]

    = −[−0.5 −0.5 −0.5]

    = 1.5 bits

  3. Compute child entropies:

    Child 1: (4,0,0) ⇒ probabilities (1,0,0)

    H₁ = 0

    Child 2: (2,3,0) out of 5 ⇒ (0.4, 0.6, 0)

    H₂ = −[0.4 log₂(0.4) + 0.6 log₂(0.6)]

    Compute approximately:

    log₂(0.4) ≈ −1.3219

    log₂(0.6) ≈ −0.73697

    H₂ ≈ −[0.4(−1.3219) + 0.6(−0.73697)]

    ≈ −[−0.52876 −0.44218]

    ≈ 0.97094 bits

    Child 3: (0,0,3) ⇒ probabilities (0,0,1)

    H₃ = 0

  4. Compute weighted child entropy:

    Weights: n₁/n = 4/12 = 1/3

    n₂/n = 5/12

    n₃/n = 3/12 = 1/4

    H(children) = (1/3)·0 + (5/12)·0.97094 + (1/4)·0

    = (5/12)·0.97094

    ≈ 0.40456 bits

  5. Compute information gain:

    IG = H(parent) − H(children)

    = 1.5 − 0.40456

    ≈ 1.09544 bits

Insight: Multiway splits can produce very pure children (here, two children have entropy 0). Entropy makes the gain interpretable as “bits of uncertainty removed” about the class label after you know which branch the example follows.

From a trained leaf distribution to predictions and probabilities (and why it matters) #

A leaf ℓ contains n = 20 training samples with class counts: (class 0: 7, class 1: 13).

  1. What hard class label does the leaf predict?

  2. What probability estimate does it output?

  3. If you apply Laplace smoothing with α = 1 for K = 2 classes, what are the smoothed probabilities?

  1. Compute empirical probabilities:

    p₀ = 7/20 = 0.35

    p₁ = 13/20 = 0.65

  2. Hard label (majority class):

    ŷ = argmax(p₀, p₁) = 1

  3. Probability output without smoothing:

    P(y=0 | ℓ) ≈ 0.35

    P(y=1 | ℓ) ≈ 0.65

  4. Laplace smoothing with α=1 and K=2:

    p₀' = (n₀ + α) / (n + Kα)

    = (7 + 1) / (20 + 2)

    = 8/22

    ≈ 0.3636

    p₁' = (n₁ + α) / (n + Kα)

    = (13 + 1) / (22)

    = 14/22

    ≈ 0.6364

Insight: Leaves store distributions, not just labels. That distribution becomes a probability model (often used for calibration, decision thresholds, and downstream costs). Smoothing matters most when leaves are small, preventing 0/1 probabilities that can be overconfident.

Key Takeaways #

Common Mistakes #

Practice #

easy

A node has class counts (9 positive, 3 negative). Compute (a) Gini impurity and (b) entropy in bits (log₂).

Hint: Use p₊ = 9/12 and p₋ = 3/12. G = 1 − (p₊² + p₋²). H = −[p₊ log₂ p₊ + p₋ log₂ p₋].

Show solution

p₊ = 9/12 = 0.75, p₋ = 0.25.

(a) G = 1 − (0.75² + 0.25²)

= 1 − (0.5625 + 0.0625)

= 1 − 0.625

= 0.375.

(b) H = −[0.75 log₂(0.75) + 0.25 log₂(0.25)].

log₂(0.25) = −2.

log₂(0.75) ≈ −0.4150.

So H ≈ −[0.75(−0.4150) + 0.25(−2)]

≈ −[−0.3113 −0.5]

≈ 0.8113 bits.

medium

A parent node has 30 samples with class counts (18, 12). A candidate split produces:

Compute the Gini impurity reduction ΔG.

Hint: Compute G(parent), G(left), G(right), then G_after = (10/30)G(left) + (20/30)G(right), then subtract.

Show solution

Parent probabilities: p₁ = 18/30 = 0.6, p₂ = 12/30 = 0.4.

G(parent) = 1 − (0.6² + 0.4²) = 1 − (0.36 + 0.16) = 0.48.

Left: (9,1) ⇒ p = 0.9 and 0.1.

G(L) = 1 − (0.9² + 0.1²) = 1 − (0.81 + 0.01) = 0.18.

Right: (9,11) out of 20 ⇒ p = 0.45 and 0.55.

G(R) = 1 − (0.45² + 0.55²)

= 1 − (0.2025 + 0.3025)

= 1 − 0.505

= 0.495.

Weighted child impurity:

G_after = (10/30)·0.18 + (20/30)·0.495

= (1/3)·0.18 + (2/3)·0.495

= 0.06 + 0.33

= 0.39.

Impurity reduction:

ΔG = 0.48 − 0.39 = 0.09.

hard

You are training a tree and at some node you have two candidate splits s₁ and s₂. The node has n = 100 samples.

Assuming the parent impurity is I(parent)=0.32, which split is chosen by impurity reduction? Show the computation.

Hint: Compute I_after(s) as the weighted average of child impurities, then ΔI(s) = 0.32 − I_after(s).

Show solution

For s₁:

I_after(s₁) = (90/100)·0.30 + (10/100)·0.00

= 0.27 + 0

= 0.27.

ΔI(s₁) = 0.32 − 0.27 = 0.05.

For s₂:

I_after(s₂) = (50/100)·0.20 + (50/100)·0.20

= 0.10 + 0.10

= 0.20.

ΔI(s₂) = 0.32 − 0.20 = 0.12.

Since ΔI(s₂) > ΔI(s₁), the algorithm chooses s₂.

Interpretation: even though s₁ creates one perfectly pure small child, s₂ substantially reduces impurity for half the data and yields a larger overall weighted improvement.

Connections #

Next, decision trees become the building block for ensembles:

Related prerequisites and helpful refreshers:

Quality: A (4.4/5)

← back to treebrowse all →