Entropy

←Back to Tech Tree

inventorycoverage

Entropy #

Information TheoryDifficulty: ★★★☆☆Depth: 6Unlocks: 14

Measure of uncertainty/information content. H(X) = -sum p log p.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

H(X) - entropy of random variable XI(x) - surprisal of outcome x, I(x) = -log p(x)

Essential Relationships #

Prerequisites (2) #

Expected Value5 atomsCommon Distributions6 atoms

Unlocks (7) #

KL Divergencelvl 4Mutual Informationlvl 3Cross-Entropylvl 4Decision Treeslvl 4Rate-Distortion Theorylvl 5Channel Capacitylvl 5Information-Theoretic Boundslvl 5

Referenced by (1) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (1) #

[Tribal KnowledgeBusiness

Tribal knowledge is high-entropy information by definition - it carries high surprise value precisely because it is rare, undocumented, and not redundant with any written source. Understanding Shannon entropy clarifies why tribal knowledge is disproportionately valuable per bit](/business/tribal-knowledge/)

Advanced Learning Details

Graph Position #

45

Depth Cost

14

Fan-Out (ROI)

8

Bottleneck Score

6

Chain Length

Cognitive Load #

5

Atomic Elements

39

Total Elements

L2

Percentile Level

L3

Atomic Level

All Concepts (17) #

Teaching Strategy #

Self-serve tutorial - low prerequisites, straightforward concepts.

Why does a fair coin flip feel like “one bit” of uncertainty, while a coin that lands heads 99% of the time feels like almost none? Entropy is the mathematical tool that turns that intuition into a precise number—and it becomes the currency for compression, prediction, and learning.

TL;DR:

Entropy H(X) measures the average uncertainty (average “surprise”) of a random variable X. For discrete X with probabilities p(x), surprisal is I(x) = −log p(x), and entropy is the expected surprisal: H(X) = E[I(X)] = −∑ₓ p(x) log p(x). The log base sets the unit (base 2 → bits, base e → nats). Entropy is maximized by the uniform distribution and minimized (0) when outcomes are certain.

What Is Entropy? #

Why we need a new quantity (motivation) #

Probability tells you what tends to happen. But often you want a single number answering a different question:

A single event can be “surprising” or “unsurprising.” If something was already likely, learning it doesn’t teach you much. If it was unlikely, you learn a lot.

Entropy formalizes this by:

  1. defining surprisal for one outcome, then

  2. averaging surprisal according to the distribution.

Surprisal: information content of one outcome #

Let X be a discrete random variable taking values x with probability p(x).

We define the surprisal (also called self-information) of outcome x as:

I(x) = −log p(x)

Intuition:

The log base chooses units:

Changing base just rescales:

log_b p = (log_a p) / (log_a b)

Entropy: expected surprisal #

Surprisal is about one outcome. Entropy is about the whole random variable.

Definition (discrete entropy):

H(X) = E[I(X)]

= ∑ₓ p(x) I(x)

= ∑ₓ p(x) (−log p(x))

= −∑ₓ p(x) log p(x)

This is the canonical formula you see in information theory.

Intuition check with extremes #

H(X) = −1 · log 1 = 0

Zero uncertainty, zero information gained by observing.

H(X) = −∑ₓ (1/n) log(1/n)

= −n · (1/n) · (−log n)

= log n

This grows as the number of equally likely possibilities grows.

What entropy is (and is not) #

Entropy is:

Entropy is not:

Keep the mental model: entropy is average surprise.

Core Mechanic 1: Surprisal and the Log #

Why does information grow like a log? #

If an outcome has probability p, the “number of yes/no questions” you need to identify it (in the best case) behaves like log₂(1/p). This is because each binary question can cut the remaining possibilities roughly in half.

The log is the unique (up to scaling) function that turns multiplication into addition:

Concretely, if two independent events happen: A with p(A) and B with p(B), then:

p(A ∩ B) = p(A) p(B)

Surprisal of the joint event should be:

I(A ∩ B) = −log(p(A)p(B))

= −(log p(A) + log p(B))

= (−log p(A)) + (−log p(B))

= I(A) + I(B)

So the −log choice makes “information from independent events” add up cleanly.

Practical meanings of surprisal #

If we use log₂:

Each halving of probability adds one bit of surprisal.

From surprisal to entropy: averaging with expected value #

You already know expected value:

E[g(X)] = ∑ₓ p(x) g(x)

Entropy is simply that with g(x) = −log p(x):

H(X) = E[−log p(X)]

A key pacing idea: entropy is not mysterious new machinery—it’s “expected value” applied to the random variable “surprisal.”

Units and interpretation (bits vs nats) #

If you compute with log₂, H(X) is in bits.

If you compute with ln, H(X) is in nats.

Conversion:

H_bits(X) = H_nats(X) / ln 2

This matters because in machine learning, losses often use ln (because calculus is cleaner), but we still conceptually speak in bits.

The shape of uncertainty for a Bernoulli variable #

Let X ∈ {0,1} with P(X=1)=p and P(X=0)=1−p.

H(X) = −p log p − (1−p) log(1−p)

This function has a distinctive shape:

If logs are base 2, the maximum is 1 bit.

A small table: how surprisal behaves #

Probability p(x)Surprisal I(x)=−log₂ p(x)Interpretation
10No surprise
1/21One bit of info
1/42Two bits
0.1≈ 3.322More than 3 yes/no questions on average
0.01≈ 6.644Very surprising

Takeaway: surprisal is a smooth measure that grows rapidly as p(x) shrinks.

Core Mechanic 2: Properties of Entropy (and Why They Matter) #

Entropy is useful partly because it obeys clean, predictable rules. These rules let you reason about coding, uncertainty, and learning systems without re-deriving everything.

Property A: Non-negativity #

Because 0 ≤ p(x) ≤ 1, we have log p(x) ≤ 0, hence −p(x) log p(x) ≥ 0 for each term.

So:

H(X) = −∑ₓ p(x) log p(x) ≥ 0

Entropy can’t be negative.

Property B: Entropy is 0 iff X is certain #

If X always takes one value with probability 1, then H(X)=0.

Conversely, if H(X)=0, all terms −p(x) log p(x) must be 0. For any p(x) in (0,1), log p(x) < 0 and the term is positive, so the only way all terms vanish is when one outcome has probability 1 and the rest 0.

So entropy captures “no uncertainty.”

Property C: Maximum entropy for fixed support is uniform #

Suppose X can take n outcomes.

Why this matters: if you know only that there are n possible outcomes, “most uncertain” means “assume uniform.”

Property D: Chain rule (entropy of joint variables) #

Let (X,Y) be a pair of discrete random variables.

The chain rule says:

H(X,Y) = H(X) + H(Y|X)

Interpretation:

This mirrors how you might actually encode (X,Y): encode X, then encode Y using a code adapted to the conditional distribution P(Y|X).

A brief derivation sketch (using definitions):

H(X,Y)

= −∑ₓ ∑ᵧ p(x,y) log p(x,y)

= −∑ₓ ∑ᵧ p(x,y) log( p(x) p(y|x) )

= −∑ₓ ∑ᵧ p(x,y) [log p(x) + log p(y|x)]

= −∑ₓ ∑ᵧ p(x,y) log p(x) − ∑ₓ ∑ᵧ p(x,y) log p(y|x)

For the first term:

−∑ₓ ∑ᵧ p(x,y) log p(x)

= −∑ₓ log p(x) ∑ᵧ p(x,y)

= −∑ₓ log p(x) p(x)

= H(X)

For the second term, by definition that is H(Y|X).

Property E: Conditioning reduces entropy #

You should expect that knowing more can’t increase uncertainty:

H(X|Y) ≤ H(X)

Interpretation: if you observe Y, you can only become more certain (or equally uncertain) about X.

This idea is a cornerstone for later nodes like mutual information:

I(X;Y) = H(X) − H(X|Y)

Property F: Entropy and “effective number of outcomes” #

If H(X)=log n, it behaves like you have n equally likely outcomes.

This motivates the notion of “perplexity” used in language modeling:

perplexity(X) = 2^{H_bits(X)}

If H=10 bits, perplexity is 1024: it’s as if, on average, there are ~1024 equally plausible next tokens.

When entropy is the right tool #

Entropy answers questions like:

And importantly, it does so with algebraic rules that compose well across systems.

Application/Connection: Compression, Learning, and Decision-Making #

Entropy as a limit for lossless compression #

Imagine X generates symbols with distribution p(x). If you design an optimal prefix-free code (e.g., Huffman coding), the expected code length L̄ satisfies:

H(X) ≤ L̄ < H(X) + 1

Interpretation:

So entropy is not just “an abstract uncertainty”—it’s a concrete benchmark for storage and transmission.

Why entropy shows up in machine learning losses #

In classification, we often model a distribution q over labels given features, and reality follows p. The cross-entropy:

H(p,q) = −∑ₓ p(x) log q(x)

is the expected surprisal when you use q but the world is p.

This decomposes as:

H(p,q) = H(p) + KL(p‖q)

So minimizing cross-entropy (common in neural nets) is equivalent to minimizing KL divergence because H(p) is fixed.

Even if you don’t yet know KL deeply, you can see the story:

Decision trees: entropy as an impurity measure #

Decision trees choose splits that reduce uncertainty about the label Y.

At a node, compute entropy of labels:

H(Y) = −∑ᵧ p(y) log p(y)

After splitting on feature X, you get conditional entropy:

H(Y|X) = ∑ₓ p(x) H(Y|X=x)

Information gain is:

IG = H(Y) − H(Y|X)

So trees pick the feature that maximizes IG (largest reduction in label uncertainty).

Entropy as “how many questions remain?” #

If you use log₂, entropy is measured in bits, and a bit is literally one yes/no question.

So H(X)=3 bits means: with optimal questioning, you need about 3 binary questions on average to identify X.

This mental model helps you sanity-check:

Connections you’re about to unlock #

Once entropy feels natural, several powerful concepts become almost inevitable:

Entropy is the hub: it’s the “energy unit” in the economy of information.

Worked Examples (3) #

Entropy of a Bernoulli(p): why the maximum is at p = 1/2 #

Let X ∈ {0,1} with P(X=1)=p and P(X=0)=1−p. Compute H(X) (base 2), and verify it is maximized at p=1/2.

  1. Start from the definition:

    H(X) = −∑ₓ p(x) log₂ p(x)

  2. Plug in the two outcomes:

    H(p) = −p log₂ p − (1−p) log₂(1−p)

  3. Check a few values for intuition:

    • •p=0 ⇒ H=0
    • •p=1 ⇒ H=0
    • •p=1/2 ⇒ H = −(1/2)log₂(1/2) − (1/2)log₂(1/2)

    = −(1/2)(−1) − (1/2)(−1)

    = 1 bit

  4. To verify the maximum location, it is easier to differentiate using ln and convert later.

    Define h(p) = −p ln p − (1−p) ln(1−p). (This is entropy in nats.)

  5. Differentiate:

    ∂h/∂p = −(ln p + 1) − [−ln(1−p) − 1]

    = −ln p − 1 + ln(1−p) + 1

    = ln((1−p)/p)

  6. Set derivative to 0:

    ln((1−p)/p) = 0

    ⇒ (1−p)/p = 1

    ⇒ 1−p = p

    ⇒ p = 1/2

  7. Second derivative check:

    ∂²h/∂p² = ∂/∂p [ln(1−p) − ln p]

    = −1/(1−p) − 1/p

    < 0 for p ∈ (0,1)

    So p=1/2 is a maximum.

  8. Convert back to bits if needed:

    H_bits(p) = h(p) / ln 2, and at p=1/2 we get 1 bit.

Insight: Entropy is maximized when outcomes are as balanced as possible. For a binary variable, “most uncertain” means “fair coin,” giving exactly 1 bit of average information.

Entropy of a 4-symbol source and the ‘effective number of outcomes’ #

Let X take values {A,B,C,D} with probabilities p = (1/2, 1/4, 1/8, 1/8). Compute H(X) in bits and compute perplexity 2^{H(X)}.

  1. Write the entropy:

    H(X) = −∑ p(x) log₂ p(x)

  2. Compute each surprisal:

    I(A)=−log₂(1/2)=1

    I(B)=−log₂(1/4)=2

    I(C)=−log₂(1/8)=3

    I(D)=−log₂(1/8)=3

  3. Average them with probabilities:

    H(X) = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3

  4. Compute:

    (1/2)·1 = 0.5

    (1/4)·2 = 0.5

    (1/8)·3 = 0.375

    (1/8)·3 = 0.375

  5. Sum:

    H(X) = 0.5 + 0.5 + 0.375 + 0.375 = 1.75 bits

  6. Compute perplexity:

    perplexity = 2^{H} = 2^{1.75}

    = 2^{1} · 2^{0.75}

    = 2 · 2^{3/4}

    ≈ 2 · 1.6818

    ≈ 3.36

Insight: Although there are 4 possible outcomes, the distribution is skewed, so uncertainty behaves like having only ~3.36 equally likely outcomes.

Entropy reduction after observing a feature (information gain preview) #

A dataset has a binary label Y with P(Y=1)=3/4 and P(Y=0)=1/4. A binary feature X splits the data: when X=0 (prob 1/2), P(Y=1|X=0)=1; when X=1 (prob 1/2), P(Y=1|X=1)=1/2. Compute H(Y), H(Y|X), and the entropy reduction H(Y)−H(Y|X) (base 2).

  1. Compute the initial label entropy:

    H(Y) = −(3/4)log₂(3/4) − (1/4)log₂(1/4)

  2. Use known logs:

    log₂(1/4)=−2

    log₂(3/4)=log₂ 3 − log₂ 4 = log₂ 3 − 2 ≈ 1.585 − 2 = −0.415

  3. Plug in:

    H(Y) = −(3/4)(−0.415) − (1/4)(−2)

    = (3/4)(0.415) + (1/4)(2)

    ≈ 0.311 + 0.5

    ≈ 0.811 bits

  4. Compute conditional entropy via the split:

    H(Y|X) = ∑ₓ p(x) H(Y|X=x)

    = (1/2)H(Y|X=0) + (1/2)H(Y|X=1)

  5. For X=0, P(Y=1|X=0)=1 so the label is certain:

    H(Y|X=0)=0

  6. For X=1, P(Y=1|X=1)=1/2 so it is a fair coin:

    H(Y|X=1)=1 bit

  7. Therefore:

    H(Y|X) = (1/2)·0 + (1/2)·1 = 0.5 bits

  8. Entropy reduction (information gain idea):

    H(Y) − H(Y|X) ≈ 0.811 − 0.5 = 0.311 bits

Insight: A feature is useful when it makes the label distribution more certain. Here, X removes about 0.311 bits of uncertainty about Y on average—exactly the quantity decision trees try to maximize.

Key Takeaways #

Common Mistakes #

Practice #

easy

Compute the entropy (in bits) of a fair six-sided die roll. Then compute the entropy if the die is loaded so that P(1)=1/2 and P(2)=P(3)=P(4)=P(5)=P(6)=1/10.

Hint: Uniform over n outcomes has H=log₂ n. For the loaded die, compute −∑ p log₂ p and reuse that −log₂(1/2)=1 and −log₂(1/10)=log₂ 10 ≈ 3.322.

Show solution

Fair die: H = log₂ 6 ≈ 2.585 bits.

Loaded die:

H = −(1/2)log₂(1/2) − 5·(1/10)log₂(1/10)

= (1/2)·1 + 5·(1/10)·log₂ 10

= 0.5 + 0.5·3.322

≈ 0.5 + 1.661

≈ 2.161 bits.

medium

Let X take values {a,b,c} with probabilities (1/2, 1/3, 1/6). Compute H(X) in bits.

Hint: Compute each term p(x)·(−log₂ p(x)). You may use log₂ 3 ≈ 1.585 and log₂ 6 = log₂(2·3)=1+log₂ 3 ≈ 2.585.

Show solution

H(X)= −(1/2)log₂(1/2) − (1/3)log₂(1/3) − (1/6)log₂(1/6)

Compute surprisals:

−log₂(1/2)=1

−log₂(1/3)=log₂ 3 ≈ 1.585

−log₂(1/6)=log₂ 6 ≈ 2.585

Average:

H ≈ (1/2)·1 + (1/3)·1.585 + (1/6)·2.585

≈ 0.5 + 0.528 + 0.431

≈ 1.459 bits.

medium

Suppose X and Y are independent and each is a fair coin flip. Compute H(X), H(Y), and H(X,Y). Verify the chain rule H(X,Y)=H(X)+H(Y|X).

Hint: Fair coin entropy is 1 bit. Independence implies H(Y|X)=H(Y). Joint distribution has 4 equally likely outcomes.

Show solution

H(X)=1 bit and H(Y)=1 bit.

Because X and Y are independent, knowing X does not change uncertainty about Y, so H(Y|X)=H(Y)=1 bit.

The pair (X,Y) has 4 equally likely outcomes, so:

H(X,Y)=log₂ 4 = 2 bits.

Chain rule check:

H(X)+H(Y|X) = 1 + 1 = 2 = H(X,Y).

Connections #

Next nodes you can study:

Quality: A (4.6/5)

← back to treebrowse all →