Bayes Theorem


Bayes Theorem #

Probability & Statistics | Difficulty: ★★☆☆☆ | Depth: 3 | Unlocks: 19

P(A|B) = P(B|A)P(A)/P(B). Updating beliefs with evidence.


Core Concepts #

Essential Relationships #

Prerequisites (1) #

Conditional Probability (6 atoms)

Unlocks (1) #

Bayesian Inference (lvl 4)

Referenced by (2) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (1) #

[Full-Cycle Recruiting (Business)

Each recruiting stage (screen, phone, onsite, reference) is a Bayesian update on candidate-role fit. Strong recruiters implicitly run P(good hire | signal) calculations, updating priors from resume through final round. This framing explains why structured interviews (better likelihoods) outperform unstructured ones.](/business/full-cycle-recruiting/)

From Money (1) #

[Pre-Tax vs Post-Tax (Money)

Update tax rate expectations as career evidence accumulates](/money/pre-tax-vs-post-tax/)

Advanced Learning Details

Graph Position #

Depth Cost: 23

Fan-Out (ROI): 19

Bottleneck Score: 10

Chain Length: 3

Cognitive Load #

Atomic Elements: 5

Total Elements: 19

Percentile Level: L0

Atomic Level: L3

All Concepts (9) #

Teaching Strategy #

Self-serve tutorial - low prerequisites, straightforward concepts.

If you know how to compute P(A|B), Bayes’ Theorem teaches you how to “turn it around” into P(B|A)—and, more importantly, how to update beliefs about a hypothesis when new evidence arrives.

TL;DR:

Bayes’ Theorem is

P(A|B) = P(B|A)P(A)/P(B).

Interpretation:

It is a rule for rational belief-updating under uncertainty.

What Is Bayes’ Theorem? #

Why this concept exists (motivation) #

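In real problems you often face a direction mismatch: you know, or can model, P(evidence | hypothesis), but the question you care about is P(hypothesis | evidence).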

Bayes’ Theorem bridges this gap. It’s also the mathematical backbone of “updating beliefs with evidence”: start with an initial belief (a prior), observe data (the evidence), and compute an updated belief (the posterior).

The statement #

For events A and B, with P(B) > 0:

P(A|B) = P(B|A)P(A) / P(B)

You’ll see several names for each term: P(A) is the prior, P(B|A) is the likelihood, P(B) is the evidence, and P(A|B) is the posterior.

Even if the formula feels simple, the meaning is subtle: probability is redistributed across hypotheses when evidence arrives.

Where it comes from (definition-level derivation) #

Bayes’ Theorem is not a “special trick”—it is a direct consequence of the definition of conditional probability.

Start with the definition:

P(A|B) = P(A ∩ B) / P(B)

Also:

P(B|A) = P(A ∩ B) / P(A)

Solve the second equation for P(A ∩ B):

P(A ∩ B) = P(B|A)P(A)

Plug into the first equation:

P(A|B)

= P(A ∩ B) / P(B)

= [P(B|A)P(A)] / P(B)

That’s Bayes’ Theorem.
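The derivation can be checked numerically. The sketch below uses a made-up table of joint counts (the numbers are illustrative assumptions, not data from this tutorial) and computes P(A|B) two ways: directly from the definition, and through Bayes' Theorem.

```python
# Hypothetical joint counts over 1,000 trials (illustrative numbers only).
counts = {
    (True, True): 90,     # A and B
    (True, False): 10,    # A and not B
    (False, True): 180,   # not A and B
    (False, False): 720,  # neither A nor B
}
total = sum(counts.values())

p_a = (counts[(True, True)] + counts[(True, False)]) / total
p_b = (counts[(True, True)] + counts[(False, True)]) / total
p_a_and_b = counts[(True, True)] / total

# Definition of conditional probability, applied in both directions.
p_a_given_b_direct = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Bayes' Theorem: rebuild P(A|B) from P(B|A), P(A), and P(B).
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b_direct, p_a_given_b_bayes)
```

Both routes give the same value (up to floating-point rounding), which is all the theorem claims: it is a rearrangement of the definition, not an extra assumption.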

Intuition in one sentence #

Posterior = (how well A predicts the evidence) × (how plausible A was) ÷ (how probable the evidence is overall). The more surprising the evidence (small P(B)), the larger the update.

A note about language: hypothesis vs evidence #

Often we treat A as the hypothesis and B as the evidence.

But the theorem is symmetric: you can swap roles depending on what you’re trying to compute.

When it’s most useful #

Bayes is most useful when:

  1. You can model P(B|A) (data given hypothesis) more easily than P(A|B).

  2. You have multiple competing hypotheses (A₁, A₂, …) and want to update which is most plausible.

  3. Base rates matter (priors), and ignoring them would lead to bad conclusions.

Core Mechanic 1: Priors, Likelihoods, and Posteriors (Belief Updating) #

Why separate probability into these pieces? #

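When you write P(A|B) directly, it can hide structure. Bayes explicitly factors the update into a prior P(A), a likelihood P(B|A), and an evidence term P(B).

This decomposition is powerful because each piece comes from a different source: the prior from base rates or historical data, the likelihood from a model of how A produces B, and the evidence from the law of total probability.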

The “belief update” view #

Suppose A is a hypothesis and B is a newly observed fact.

Bayes:

P(A|B) ∝ P(B|A)P(A)

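Read “∝” as “proportional to.” This is the key intuition: the posterior is the likelihood times the prior, rescaled so that the probabilities over all hypotheses sum to 1.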

This proportionality form is often how you reason informally:

  1. Hypotheses that better predict the evidence get boosted.

  2. Hypotheses that poorly predict the evidence get penalized.

  3. Priors still matter—rare things stay rare unless the evidence is very strong.

Discrete hypotheses: odds-style thinking #

Imagine two hypotheses A and ¬A (not A). Using Bayes on both:

P(A|B) = P(B|A)P(A)/P(B)

P(¬A|B) = P(B|¬A)P(¬A)/P(B)

Take the ratio (posterior odds):

P(A|B) / P(¬A|B)

= [P(B|A)P(A)] / [P(B|¬A)P(¬A)]

This shows:

Posterior odds = Likelihood ratio × Prior odds

Where likelihood ratio = P(B|A) / P(B|¬A).

This is useful because the annoying P(B) cancels, and you can see the “strength of evidence” as a ratio.
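A minimal sketch of the odds form, assuming a prior strictly below 1 and a nonzero P(B|¬A). The function names and the input values are illustrative choices, not part of the original text.

```python
def posterior_odds(prior, p_b_given_a, p_b_given_not_a):
    """Posterior odds of A versus not-A after observing B."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = p_b_given_a / p_b_given_not_a
    return likelihood_ratio * prior_odds


def odds_to_probability(odds):
    """Convert odds o = p / (1 - p) back to the probability p."""
    return odds / (1 + odds)


# Illustrative values (assumptions): a rare hypothesis and fairly strong evidence.
odds = posterior_odds(prior=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
probability = odds_to_probability(odds)

print(odds)         # likelihood ratio (18) times prior odds (1/99)
print(probability)  # same value Bayes' Theorem gives with P(B) computed explicitly
```

Note that P(B) never appears in the code: the odds form only needs the two likelihoods and the prior.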

What each term means (carefully) #

Let’s tie each term to an interpretation you can check:

A common conceptual trap is to confuse P(B|A) with P(A|B). They can be wildly different.

A miniature numeric example (no full story yet) #

Suppose P(A) = 0.01, P(B|A) = 0.90, and P(B|¬A) = 0.05, so that P(¬A) = 0.99.

We’ll compute P(A|B), but first note: we need P(B). That’s not optional.

Using total probability:

P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)

Compute:

P(B)

= 0.90·0.01 + 0.05·0.99

= 0.009 + 0.0495

= 0.0585

Now Bayes:

P(A|B)

= (0.90·0.01) / 0.0585

= 0.009 / 0.0585

≈ 0.1538

Even with strong evidence (a likelihood ratio of 0.90/0.05 = 18), a rare prior keeps the posterior low: about 15%, not 90%.
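The same calculation can be written as a small function. This is a sketch for the two-hypothesis case only; the name `bayes_binary` is a hypothetical helper, not something defined elsewhere in this tutorial.

```python
def bayes_binary(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) for the two-hypothesis case A vs not-A."""
    numerator = p_b_given_a * prior
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A).
    evidence = numerator + p_b_given_not_a * (1 - prior)
    if evidence == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return numerator / evidence


posterior = bayes_binary(prior=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(round(posterior, 4))  # 0.1538
```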

Summary table: “parts of Bayes” #

| Name | Symbol | Meaning | Typical source |
| --- | --- | --- | --- |
| Prior | P(A) | Belief before seeing B | Base rate, historical data |
| Likelihood | P(B\|A) | Evidence frequency if A true | |
| Evidence | P(B) | Overall chance of evidence | Computed via total probability |
| Posterior | P(A\|B) | Updated belief after B | |

Core Mechanic 2: Computing the Evidence with the Law of Total Probability #

Why P(B) is the “normalizing constant” #

Bayes’ Theorem divides by P(B). This ensures that the posterior is a valid probability.

If you compute unnormalized weights:

w(A) = P(B|A)P(A)

then the normalized posterior is:

P(A|B) = w(A) / ∑ₖ w(Aₖ)

Where {Aₖ} are mutually exclusive, exhaustive hypotheses.

So P(B) is exactly:

P(B) = ∑ₖ P(B|Aₖ)P(Aₖ)

That is the Law of Total Probability.

Two-hypothesis case: A vs ¬A #

If hypotheses are A and ¬A:

P(B)

= P(B|A)P(A) + P(B|¬A)P(¬A)

This formula is the workhorse behind medical-test and spam-filter calculations.

Multi-class case: A₁, A₂, …, Aₙ #

If you have n hypotheses A₁, …, Aₙ that are mutually exclusive and exhaustive (exactly one of them is true), then:

P(B) = ∑ᵢ P(B|Aᵢ)P(Aᵢ)

And Bayes becomes:

P(Aᵢ|B) = P(B|Aᵢ)P(Aᵢ) / ∑ⱼ P(B|Aⱼ)P(Aⱼ)
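The multi-hypothesis formula translates almost line for line into code. In the sketch below the function name, the hypothesis labels, and the numbers are all illustrative assumptions; the priors are assumed to cover mutually exclusive, exhaustive hypotheses.

```python
def posterior(priors, likelihoods):
    """Map each hypothesis to P(hypothesis | B).

    priors:      {hypothesis: P(hypothesis)}, expected to sum to 1
    likelihoods: {hypothesis: P(B | hypothesis)}
    """
    if abs(sum(priors.values()) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    # Unnormalized weights: likelihood times prior for each hypothesis.
    weights = {h: likelihoods[h] * priors[h] for h in priors}
    # Evidence P(B) is the sum of the weights (law of total probability).
    evidence = sum(weights.values())
    if evidence == 0:
        raise ValueError("P(B) is zero under every hypothesis")
    return {h: w / evidence for h, w in weights.items()}


# Hypothetical inputs for three hypotheses.
priors = {"A1": 0.5, "A2": 0.3, "A3": 0.2}
likelihoods = {"A1": 0.1, "A2": 0.4, "A3": 0.7}
result = posterior(priors, likelihoods)
print(result)
```

With these inputs A3 starts as the least likely hypothesis and ends as the most likely one, because it predicts the evidence best.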

Why this matters conceptually #

The evidence term answers: “How often would we see B regardless of which hypothesis is true?”

This matches everyday reasoning: a surprising clue carries more information than a mundane one.

A common technique: compute numerator and denominator separately #

When doing Bayes problems by hand:

  1. Compute the numerator: P(B|A)P(A).

  2. Compute P(B) using total probability.

  3. Divide.

This prevents mistakes like forgetting the ¬A term or miscomputing complements.

Visualization intuition: probability mass reallocation #

Think of the prior probabilities across hypotheses as “mass” that sums to 1.

This is the essence of Bayesian updating.

Small checklist for correctness #

Before finalizing a Bayes calculation:

Application/Connection: Interpreting Tests, Filters, and Simple Classification #

Why Bayes shows up everywhere #

Bayes’ Theorem is the simplest formal model of learning from data:

Even many modern ML systems can be described as “compute something proportional to likelihood × prior, then normalize.”

Medical testing (base rate matters) #

Medical test problems are the classic Bayes showcase because humans often ignore priors.

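Key terms you’ll see: sensitivity is P(positive | disease), specificity is P(negative | no disease), and the false positive rate is 1 − specificity.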

What you usually want is:

P(disease | positive)

That is Bayes with A = disease, B = positive.

The important lesson: even a highly accurate test can yield many false positives when the disease is rare.
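To see how strongly the base rate drives the answer, the sketch below holds the test fixed and varies only the prevalence. The sensitivity and specificity values are assumed for illustration, and `positive_predictive_value` is a hypothetical helper name.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive) via Bayes' Theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same assumed test (99% sensitivity, 95% specificity), different base rates.
prevalences = (0.001, 0.01, 0.1, 0.5)
ppvs = [
    positive_predictive_value(p, sensitivity=0.99, specificity=0.95)
    for p in prevalences
]
for p, ppv in zip(prevalences, ppvs):
    print(f"prevalence={p:<6} P(disease | positive)={ppv:.3f}")
```

The posterior rises steadily with prevalence, so the same positive result means very different things in a low-risk and a high-risk population.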

Spam filtering / text classification (discrete version) #

Suppose A = “email is spam” and B = “email contains the word ‘free’”.

Then Bayes says:

P(spam | contains ‘free’)

= P(contains ‘free’ | spam)P(spam) / P(contains ‘free’)

This is the skeleton of naïve Bayes classifiers (where B is many word-features). You will later learn more advanced versions, but the core update logic is identical.
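A rough sketch of how the single-word update extends to several words, assuming the words are conditionally independent given the class (the "naïve" assumption). The per-word probabilities are made up, every word is assumed to be in both tables with nonzero probability, and only words that are present are scored; a real classifier would also need smoothing.

```python
import math


def spam_posterior(words, p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | words), treating words as independent given the class."""
    # Work in log space so products of many small numbers do not underflow.
    log_spam = math.log(p_spam)
    log_ham = math.log(1 - p_spam)
    for word in words:
        log_spam += math.log(p_word_given_spam[word])
        log_ham += math.log(p_word_given_ham[word])
    # Normalize: subtract the max before exponentiating for stability.
    top = max(log_spam, log_ham)
    w_spam = math.exp(log_spam - top)
    w_ham = math.exp(log_ham - top)
    return w_spam / (w_spam + w_ham)


# Hypothetical per-word probabilities (assumptions for illustration).
p_word_given_spam = {"free": 0.30, "winner": 0.10}
p_word_given_ham = {"free": 0.03, "winner": 0.01}

one_word = spam_posterior(["free"], 0.2, p_word_given_spam, p_word_given_ham)
two_words = spam_posterior(
    ["free", "winner"], 0.2, p_word_given_spam, p_word_given_ham
)
print(one_word, two_words)
```

Each additional spam-like word multiplies in another likelihood ratio, so the posterior climbs as evidence accumulates.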

Sensor fusion / robotics (belief update repeated over time) #

In tracking problems:

Repeated Bayes updates over time lead to filters like the Kalman filter and particle filter (conceptually Bayesian, though implementation details differ).

Bayes as the gateway to Bayesian inference #

Bayes’ Theorem for events is the entry point to Bayes for distributions.

Event version:

P(A|B) = P(B|A)P(A)/P(B)

Distribution version (preview):

p(θ|D) ∝ p(D|θ)p(θ)

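Where θ is the unknown parameter (or parameters), D is the observed data, p(θ) is the prior, p(D|θ) is the likelihood, and p(θ|D) is the posterior.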

This node unlocks that next step.
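As a preview, the distribution version can be approximated with the same "weights then normalize" idea by treating a grid of θ values as a long list of hypotheses. Everything below is an illustrative assumption: θ is a coin's probability of heads, the prior is uniform, and the data are 7 heads in 10 flips.

```python
# Grid of candidate values for theta = P(heads); the step size is arbitrary.
grid = [i / 100 for i in range(101)]
prior = [1 / len(grid)] * len(grid)  # uniform prior (an assumption)

heads, tails = 7, 3  # hypothetical data D

# Unnormalized posterior: likelihood p(D | theta) times prior p(theta).
weights = [
    (theta ** heads) * ((1 - theta) ** tails) * p
    for theta, p in zip(grid, prior)
]
evidence = sum(weights)
posterior = [w / evidence for w in weights]

most_probable = grid[posterior.index(max(posterior))]
print(most_probable)  # 0.7
```

Each grid point plays the role of one hypothesis Aᵢ, and the normalization step is the same sum that defines P(B) in the event version.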

Quick comparison: Frequentist vs Bayesian (high-level) #

| Aspect | Frequentist (very rough) | Bayesian (very rough) |
| --- | --- | --- |
| Probability means | Long-run frequency | Degree of belief (coherent with axioms) |
| Parameters | Fixed unknown constants | Random variables with priors |
| Output | Point estimates, confidence intervals | Posterior distributions, credible intervals |

You don’t need to “choose a side” to use Bayes’ Theorem; you just need to be clear about what probabilities represent in your problem.

Worked Examples (3) #

Medical test: compute P(disease | positive) #

A disease affects 1% of the population. A test has sensitivity 99% and specificity 95%.

Let A = “person has the disease” and B = “test is positive”.

Given:

P(A) = 0.01

P(B|A) = 0.99

Specificity = P(negative|¬A) = 0.95 ⇒ P(positive|¬A) = P(B|¬A) = 0.05

Goal: compute P(A|B).

  1. Compute the complement prior:

    P(¬A) = 1 − P(A)

    = 1 − 0.01

    = 0.99

  2. Compute evidence via total probability:

    P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)

    = 0.99·0.01 + 0.05·0.99

    = 0.0099 + 0.0495

    = 0.0594

  3. Apply Bayes’ Theorem:

    P(A|B) = P(B|A)P(A) / P(B)

    = (0.99·0.01) / 0.0594

    = 0.0099 / 0.0594

    ≈ 0.1667

  4. Interpretation:

    Even with a good test, the posterior is ≈ 16.7%, not ≈ 99%, because false positives among the many healthy people are substantial when the disease is rare.

Insight: The base rate (prior) can dominate. A positive result is evidence, but it’s not the same as near-certainty unless the test is extremely specific or the disease is common.
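One way to make this result feel less surprising is to count people instead of multiplying probabilities. The sketch below applies the same rates to a hypothetical cohort of 100,000 (the cohort size is an arbitrary assumption chosen so the counts come out whole).

```python
population = 100_000  # hypothetical cohort size

diseased = population * 1 // 100       # 1% prevalence
healthy = population - diseased
true_positives = diseased * 99 // 100  # 99% sensitivity
false_positives = healthy * 5 // 100   # 5% false positive rate (1 - specificity)

all_positives = true_positives + false_positives
ppv = true_positives / all_positives

print(true_positives, false_positives)  # 990 4950
print(round(ppv, 4))                    # 0.1667
```

False positives outnumber true positives five to one, simply because there are 99 times as many healthy people to produce them.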

Factory defects: infer which machine produced an item #

A factory has two machines making the same part.

Let:

A₁ = “part came from Machine 1”

A₂ = “part came from Machine 2”

B = “part is defective”

Given:

P(A₁)=0.70, P(A₂)=0.30

P(B|A₁)=0.02, P(B|A₂)=0.05

Goal: compute P(A₂|B) (probability it came from Machine 2 given defect).

  1. Compute evidence (defect probability overall):

    P(B) = P(B|A₁)P(A₁) + P(B|A₂)P(A₂)

    = 0.02·0.70 + 0.05·0.30

    = 0.014 + 0.015

    = 0.029

  2. Apply Bayes for Machine 2:

    P(A₂|B) = P(B|A₂)P(A₂) / P(B)

    = (0.05·0.30) / 0.029

    = 0.015 / 0.029

    ≈ 0.5172

  3. Optional: compute P(A₁|B) as a sanity check:

    P(A₁|B) = (0.02·0.70)/0.029

    = 0.014/0.029

    ≈ 0.4828

    And indeed 0.5172 + 0.4828 = 1.

Insight: Even though Machine 2 produces fewer parts (lower prior), a defect strongly shifts probability toward it because its likelihood of defect is higher.
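The answer can also be sanity-checked by simulation. The sketch below is a simple Monte Carlo check, not part of the original solution: it generates parts using the given probabilities, keeps only the defective ones, and measures what fraction came from Machine 2. The sample size and seed are arbitrary choices.

```python
import random


def simulate_machine2_share(n_parts, seed=0):
    """Estimate P(A2 | B) as the share of defective parts made by Machine 2."""
    rng = random.Random(seed)
    defects_total = 0
    defects_from_m2 = 0
    for _ in range(n_parts):
        from_m2 = rng.random() < 0.30         # P(A2) = 0.30
        p_defect = 0.05 if from_m2 else 0.02  # P(B|A2) or P(B|A1)
        if rng.random() < p_defect:
            defects_total += 1
            defects_from_m2 += from_m2        # True counts as 1
    return defects_from_m2 / defects_total


estimate = simulate_machine2_share(400_000)
print(estimate)  # should land near 0.015 / 0.029, up to sampling noise
```

Conditioning on B corresponds to throwing away every non-defective part, which is why the estimate depends only on the defective subset.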

Evidence as a normalizer: compute posteriors from unnormalized weights #

You have three hypotheses about a coin:

A₁: fair (P(heads)=0.5)

A₂: biased toward heads (P(heads)=0.8)

A₃: biased toward tails (P(heads)=0.2)

Your prior beliefs are:

P(A₁)=0.6, P(A₂)=0.2, P(A₃)=0.2

You flip once and observe B = “heads”.

Goal: compute P(Aᵢ|heads) for i=1..3 using weights and normalization.

  1. Compute unnormalized weights w(Aᵢ)=P(B|Aᵢ)P(Aᵢ):

    w(A₁) = 0.5·0.6 = 0.30

    w(A₂) = 0.8·0.2 = 0.16

    w(A₃) = 0.2·0.2 = 0.04

  2. Compute evidence as the sum of weights:

    P(B) = ∑ᵢ w(Aᵢ)

    = 0.30 + 0.16 + 0.04

    = 0.50

  3. Normalize to get posteriors:

    P(A₁|heads)=0.30/0.50=0.60

    P(A₂|heads)=0.16/0.50=0.32

    P(A₃|heads)=0.04/0.50=0.08

  4. Interpretation:

    One heads result increases belief in the heads-biased coin and decreases belief in the tails-biased coin, while the fair coin remains most probable due to its strong prior.

Insight: Computing Bayes via “weights then normalize” generalizes cleanly to many hypotheses and avoids repeatedly recomputing P(B) from scratch.
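The same weights-then-normalize step can be repeated as more flips arrive, with each posterior becoming the next prior. The sketch below extends the example to three heads in a row, which goes beyond the single flip above; it assumes flips are independent given the coin, and the labels and function name are illustrative.

```python
P_HEADS = {"fair": 0.5, "heads-biased": 0.8, "tails-biased": 0.2}


def update(beliefs, flip):
    """One Bayes update of the beliefs for a single flip ('H' or 'T')."""
    weights = {}
    for coin, prior in beliefs.items():
        likelihood = P_HEADS[coin] if flip == "H" else 1 - P_HEADS[coin]
        weights[coin] = likelihood * prior
    evidence = sum(weights.values())
    return {coin: w / evidence for coin, w in weights.items()}


beliefs = {"fair": 0.6, "heads-biased": 0.2, "tails-biased": 0.2}
history = []
for flip in "HHH":  # hypothetical sequence of observed flips
    beliefs = update(beliefs, flip)
    history.append(beliefs)
    print(flip, {coin: round(p, 4) for coin, p in beliefs.items()})
```

After the first flip the beliefs match the posteriors computed above; by the third head the heads-biased coin has overtaken the fair coin despite its weaker prior.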

Key Takeaways #

Common Mistakes #

Practice #

easy

A spam filter flags an email if it contains the word “winner”. Suppose:

P(spam)=0.2,

P(“winner”|spam)=0.6,

P(“winner”|not spam)=0.05.

Compute P(spam|“winner”).

Hint: Compute P(“winner”) = P(“winner”|spam)P(spam) + P(“winner”|¬spam)P(¬spam), then apply Bayes.

Show solution

Let A=spam, B=contains “winner”.

P(A)=0.2, P(¬A)=0.8.

P(B)=0.6·0.2 + 0.05·0.8

=0.12 + 0.04

=0.16.

P(A|B)=P(B|A)P(A)/P(B)

=(0.6·0.2)/0.16

=0.12/0.16

=0.75.

medium

Two coins are in a box. Coin 1 is fair. Coin 2 lands heads with probability 0.9. You pick a coin uniformly at random and flip it once; it shows heads. What is P(you picked Coin 2 | heads)?

Hint: Use hypotheses A₁ (fair) and A₂ (biased). Priors are 0.5 and 0.5. Evidence is heads.

Show solution

Let A₂=Coin 2 chosen, B=heads.

P(A₂)=0.5, P(A₁)=0.5.

P(B|A₂)=0.9, P(B|A₁)=0.5.

P(B)=0.9·0.5 + 0.5·0.5

=0.45 + 0.25

=0.70.

P(A₂|B)=(0.9·0.5)/0.70

=0.45/0.70

≈ 0.6429.

hard

A test for a condition has sensitivity 0.97 and specificity 0.98. The condition prevalence is 0.4%. If a person tests positive, compute P(condition | positive). Then explain in one or two sentences why the result is not close to 97%.

Hint: Convert specificity to false positive rate: P(positive|¬condition)=1−0.98. Use P(condition)=0.004.

Show solution

Let A=condition, B=positive.

P(A)=0.004, P(¬A)=0.996.

P(B|A)=0.97.

Specificity=0.98 ⇒ P(B|¬A)=0.02.

P(B)=0.97·0.004 + 0.02·0.996

=0.00388 + 0.01992

=0.02380.

P(A|B)=(0.97·0.004)/0.02380

=0.00388/0.02380

≈ 0.1630 (≈ 16.3%).

Explanation: although the test is sensitive, the condition is rare, so false positives among the many healthy people contribute heavily to positive results.

Connections #

Next you’ll generalize this event-based rule to distributions and parameters in Bayesian Inference.

Related foundations:

Related applications (later nodes often build on Bayes):

Quality: A (4.5/5)
