Causal Inference


Causal Inference #

Probability & Statistics · Difficulty: ★★★★★ · Depth: 8 · Unlocks: 0

Distinguishing correlation from causation. DAGs, do-calculus.


Core Concepts #

Key Symbols & Notation #

do(X = x) (the do-operator denoting intervention)

Essential Relationships #

Prerequisites (2) #

Bayesian Inference (5 atoms), Topological Sort (5 atoms)

Advanced Learning Details

Graph Position #

Depth Cost: 127
Fan-Out (ROI): 0
Bottleneck Score: 0
Chain Length: 8

Cognitive Load #

Atomic Elements: 5
Total Elements: 49
Percentile Level: L4
Atomic Level: L3

All Concepts (26) #

Teaching Strategy #

Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.

You see two variables move together: a medicine and recovery, education and income, rain and umbrella sales. Your brain wants a story—one causes the other. Causal inference is the discipline of turning that story into a testable, formal claim: “If I intervene and force X to be x, how would Y change?”

TL;DR:

Causal inference distinguishes observing from doing. A causal DAG encodes assumptions about how variables generate one another. The causal effect is defined by an interventional distribution P(Y | do(X=x)). Identification asks whether P(Y | do(X)) can be computed from observational data P(·) given the DAG. Do-calculus provides sound transformation rules to rewrite interventional queries into observational ones when possible (often via backdoor, frontdoor, or more general adjustments).

What Is Causal Inference? #

Why you can’t get causation “for free” from correlation #

In everyday data analysis, you often estimate something like P(Y | X=x): the distribution of outcomes among units where X happens to equal x. That is an observational quantity. It answers: “Among the units I observed with X=x, what does Y look like?”

A causal question is different. It asks: “If I were to force X to be x (possibly contrary to what it would naturally be), what would Y look like?” That is P(Y | do(X=x)). The do-operator is not a stylistic choice; it marks a different data-generating regime.

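A classic trap is to read P(recovery | took the drug) as if it were P(recovery | do(take the drug)).

These are not equivalent because of confounding: some variable Z (e.g., disease severity) might influence both drug-taking and recovery. Then P(Y | X) mixes the effect of the drug with the effect of Z on who ends up taking it.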

The core objects: interventions, DAGs, and identification #

Causal inference (in the Pearl/DAG framework) revolves around three atomic ideas:

  1. Intervention (do-operator)
  2. Causal directed acyclic graph (DAG)
  3. Identification (often via do-calculus)

Observing vs doing: same symbols, different worlds #

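It’s tempting to read P(Y | X=x) and P(Y | do(X=x)) as “conditional probabilities with different notation.” But their meanings differ:

  • P(Y | X=x) describes the units that happened to have X=x, under the mechanisms that naturally produced X.
  • P(Y | do(X=x)) describes all units in a regime where X is set to x regardless of its usual causes.

A helpful mental model: observing filters the existing population down to the units with X=x; doing keeps every unit and overwrites how X is generated.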

Structural causal model intuition (without getting lost) #

A common semantics for DAGs is a set of structural assignments, one per variable. For two variables X and Y:

X := f_X(Pa(X), Uₓ)
Y := f_Y(Pa(Y), Uᵧ)

where Uₓ, Uᵧ are exogenous “noise” terms.

Intervening do(X=x) replaces the equation for X with:

X := x

Everything else stays the same. This formalizes “cutting incoming arrows into X.”
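To make the equation-replacement idea concrete, here is a minimal simulation sketch. The graph (Z → X, Z → Y, X → Y), the function `sample_unit`, and every probability in it are hypothetical choices made for illustration; they do not come from the lesson. The only thing the sketch is meant to show is that do(X=1) swaps out the assignment for X and leaves the other assignments alone.

```python
import random

def sample_unit(rng, do_x=None):
    """Draw one unit from a toy SCM; if do_x is given, X's equation becomes X := do_x."""
    u_z, u_x, u_y = rng.random(), rng.random(), rng.random()  # exogenous noise terms
    z = int(u_z < 0.5)                          # Z := f_Z(U_Z)
    if do_x is None:
        x = int(u_x < (0.8 if z else 0.2))      # X := f_X(Z, U_X)
    else:
        x = do_x                                # intervention: the arrow Z -> X is cut
    y = int(u_y < 0.3 + 0.2 * x + 0.4 * z)      # Y := f_Y(X, Z, U_Y), left unchanged
    return z, x, y

def mean_y(units):
    return sum(y for _, _, y in units) / len(units)

rng = random.Random(0)
n = 200_000
observed = [sample_unit(rng) for _ in range(n)]
intervened = [sample_unit(rng, do_x=1) for _ in range(n)]

p_obs = mean_y([u for u in observed if u[1] == 1])  # estimates P(Y=1 | X=1)
p_do = mean_y(intervened)                           # estimates P(Y=1 | do(X=1))
print(f"P(Y=1 | X=1)     ~ {p_obs:.3f}")
print(f"P(Y=1 | do(X=1)) ~ {p_do:.3f}")
```

Under these made-up numbers the two quantities differ (analytically 0.82 versus 0.70), because units with X=1 in the observational regime are disproportionately units with Z=1.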

Why topological order matters (a prerequisite connection) #

Because a causal graph is acyclic, it admits a topological order. That’s not just an algorithmic convenience: it reflects a generative ordering—causes upstream, effects downstream. When we factor a joint distribution according to a DAG,

P(V₁, …, Vₙ) = ∏ᵢ P(Vᵢ | Pa(Vᵢ))

we are implicitly using that ordering. Under intervention, we alter exactly one (or more) of those factors.
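As a small illustration of the link to topological sorting, the sketch below orders a hypothetical three-node DAG with the standard-library `graphlib` module and prints the factorization with and without an intervention. The graph and the helper `factorization` are assumptions made for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DAG, written as node -> list of parents.
parents = {"Z": [], "X": ["Z"], "Y": ["X", "Z"]}

# static_order() yields each node only after all of its parents.
order = list(TopologicalSorter(parents).static_order())

def factorization(parents, order, intervened=()):
    """List the factors P(V | Pa(V)) in topological order, skipping intervened nodes."""
    factors = []
    for v in order:
        if v in intervened:
            continue  # do(v) deletes the factor P(v | Pa(v))
        pa = ", ".join(parents[v])
        factors.append(f"P({v} | {pa})" if pa else f"P({v})")
    return factors

print(order)
print(" * ".join(factorization(parents, order)))
print(" * ".join(factorization(parents, order, intervened={"X"})))
```

The last line differs from the one before it only by the missing factor for X, which is the sense in which an intervention alters exactly one factor.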

What causal inference promises—and what it requires #

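Causal inference can answer “what if we change X?” questions from observational data, but only if you are willing to state your causal assumptions explicitly (for example as a DAG) and to check that the query is identifiable under them.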

This is a feature, not a bug: it forces you to separate what the data say from what your causal assumptions add.

Core Mechanic 1: Causal DAGs, Paths, and the Difference Between Confounding and Selection #

Why DAGs are the right abstraction #

If you only use probability tables, every dependence can be “explained” by many stories. DAGs provide a language of mechanisms: they constrain which variables can directly influence which others. That constraint lets you reason about which associations are causal and which are spurious.

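A DAG is a directed acyclic graph where:

  • nodes are variables,
  • a directed edge A → B means A may directly influence B,
  • no variable is its own ancestor (there are no directed cycles).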

The DAG does not automatically tell you the effect size; it tells you what adjustments are needed to estimate causal effects from observational data.

Three fundamental causal motifs #

Most causal reasoning problems are combinations of three motifs:

1) Chain (mediation) #

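X → M → Y

X affects Y through the mediator M. The path is open by default; conditioning on M blocks it.

2) Fork (confounding) #

X ← Z → Y

Z is a common cause of X and Y. The path is open by default and carries a non-causal association; conditioning on Z blocks it.

3) Collider (selection) #

X → C ← Y

C is a common effect of X and Y. The path is blocked by default; conditioning on C, or on any descendant of C, opens it.

This collider rule is one of the most counterintuitive pieces of causal inference and is responsible for many real-world errors (e.g., selection bias, Berkson’s paradox).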

d-separation: reading independences from a DAG #

A DAG implies conditional independence constraints via d-separation. The idea is to determine whether all paths between two sets of variables are blocked by a conditioning set.

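A path is blocked by a conditioning set S if it contains:

  • a chain (A → B → C) or a fork (A ← B → C) whose middle node B is in S, or
  • a collider (A → B ← C) such that neither B nor any descendant of B is in S.

Two sets of variables are d-separated given S when every path between them is blocked. While the full formal definition can be dense, the practical payoff is clear: you can read from the graph alone which conditioning sets remove an association and which ones create one.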
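A d-separation check can be sketched in a few lines. The version below uses the moralized ancestral graph test, which is a standard equivalent of the path-blocking definition rather than a literal path enumeration. The representation (a dict from node to parents) and the function names are assumptions made for this example.

```python
def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, ys, zs):
    """True if every path between xs and ys is blocked given zs."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors_of(parents, xs | ys | zs)      # 1. ancestral subgraph
    adj = {n: set() for n in keep}
    for child in keep:                              # 2. moralize and drop directions
        pa = list(parents.get(child, ()))
        for i, p in enumerate(pa):
            adj[child].add(p)
            adj[p].add(child)
            for q in pa[i + 1:]:                    # "marry" co-parents
                adj[p].add(q)
                adj[q].add(p)
    frontier = [x for x in xs if x not in zs]       # 3. delete zs, test connectivity
    seen = set(frontier)
    while frontier:
        node = frontier.pop()
        if node in ys:
            return False
        for nxt in adj[node]:
            if nxt not in zs and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True

chain = {"M": ["X"], "Y": ["M"]}                    # X -> M -> Y
fork = {"X": ["Z"], "Y": ["Z"]}                     # X <- Z -> Y
collider = {"C": ["X", "Y"], "D": ["C"]}            # X -> C <- Y, C -> D

print(d_separated(chain, {"X"}, {"Y"}, {"M"}))      # conditioning on the mediator
print(d_separated(fork, {"X"}, {"Y"}, {"Z"}))       # conditioning on the common cause
print(d_separated(collider, {"X"}, {"Y"}, set()))   # collider left alone
print(d_separated(collider, {"X"}, {"Y"}, {"D"}))   # conditioning on a collider's descendant
```

The four calls reproduce the three motifs: the first three report separation, and the last one reports that conditioning on a descendant of the collider connects X and Y.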

Backdoor paths: the key to confounding #

When estimating the causal effect of X on Y, you worry about paths that create association between X and Y that is not due to the directed causal influence X → … → Y.

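A backdoor path from X to Y is any path between X and Y that begins with an arrow pointing into X.

Example: in the graph Z → X, Z → Y, X → Y, the path X ← Z → Y is a backdoor path, while X → Y is not.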

Backdoor paths matter because they transmit non-causal association. If you can block all backdoor paths (without introducing new bias), then the remaining association corresponds to the causal effect.

Adjustment sets (backdoor criterion) #

A set of variables S satisfies the backdoor criterion relative to (X, Y) if:

  1. No node in S is a descendant of X.

  2. S blocks every backdoor path from X to Y.

If such an S exists, then the causal effect is identified via adjustment:

P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)

This looks like “standardize over S,” but the why is causal: conditioning on S blocks spurious paths, and averaging over S restores the overall population distribution.
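The adjustment formula can be checked numerically on a small joint distribution. The sketch below assumes a hypothetical binary model for Z → X, Z → Y, X → Y (all numbers invented for illustration) and compares the naive conditional P(Y=1 | X=x) with the adjusted quantity ∑_z P(Y=1 | X=x, Z=z) P(Z=z).

```python
from itertools import product

def bern(p1, v):
    """Probability that a Bernoulli(p1) variable takes value v."""
    return p1 if v == 1 else 1.0 - p1

# Hypothetical mechanisms: P(Z=1), P(X=1 | Z), P(Y=1 | X, Z).
joint = {
    (z, x, y): bern(0.5, z) * bern(0.8 if z else 0.2, x) * bern(0.3 + 0.2 * x + 0.4 * z, y)
    for z, x, y in product((0, 1), repeat=3)
}

def prob(event):
    """Probability of the set of (z, x, y) outcomes for which event(z, x, y) is true."""
    return sum(p for (z, x, y), p in joint.items() if event(z, x, y))

def naive(x0):
    """P(Y=1 | X=x0): plain conditioning, no adjustment."""
    return prob(lambda z, x, y: x == x0 and y == 1) / prob(lambda z, x, y: x == x0)

def backdoor(x0):
    """sum_z P(Y=1 | X=x0, Z=z) P(Z=z): adjustment for S = {Z}."""
    total = 0.0
    for z0 in (0, 1):
        p_y = (prob(lambda z, x, y: z == z0 and x == x0 and y == 1)
               / prob(lambda z, x, y: z == z0 and x == x0))
        total += p_y * prob(lambda z, x, y: z == z0)
    return total

print(f"naive difference:    {naive(1) - naive(0):.2f}")
print(f"adjusted difference: {backdoor(1) - backdoor(0):.2f}")
```

In this toy model the naive contrast overstates the effect, because the weights inside `naive` are P(Z=z | X=x) rather than the population weights P(Z=z) used by `backdoor`.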

Confounding vs selection: two different dangers #

It’s easy to say “we’ll control for variables.” But which variables you control for matters.

A compact comparison:

| Structure | Graph | What happens if you condition on the middle node? | Typical name |
| --- | --- | --- | --- |
| Fork | X ← Z → Y | Blocks spurious association (good) | Confounding |
| Collider | X → C ← Y | Creates spurious association (bad) | Selection bias |
| Chain | X → M → Y | Blocks mediated effect (depends on goal) | Mediation |

Latent variables and hidden confounding #

Real systems often include unobserved variables U (e.g., genetics, socioeconomic factors). If U causes both X and Y, you have an unblocked backdoor path X ← U → Y, but you can’t condition on U.

Then adjustment may fail even if you “control for everything you measured.” This is where identification becomes subtle: you may still identify effects via alternative structures (e.g., frontdoor), instruments, or additional assumptions.

A note on notation: vectors vs scalar variables #

You may see causal effects with covariate vectors Z. The adjustment formula generalizes:

P(Y | do(X=x)) = ∑_z P(Y | X=x, Z=z) P(Z=z)

(For continuous variables, replace ∑ with ∫.)

The important part is conceptual: you’re averaging the conditional outcome model over the marginal distribution of confounders, not the conditioned-on distribution within a particular X group.

Core Mechanic 2: Interventions, Identification, and Do-Calculus #

Why we need identification rules #

A DAG plus observational data gives you P(V) and conditional distributions like P(Y | X, Z). Your causal target, however, is interventional: P(Y | do(X)).

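Identification asks: can P(Y | do(X)) be computed from the observational distribution P(V), given the assumptions encoded in the DAG?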

This matters because randomized experiments directly approximate do(X) by design, but observational studies do not. Identification is the bridge.

The truncated factorization (intervention) formula #

If the joint distribution factorizes along the DAG as:

P(V) = ∏ᵢ P(Vᵢ | Pa(Vᵢ))

then intervening do(X=x) yields:

P(V | do(X=x)) = 𝟙[X=x] ∏_{Vᵢ ≠ X} P(Vᵢ | Pa(Vᵢ))

Equivalently, for the post-intervention distribution over non-intervened variables:

P(V \ {X} | do(X=x)) = ∏_{Vᵢ ≠ X} P(Vᵢ | Pa(Vᵢ)), with X set to x wherever it appears among the parents

This is the formal “cut incoming arrows into X” idea.
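A minimal sketch of the truncated factorization, assuming a hypothetical binary model for Z → X, Z → Y, X → Y (the conditional probabilities are invented for this example). The post-intervention joint is built by dropping the factor for X and multiplying by the indicator 𝟙[X=x].

```python
from itertools import product

def bern(p1, v):
    """Probability that a Bernoulli(p1) variable takes value v."""
    return p1 if v == 1 else 1.0 - p1

# Hypothetical factors of the observational joint.
def p_z(z):
    return bern(0.5, z)

def p_x_given_z(x, z):
    return bern(0.8 if z else 0.2, x)

def p_y_given_xz(y, x, z):
    return bern(0.3 + 0.2 * x + 0.4 * z, y)

def observational(z, x, y):
    """Full factorization: P(z) P(x | z) P(y | x, z)."""
    return p_z(z) * p_x_given_z(x, z) * p_y_given_xz(y, x, z)

def interventional(z, x, y, do_x):
    """Truncated factorization: the factor P(x | z) is replaced by 1[x == do_x]."""
    indicator = 1.0 if x == do_x else 0.0
    return indicator * p_z(z) * p_y_given_xz(y, x, z)

states = list(product((0, 1), repeat=3))
total_mass = sum(interventional(z, x, y, do_x=1) for z, x, y in states)
p_y1_do = sum(interventional(z, x, 1, do_x=1) for z, x in product((0, 1), repeat=2))
p_z1_do = sum(interventional(1, x, y, do_x=1) for x, y in product((0, 1), repeat=2))

print(f"total mass         = {total_mass:.2f}")
print(f"P(Y=1 | do(X=1))   = {p_y1_do:.2f}")
print(f"P(Z=1 | do(X=1))   = {p_z1_do:.2f}")
```

The marginal of Z is untouched by the intervention, which is what deleting only X's factor should produce for a non-descendant of X.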

Backdoor and frontdoor as special identification results #

Two major identification patterns appear so often they deserve names.

Backdoor adjustment (confounding control) #

If S satisfies the backdoor criterion, then:

P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)

Frontdoor adjustment (mediated identification under hidden confounding) #

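Frontdoor applies when a mediator M satisfies three conditions:

  1. M intercepts all directed paths from X to Y.
  2. There is no unblocked backdoor path from X to M.
  3. All backdoor paths from M to Y are blocked by X.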

Then:

P(Y | do(X=x)) = ∑ₘ P(M=m | X=x) ∑_{x'} P(Y | M=m, X=x') P(X=x')

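Intuition: the first factor, P(M=m | X=x), is the effect of X on M, which is unconfounded. The inner sum is P(Y | do(M=m)), obtained by backdoor adjustment for X. Chaining the two gives the effect of X on Y.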

Do-calculus: the general tool #

Backdoor/frontdoor are consequences of do-calculus. Do-calculus provides rules for transforming expressions like P(Y | do(X), Z) into forms where the do(·) can be removed.

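You don’t need to memorize all details to use causal inference effectively, but you do need to understand what the rules do: each one licenses a single kind of rewrite (adding or dropping an observation, swapping an action for an observation, adding or dropping an action) whenever a d-separation condition holds in a suitably modified graph.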

The three rules (high-level) #

Let X, Y, Z, W be disjoint sets of nodes. Let G be the causal DAG.

We define modified graphs:

  • G_{\bar{X}}: G with all arrows pointing into nodes of X removed.
  • G_{\underline{Z}}: G with all arrows pointing out of nodes of Z removed.
  • G_{\bar{X}\underline{Z}}: both modifications applied.

Then the do-calculus rules (informally) are:

  1. Insertion/deletion of observations

If Y is d-separated from Z given X and W in G_{\bar{X}}, then:

P(Y | do(X), Z, W) = P(Y | do(X), W)

  2. Action/observation exchange

If Y is d-separated from Z given X and W in G_{\bar{X}\underline{Z}}, then:

P(Y | do(X), do(Z), W) = P(Y | do(X), Z, W)

  3. Insertion/deletion of actions

If Y is d-separated from Z given X and W in G_{\bar{X}\overline{Z(W)}}, where Z(W) is the set of nodes in Z that are not ancestors of any node in W in G_{\bar{X}}, then:

P(Y | do(X), do(Z), W) = P(Y | do(X), W)

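These statements can look intimidating because of the graph modifications, but the core idea is consistent: each rule checks one d-separation in a surgically modified graph and, if it holds, licenses one rewrite of the query.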
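The graph modifications themselves are simple edge deletions. The sketch below represents a graph as a set of (parent, child) pairs; the helper names `cut_incoming` and `cut_outgoing` and the example graph are assumptions made for illustration.

```python
def cut_incoming(edges, nodes):
    """Graph with every arrow INTO `nodes` removed (the 'bar' modification)."""
    return {(a, b) for a, b in edges if b not in nodes}

def cut_outgoing(edges, nodes):
    """Graph with every arrow OUT OF `nodes` removed (the 'underline' modification)."""
    return {(a, b) for a, b in edges if a not in nodes}

# Hypothetical DAG: Z -> X, Z -> Y, X -> Y.
edges = {("Z", "X"), ("Z", "Y"), ("X", "Y")}

g_bar_x = cut_incoming(edges, {"X"})      # used when X is intervened on
g_under_x = cut_outgoing(edges, {"X"})    # used when testing action/observation exchange

print(sorted(g_bar_x))
print(sorted(g_under_x))
```

In `g_under_x` the only remaining connection between X and Y is X ← Z → Y, which conditioning on Z blocks. That is the d-separation Rule 2 needs in order to replace do(X) with an observation of X given Z, and it is one way to see backdoor adjustment as a consequence of the rules.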

Identification workflow (practical) #

When handed a causal query, a DAG, and observational data, a practical workflow is:

  1. State the target: e.g., P(Y | do(X=x)).

  2. List candidate adjustment variables using backdoor (if possible).

  3. If backdoor fails due to unobserved confounding, check frontdoor conditions.

  4. If neither applies, use general do-calculus / algorithmic tools (e.g., ID algorithm) to test identification.

  5. If not identifiable, redesign: collect more variables, use instruments, exploit experiments, or accept partial identification bounds.

A careful derivation: why adjustment works #

Suppose S blocks all backdoor paths from X to Y and contains no descendants of X.

We want to show:

P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)

Sketch with “show your work” steps (conceptual algebra):

  1. Start with law of total probability under intervention:

P(Y | do(X=x))

= ∑ₛ P(Y, S=s | do(X=x))

= ∑ₛ P(Y | S=s, do(X=x)) P(S=s | do(X=x))

  2. Because S contains no descendants of X and we intervene only on X, the distribution of S is unchanged:

P(S=s | do(X=x)) = P(S=s)

  3. Because S blocks every backdoor path from X to Y, setting X=x and observing X=x carry the same information about Y once we condition on S (this is Rule 2, action/observation exchange):

P(Y | S=s, do(X=x)) = P(Y | X=x, S=s)

  4. Substitute back:

P(Y | do(X=x))

= ∑ₛ P(Y | X=x, S=s) P(S=s)

The DAG is doing the heavy lifting in steps (2) and (3). The formula is “statistics,” but the justification is “causality.”

Estimation vs identification (don’t conflate them) #

Identification is symbolic: it tells you what expression equals the causal effect in the population.

Estimation is numerical: given finite data, how do you estimate that expression?

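Even if a causal effect is identifiable, estimation can still be hard due to finite samples, sparse or empty strata of the adjustment set (poor overlap between treatment groups), high-dimensional covariates, and misspecified outcome or treatment models.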

This lesson focuses on identification logic, but keep the distinction clear: do-calculus answers “can we,” not “how well with this dataset.”
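To illustrate the gap between the two, the sketch below applies a plug-in (stratified) estimator of the backdoor formula to simulated samples of two sizes. The data-generating process, the helper names, and the sample sizes are hypothetical; under this toy model the population ATE is 0.2 by construction.

```python
import random
from collections import Counter

rng = random.Random(1)

def draw():
    """One unit from a hypothetical model with Z -> X, Z -> Y, X -> Y."""
    z = int(rng.random() < 0.5)
    x = int(rng.random() < (0.8 if z else 0.2))
    y = int(rng.random() < 0.3 + 0.2 * x + 0.4 * z)
    return z, x, y

def plug_in_ate(data):
    """Stratify on Z, compare treated vs untreated within strata, weight by P(Z=z)."""
    n = len(data)
    n_z = Counter(z for z, _, _ in data)
    n_zx = Counter((z, x) for z, x, _ in data)
    n_zx_y1 = Counter((z, x) for z, x, y in data if y == 1)
    ate = 0.0
    for z, count in n_z.items():
        if n_zx[(z, 0)] == 0 or n_zx[(z, 1)] == 0:
            # Identification still holds, but this sample cannot estimate the stratum.
            raise ValueError(f"no overlap in stratum Z={z}")
        diff = n_zx_y1[(z, 1)] / n_zx[(z, 1)] - n_zx_y1[(z, 0)] / n_zx[(z, 0)]
        ate += diff * count / n
    return ate

small = plug_in_ate([draw() for _ in range(200)])
large = plug_in_ate([draw() for _ in range(200_000)])
print(f"n=200:     ATE estimate {small:+.3f}")
print(f"n=200000:  ATE estimate {large:+.3f}")
```

The estimand is the same in both runs; only the sampling noise changes. The `ValueError` branch marks the case where a stratum has no treated or no untreated units, which is an estimation problem rather than an identification problem.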

Application/Connection: From Causal Questions to Analysis Plans #

Why causal inference shows up everywhere #

Modern ML and data science frequently optimize predictive accuracy: estimate P(Y | X). But decision-making needs causal quantities: what happens to Y if we change X?

Examples:

In each case, you are implicitly asking about P(Y | do(X)).

Turning a vague question into a causal estimand #

A strong habit: translate English into an estimand.

Once you have the estimand, you can ask: is it identifiable from available data and assumptions?

Choosing an identification strategy: a decision table #

| Situation | Typical DAG symptom | Identification move | Notes |
| --- | --- | --- | --- |
| Measured confounding | X ← Z → Y with observed Z | Backdoor adjustment | Most common, but don’t condition on colliders |
| Hidden confounding but observed mediator | X → M → Y and X ↔ Y confounded, but M measured | Frontdoor adjustment | Requires strong structural assumptions |
| Randomized experiment | do(X) approximated by design | No adjustment needed (in principle) | Still adjust for precision / imbalance |
| Selection bias | Conditioning on S where X → S ← Y | Avoid / model selection | Often sneaks in via “only users who…” |
| Unidentifiable effect | No valid adjustment; hidden confounding; feedback loops that a DAG cannot represent | Collect more variables, use instruments, partial ID | Honest “can’t answer” is a valid result |

Example connection: Bayesian inference as estimation machinery #

You already know Bayesian inference. Once identification gives you an estimand like:

P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)

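you can estimate the components using Bayesian models: an outcome model for P(Y | X=x, S=s) and a model for the covariate distribution P(S=s), with posterior draws of both propagated through the sum.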

Causal inference tells you what to estimate; Bayesian inference tells you how to estimate with uncertainty.
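One minimal way to do this for binary variables is a Beta-Binomial model per stratum. In the sketch below the counts, the Beta(1, 1) priors, and the function name `posterior_draw_ate` are all hypothetical; the point is only that each posterior draw of the components yields one posterior draw of the adjusted effect.

```python
import random

rng = random.Random(2)

# Hypothetical data: outcomes[(s, x)] = (number with Y=1, number of units).
outcomes = {(0, 0): (24, 80), (0, 1): (10, 20), (1, 0): (14, 20), (1, 1): (72, 80)}
n_s = {0: 100, 1: 100}  # units per stratum of the adjustment variable S

def posterior_draw_ate():
    """One posterior draw of sum_s [P(Y=1|X=1,s) - P(Y=1|X=0,s)] P(s), Beta(1,1) priors."""
    p_s1 = rng.betavariate(1 + n_s[1], 1 + n_s[0])       # draw of P(S=1)
    ate = 0.0
    for s, p_s in ((0, 1.0 - p_s1), (1, p_s1)):
        theta = {}
        for x in (0, 1):
            k, n = outcomes[(s, x)]
            theta[x] = rng.betavariate(1 + k, 1 + n - k)  # draw of P(Y=1 | X=x, S=s)
        ate += (theta[1] - theta[0]) * p_s
    return ate

draws = sorted(posterior_draw_ate() for _ in range(20_000))
post_mean = sum(draws) / len(draws)
lo, hi = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]
print(f"posterior mean ATE {post_mean:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

The width of the interval is driven mostly by the two small cells (20 units each), which is the kind of information a point estimate of the same formula would hide.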

Causal inference and ML: where the DAG matters #

Some common ML pitfalls become clearer with DAGs:

  1. Target leakage

If a feature is a descendant of the label (Y → Feature), the model will “predict” using consequences of Y. That’s not causally meaningful.

  2. Controlling for post-treatment variables

If you adjust for a mediator M (X → M → Y) while trying to estimate the total effect of X, you block the part of the effect that flows through M, so the estimate no longer equals the total effect.

  3. Selection on engagement

Analyzing “active users only” can create collider bias: Feature change (X) affects engagement (C), and user satisfaction (Y) affects engagement; conditioning on active users selects on C.

A realistic end-to-end analysis plan (template) #

When you face a causal question in practice:

  1. Specify variables: treatment X, outcome Y, potential confounders Z, mediators M, selection variables S.

  2. Draw a DAG: even if imperfect, it makes assumptions explicit.

  3. Decide estimand: ATE, conditional effect, policy value, etc.

  4. Identify using backdoor/frontdoor/do-calculus.

  5. Assess assumptions:

  6. Estimate with appropriate methods (regression adjustment, matching, IPW, doubly robust, Bayesian models).

  7. Sensitivity analysis for unmeasured confounding.

The conceptual leap is steps (2)–(4): that is what causal inference adds beyond standard statistics.

Worked Examples (3) #

Backdoor adjustment: estimating the effect of a treatment with a measured confounder #

Variables: X = treatment (0/1), Y = recovery (0/1), Z = severity (0/1). DAG: Z → X, Z → Y, and X → Y. Goal: identify P(Y | do(X=1)) and the ATE.

Assume Z is observed.

    1. Identify backdoor paths from X to Y.
    • There is a directed causal path X → Y (this is the effect we want).
    • There is a backdoor path X ← Z → Y (starts with arrow into X).
    2. Choose an adjustment set S.
    • Candidate: S = {Z}.
    • Check backdoor criterion:

    (i) Z is not a descendant of X (true).

    (ii) Conditioning on Z blocks the path X ← Z → Y (true).

    So {Z} is a valid backdoor adjustment set.

    3. Write the adjustment formula.

    P(Y | do(X=1)) = ∑_z P(Y | X=1, Z=z) P(Z=z)

    Similarly,

    P(Y | do(X=0)) = ∑_z P(Y | X=0, Z=z) P(Z=z)

    4. Convert to an ATE expression (difference in expectations).

    Since Y is binary, E[Y | do(X=x)] = P(Y=1 | do(X=x)).

    ATE = E[Y | do(X=1)] − E[Y | do(X=0)]

    = ∑_z [P(Y=1 | X=1, Z=z) − P(Y=1 | X=0, Z=z)] P(Z=z).

    5. Interpret.

    Within each severity stratum Z=z, compare treated vs untreated (a like-for-like comparison), then average those differences over how common each severity level is in the population.

Insight: The key move is not “control for Z because it predicts Y,” but “control for Z because it opens/closes causal paths.” Z is required because it is a common cause of X and Y. The DAG explains why the standardization ∑_z (…) P(Z=z) matches an intervention.

Collider bias: why adjusting for a selection variable can create a fake causal effect #

Variables: X = skill (continuous), Y = friendliness (continuous), C = hired (0/1). DAG: X → C ← Y. You only observe people who are hired (C=1) and then you compute the correlation between X and Y.

Question: Can conditioning on C create an association between X and Y even if they are independent marginally?

    1. Read the DAG.
    • X and Y both cause C.
    • There is no arrow between X and Y, so the model allows X ⟂ Y marginally (they can be independent in the population).
    2. Identify the path between X and Y.
    • There is one path: X → C ← Y.
    • C is a collider on this path.
    3. Apply collider rule.
    • If we do not condition on C (and none of C’s descendants are conditioned on), the path through the collider is blocked.

    So X and Y can remain independent in the overall population.

    4. Condition on C=1 (selection).
    • Conditioning on a collider opens the path X → C ← Y.
    • Intuition: among hired people, if someone has low X (skill), they must “compensate” with higher Y (friendliness) to be hired, and vice versa.

    This induces a negative association between X and Y within the selected set.

    5. Consequence.

    If you regress Y on X using only hired people, you may conclude X ‘causes’ Y or at least that they are strongly related, but the association is an artifact of conditioning on C.

Insight: “Control for everything you can” is not safe. Conditioning is an operation on distributions, not a free improvement. DAGs tell you when conditioning opens paths (colliders) and thereby manufactures correlations that were not present before.
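The hiring example is easy to reproduce in simulation. In the sketch below, skill and friendliness are independent standard normal draws, and the hiring rule C = 1 when X + Y > 1 is a hypothetical threshold chosen for illustration; the helper `pearson` computes an ordinary sample correlation.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(3)
# X (skill) and Y (friendliness) are generated independently of each other.
people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50_000)]
# Hypothetical selection rule: hired when skill + friendliness exceeds 1.
hired = [(x, y) for x, y in people if x + y > 1.0]

r_all = pearson(*zip(*people))
r_hired = pearson(*zip(*hired))
print(f"correlation, everyone:   {r_all:+.3f}")
print(f"correlation, hired only: {r_hired:+.3f}")
```

Nothing in the generating code links X to Y, so any correlation in the hired subset comes entirely from the filter on C.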

Frontdoor adjustment: identifying a causal effect with unobserved confounding #

Variables: X = smoking (0/1), M = tar exposure (continuous), Y = lung disease (0/1), U = genetic risk (unobserved). DAG: U → X and U → Y (hidden confounding), X → M → Y, and no direct arrow X → Y.

Assume: (i) all causal effect of X on Y goes through M, (ii) no unobserved confounding between X and M, (iii) no unobserved confounding between M and Y given X.

Goal: identify P(Y | do(X=x)) from observational data.

    1. Recognize why backdoor fails.
    • There is a backdoor path X ← U → Y.
    • U is unobserved, so we cannot condition on it.

    Thus no standard backdoor adjustment is available (in this simplified setup).

    2. Check frontdoor conditions.
    • X affects M (X → M) and M affects Y (M → Y): mediator observed.
    • All directed paths from X to Y go through M: satisfied by assumption (no X → Y edge).
    • No unblocked backdoor from X to M: U does not cause M (assumed), so OK.
    • Backdoor paths from M to Y are blocked by conditioning on X: since U affects Y and X but not M, conditioning on X blocks M ← X ← U → Y type paths (under the stated assumptions).
    3. Write the frontdoor formula.

    P(Y | do(X=x)) = ∑_m P(M=m | X=x) ∑_{x'} P(Y | M=m, X=x') P(X=x')

    4. Explain the two-stage averaging.
    • First term P(M | X=x): how changing X changes mediator M (estimable observationally because no confounding between X and M).
    • Second term: compute P(Y | do(M=m)) indirectly by averaging P(Y | M=m, X=x') over the natural distribution of X (this step neutralizes the confounding between X and Y because we are not comparing groups at fixed X; we are using X as a stratifier to learn the M → Y relationship).
    5. Result.

    The causal effect of X on Y is identified despite unobserved U, because the mediator M provides a measurable pathway that “transmits” the effect and can be isolated with the frontdoor logic.

Insight: Frontdoor is a powerful reminder that hidden confounding does not automatically make causal inference impossible. But it replaces “measure confounders” with stronger structural assumptions about mediation and the absence of certain confounders—assumptions you must defend scientifically.
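The frontdoor formula can be verified exactly on a toy version of this example. The sketch below assumes every variable is binary (including M, which the example treats as continuous) and uses invented mechanism probabilities. The analyst's code only touches the joint over (X, M, Y); U is used solely to build that joint and to compute the ground truth for comparison.

```python
from itertools import product

def bern(p1, v):
    """Probability that a Bernoulli(p1) variable takes value v."""
    return p1 if v == 1 else 1.0 - p1

# Hypothetical mechanisms; U is hidden from the analyst.
def p_u(u): return bern(0.5, u)
def p_x(x, u): return bern(0.2 + 0.6 * u, x)             # U -> X
def p_m(m, x): return bern(0.1 + 0.8 * x, m)             # X -> M
def p_y(y, m, u): return bern(0.1 + 0.5 * m + 0.3 * u, y)  # M -> Y, U -> Y

# Observational joint over (X, M, Y), with U summed out.
obs = {}
for u, x, m, y in product((0, 1), repeat=4):
    obs[(x, m, y)] = obs.get((x, m, y), 0.0) + p_u(u) * p_x(x, u) * p_m(m, x) * p_y(y, m, u)

def pr(x=None, m=None, y=None):
    """Marginal probability of the specified values under the observational joint."""
    return sum(p for (xx, mm, yy), p in obs.items()
               if (x is None or xx == x) and (m is None or mm == m) and (y is None or yy == y))

def frontdoor(x):
    """sum_m P(m | x) sum_x' P(Y=1 | m, x') P(x'), using observational quantities only."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = pr(x=x, m=m) / pr(x=x)
        inner = sum(pr(x=xp, m=m, y=1) / pr(x=xp, m=m) * pr(x=xp) for xp in (0, 1))
        total += p_m_given_x * inner
    return total

def truth(x):
    """P(Y=1 | do(X=x)) computed from the full model, including the hidden U."""
    return sum(p_m(m, x) * p_u(u) * p_y(1, m, u) for m, u in product((0, 1), repeat=2))

naive = pr(x=1, y=1) / pr(x=1)
print(f"naive P(Y=1 | X=1)      = {naive:.2f}")
print(f"frontdoor, do(X=1)      = {frontdoor(1):.2f}")
print(f"ground truth, do(X=1)   = {truth(1):.2f}")
```

The frontdoor value matches the ground truth while the naive conditional does not, which is the identification claim of the example in numerical form. The match depends on the assumed structure: adding an arrow from U to M in `p_m` would break it.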

Key Takeaways #

Common Mistakes #

Practice #

easy

Backdoor practice: Consider the DAG Z → X → Y and Z → Y. (a) List all backdoor paths from X to Y. (b) Does S={Z} satisfy the backdoor criterion? (c) Write P(Y | do(X=x)) in terms of observational quantities.

Hint: A backdoor path must start with an arrow into X. In this graph, check X ← Z → Y.

Solution

(a) The backdoor path is X ← Z → Y.

(b) Yes. Z is not a descendant of X, and conditioning on Z blocks X ← Z → Y.

(c) P(Y | do(X=x)) = ∑_z P(Y | X=x, Z=z) P(Z=z).

medium

Collider reasoning: Suppose X → C ← Y and additionally C → D (D is a descendant of the collider). Are X and Y independent given D? Explain using collider logic.

Hint: Conditioning on a collider or any of its descendants opens the path through the collider.

Solution

In X → C ← Y, C is a collider, so the path between X and Y is blocked marginally. But D is a descendant of C. Conditioning on D provides information about C, which effectively conditions on (or partially conditions on) the collider. This opens the path X → C ← Y, inducing an association between X and Y given D. So X and Y are generally not independent given D.

hard

Frontdoor check: You observe X, M, Y with DAG X → M → Y, and there is an unobserved U such that U → X and U → Y. Additionally, suppose there is also an unobserved W such that W → M and W → Y. Is the causal effect of X on Y identifiable by frontdoor? Why or why not?

Hint: Frontdoor requires no unblocked confounding between M and Y given X. Hidden common causes of M and Y break that.

Solution

No, not by the standard frontdoor criterion. The unobserved W creates confounding between M and Y via M ← W → Y. Conditioning on X does not block this backdoor path because the path does not pass through X, and W itself cannot be conditioned on because it is unobserved. Therefore the relationship between M and Y cannot be learned without bias from P(Y | M, X), and the frontdoor formula is not justified.

Connections #

Prerequisites you’re using here: Bayesian Inference and Topological Sort.

Next nodes you’ll likely want:

Quality: A (4.3/5)
