Automatic Differentiation

←Back to Tech Tree

inventorycoverage

Automatic Differentiation #

AlgorithmsDifficulty: ★★★★☆Depth: 1Unlocks: 4

An algorithmic technique to compute exact derivatives of functions expressed as programs by applying the chain rule systematically (forward- or reverse-mode), which underlies efficient backpropagation in modern frameworks. Knowing AD helps in understanding gradient computation costs and implementation choices.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

J_f or D[f] : the Jacobian (matrix of partial derivatives) of function f

Essential Relationships #

Prerequisites (1) #

Computational Graphs6 atoms

Unlocks (1) #

Deep Learninglvl 5

Advanced Learning Details

Graph Position #

11

Depth Cost

4

Fan-Out (ROI)

2

Bottleneck Score

1

Chain Length

Cognitive Load #

5

Atomic Elements

30

Total Elements

L1

Percentile Level

L3

Atomic Level

All Concepts (13) #

Teaching Strategy #

Self-serve tutorial - low prerequisites, straightforward concepts.

You already know computational graphs show how values flow. Automatic differentiation (AD) is what makes those graphs actionable: it turns “a program that computes a number” into “a program that computes that number and its exact derivative”, with predictable computational cost. This is the algorithmic heart of backpropagation.

TL;DR:

Automatic differentiation computes exact derivatives of functions written as programs by systematically applying the chain rule to a trace of primitive operations. Forward-mode propagates tangents (directional derivatives) alongside values and is efficient when inputs are few. Reverse-mode propagates adjoints (sensitivities) backward and is efficient when outputs are few (e.g., scalar loss), which is why it powers backprop in deep learning.

What Is Automatic Differentiation? #

Why AD exists (and why you should care) #

When you want derivatives, you have several options:

Deep learning depends on fast and reliable gradients. If your loss is a scalar ℓ and your parameters are θ, you typically need ∇_θ ℓ. AD gives you those gradients with a cost that you can reason about.

Definition #

Let f be a function implemented as a program:

The Jacobian is

J_f(x) = Df = [∂yᵢ/∂xⱼ] ∈ ℝᵐˣⁿ

Automatic differentiation is an algorithmic technique to compute J_f(x) (or products with it) by decomposing f into a sequence of primitive operations whose derivatives are known, then applying the chain rule systematically along that sequence.

Programs, traces, and primitive operations #

A key AD idea: your code (or computational graph) induces a sequence of intermediate values:

x → v₁ → v₂ → … → v_k → y

Each step is a primitive operation like +, ×, sin, exp, matmul, etc.

Example (scalar):

f(x) = sin(x) + x²

A “trace” might look like:

AD differentiates this trace.

“Exact” but not symbolic #

AD is not algebraic simplification. It does not try to build and simplify a symbolic formula for df/dx. Instead, it evaluates derivatives numerically, step-by-step, using exact local derivative rules.

So the result is exact with respect to the executed operations (subject to floating point), and it naturally supports:

Two main modes #

AD comes in two main propagation styles:

They compute different Jacobian-related objects efficiently:

ModeEfficiently computesTypical use case
ForwardJ_f(x) v (Jacobian-vector product, JVP)Few inputs, many outputs; sensitivity in a direction
Reversewᵀ J_f(x) (vector-Jacobian product, VJP)Many inputs, few outputs (often m = 1)

Deep learning training typically has m = 1 (a scalar loss), and n can be millions of parameters. That asymmetry makes reverse-mode the star.

AD vs “backprop” #

Backpropagation is best understood as reverse-mode AD applied to a computational graph. The deep learning context often adds batching, tensor primitives, and GPU kernels, but the underlying algorithm is reverse-mode AD.

The main learning goal for this node: understand what is being propagated, why the two modes have different cost profiles, and how the chain rule is organized algorithmically.

Core Mechanic 1: Forward-Mode AD (Tangent Propagation) #

Why forward-mode is a good first step #

Forward-mode is conceptually straightforward: as you compute each intermediate value v, you also compute how v changes with respect to a chosen input direction.

Think: “If I nudge the input x a tiny amount along direction u, how does the output change?”

This is a directional derivative:

d/dε f(x + εu) |_{ε=0} = J_f(x) u

Forward-mode computes exactly this quantity in one pass.

Dual numbers intuition (scalar case) #

A classic way to explain forward-mode is dual numbers. Replace a real number x with

x̂ = x + ẋ ε

where ε is a special symbol with ε² = 0 (but ε ≠ 0). Then:

If you initialize ẋ = 1, then the ε coefficient at the end becomes df/dx.

Example: for multiplication,

(a + ȧ ε)(b + ḃ ε)

= ab + (a ḃ + b ȧ) ε

The ε coefficient exactly matches the product rule.

AD implementations don’t need to literally use ε, but the algebra captures the idea: push derivative information forward alongside values.

Forward-mode on a trace #

Suppose a program defines intermediates vᵢ = opᵢ(parents). Forward-mode maintains (vᵢ, v̇ᵢ), where v̇ᵢ is the directional derivative of vᵢ in the chosen input direction.

If vᵢ = g(vⱼ, v_k), then by chain rule:

v̇ᵢ = (∂g/∂vⱼ) v̇ⱼ + (∂g/∂v_k) v̇_k

You can see the pattern: local partial derivatives times incoming tangents.

Vector inputs: JVPs #

For x ∈ ℝⁿ, choose a direction u ∈ ℝⁿ. Initialize input tangents:

= u

Run the program forward; each intermediate carries its tangent. The output tangent is:

= J_f(x) u

This is a Jacobian-vector product (JVP).

If you want the full Jacobian J_f, you could run forward-mode n times with u = eⱼ (standard basis vectors). That costs O(n) forward sweeps.

Cost model (the key reason modes differ) #

Forward-mode cost is roughly:

The big scaling fact:

So forward-mode is attractive when n (inputs) is small, even if m is large.

Practical benefits of forward-mode #

Forward-mode shines in situations like:

It is also valuable in deep learning systems as a building block (for higher-order derivatives and for efficient JVP computation).

Example of forward-mode update rules #

For common scalar primitives (primal v, tangent v̇):

For vectors/matrices, the same principles apply with appropriate linear algebra. For example, if y = A x, then ẏ = A ẋ (a linear map).

Breathing room: what forward-mode is not #

Forward-mode sets the stage: it makes the chain rule feel like a local propagation law. Reverse-mode will use the same local derivatives but run the information in the opposite direction.

Core Mechanic 2: Reverse-Mode AD (Adjoint/Backpropagation) #

Why reverse-mode exists #

In machine learning you often have:

You want ∇_θ ℓ, i.e., all ∂ℓ/∂θⱼ.

Forward-mode would require n separate passes (one per θⱼ direction). Reverse-mode computes all partials in about the cost of a small constant times one evaluation of f.

The central object: adjoints #

Reverse-mode defines, for each intermediate vᵢ, an adjoint (also called sensitivity):

v̄ᵢ = ∂ℓ/∂vᵢ

Read it as: “how much does the final scalar output ℓ change if vᵢ changes?”

This is exactly what backprop stores as “gradients with respect to activations.”

Local chain rule as a pullback #

Suppose an intermediate is computed by

vᵢ = g(vⱼ, v_k)

Given v̄ᵢ, we can distribute it to parents:

v̄ⱼ += v̄ᵢ · ∂vᵢ/∂vⱼ

v̄_k += v̄ᵢ · ∂vᵢ/∂v_k

This is the chain rule written in an accumulation form.

Important: the “+=” is not decoration. If a variable feeds into multiple downstream paths, its total sensitivity is the sum of contributions from each path.

Reverse-mode algorithm sketch #

  1. 1)Forward pass: compute and store all intermediates vᵢ needed to evaluate local derivatives later.
  2. 2)Initialize: set output adjoint ℓ̄ = ∂ℓ/∂ℓ = 1.
  3. 3)Backward pass: traverse operations in reverse topological order, applying local derivative rules to accumulate adjoints to inputs.

At the end, θ̄ = ∂ℓ/∂θ = ∇_θ ℓ.

Why storage matters #

Reverse-mode needs values from the forward pass to compute local derivatives. Example:

If v = sin(a), then ∂v/∂a = cos(a). During backward, you need a (or v) to compute cos(a).

So reverse-mode typically stores (“tapes”) intermediate values.

This leads to a real implementation tradeoff:

StrategyMemoryComputeNotes
Store all intermediatesHighLowStandard backprop
Recompute some intermediates (checkpointing)LowerHigherUseful for very deep nets

Reverse-mode yields VJPs #

For general f: ℝⁿ → ℝᵐ, reverse-mode naturally computes

wᵀ J_f(x)

for a chosen output weighting w ∈ ℝᵐ.

When m = 1 and ℓ = f(x), w = 1 and

1ᵀ J_f(x) = ∇_x

So gradients are a special case of a vector-Jacobian product (VJP).

Cost model (the other key reason modes differ) #

Let C be the cost of evaluating f once.

So reverse-mode is efficient when m (outputs) is small, especially m = 1.

A gentle derivation on a tiny graph #

Consider:

v₁ = x

v₂ = y

v₃ = v₁ v₂

v₄ = sin(v₁)

v₅ = v₃ + v₄

ℓ = v₅

Adjoints:

So:

∂ℓ/∂x = v̄₁ = y + cos(x)

∂ℓ/∂y = v̄₂ = x

This is the chain rule, reorganized so you reuse shared subcomputations efficiently.

Reverse-mode is not “reverse evaluation” #

A subtle but important distinction:

Even if an operation is not invertible (e.g., squaring), reverse-mode still works because it doesn’t need to invert; it needs partial derivatives.

Nondifferentiabilities and subgradients #

Real programs may include primitives like ReLU(x) = max(0, x), which is not differentiable at x = 0.

Frameworks typically define a convention (a subgradient) such as:

d/dx ReLU(x) = 0 if x < 0, 1 if x > 0, and a chosen value at x = 0 (often 0)

AD will propagate according to those primitive rules.

Reverse-mode is the workhorse for deep learning, but remember: it is one mode of a general AD idea. The real power comes from understanding both modes and the Jacobian products they compute.

Application/Connection: Backprop, Jacobian Products, and Implementation Choices #

Backpropagation = reverse-mode AD on a computational graph #

In neural networks, you define a scalar loss:

ℓ = L(model(x; θ), target)

Training needs ∇_θ ℓ. The computation of ℓ is a composition of many layers, each a primitive (or a bundle of primitives). Reverse-mode AD traverses the graph backward, accumulating adjoints.

So when you see “.backward()” in a framework, it’s executing reverse-mode AD.

Why Jacobian products matter more than Jacobians #

In high dimensions, explicitly building J_f is often impossible:

AD avoids this by computing products:

These are enough for many algorithms:

Higher-order derivatives (brief but important) #

If you want second derivatives, you can apply AD to an AD-produced derivative computation.

For scalar ℓ(x):

A common strategy is:

  1. 1)Compute g(x) = ∇ℓ(x) via reverse-mode.
  2. 2)Compute J_g(x) v via forward-mode on g.

This is “forward-over-reverse” and can compute H v efficiently.

Choosing a mode: a practical rule #

Let f: ℝⁿ → ℝᵐ.

So:

GoalBetter modeWhy
Full Jacobian J ∈ ℝᵐˣⁿforward if n ≪ m, reverse if m ≪ nOne sweep per “small side”
Gradient of scalar ℓreversem = 1
JVPforwardcomputes J v directly
VJPreversecomputes wᵀ J directly

AD and control flow #

Real programs include:

AD works by differentiating the executed trace. That means:

This is usually what you want in ML, but it matters for correctness and debugging.

Operator overloading vs source transformation #

Implementations generally fall into two families:

ApproachHow it worksProsCons
Operator overloadingRun code with special number/tensor types that record a tapeEasy to integrate with dynamic languagesRuntime overhead; tracing can be subtle
Source transformationTransform the program (or IR) into a new program that computes derivativesCan be faster; more compile-time optimizationHarder to implement; language/IR constraints

Modern frameworks blend ideas (e.g., tracing to an IR, then compiling).

Numerical stability: AD isn’t a magic fix #

AD avoids finite-difference error, but it cannot fix a numerically unstable primal computation.

Example: if your function involves subtracting nearly equal numbers, the primal has catastrophic cancellation; the derivative will reflect that sensitivity.

So you still need good numerical practices.

Connecting back to deep learning #

Backprop is reverse-mode AD, and its efficiency is why deep learning is feasible at scale.

But understanding AD gives you extra superpowers:

Most importantly: you stop treating gradients as “magic,” and start seeing them as a carefully organized application of the chain rule.

Worked Examples (3) #

Forward-mode AD on a small function (compute a JVP) #

Let f: ℝ² → ℝ² be

f(x₁, x₂) = (y₁, y₂)

where

y₁ = x₁ x₂ + sin(x₁)

y₂ = exp(x₂) + x₁²

Compute the Jacobian-vector product J_f(x) u at x = (1, 0) along direction u = (2, -1).

  1. Write a trace with intermediates:

    v₁ = x₁

    v₂ = x₂

    v₃ = v₁ v₂

    v₄ = sin(v₁)

    v₅ = v₃ + v₄ (this is y₁)

    v₆ = exp(v₂)

    v₇ = v₁²

    v₈ = v₆ + v₇ (this is y₂)

    We will propagate tangents v̇ᵢ.

  2. Initialize primals at x = (1, 0):

    v₁ = 1

    v₂ = 0

    Initialize tangents from u:

    v̇₁ = 2

    v̇₂ = -1

  3. Forward propagate:

    v₃ = v₁ v₂ = 1·0 = 0

    v̇₃ = v̇₁ v₂ + v₁ v̇₂ = 2·0 + 1·(-1) = -1

  4. Next:

    v₄ = sin(v₁) = sin(1)

    v̇₄ = cos(v₁) v̇₁ = cos(1)·2 = 2cos(1)

  5. Add for y₁:

    v₅ = v₃ + v₄ = 0 + sin(1) = sin(1)

    v̇₅ = v̇₃ + v̇₄ = -1 + 2cos(1)

  6. For y₂ part:

    v₆ = exp(v₂) = exp(0) = 1

    v̇₆ = exp(v₂) v̇₂ = 1·(-1) = -1

  7. Square:

    v₇ = v₁² = 1² = 1

    v̇₇ = 2 v₁ v̇₁ = 2·1·2 = 4

  8. Add for y₂:

    v₈ = v₆ + v₇ = 1 + 1 = 2

    v̇₈ = v̇₆ + v̇₇ = -1 + 4 = 3

  9. Collect the JVP:

    = (v̇₅, v̇₈) = ( -1 + 2cos(1), 3 )

Insight: Forward-mode naturally returns J_f(x)u in one sweep, without building J_f. You chose u; AD gave the directional derivative of the entire program along that direction.

Reverse-mode AD on a scalar loss (classic gradient via adjoints) #

Let ℓ(x, y) be

ℓ = (x y + sin x)²

Compute ∂ℓ/∂x and ∂ℓ/∂y using reverse-mode AD, and evaluate at (x, y) = (1, 2).

  1. Build a trace:

    v₁ = x

    v₂ = y

    v₃ = v₁ v₂

    v₄ = sin(v₁)

    v₅ = v₃ + v₄

    v₆ = v₅²

    ℓ = v₆

  2. Forward pass at (1, 2):

    v₁ = 1

    v₂ = 2

    v₃ = 1·2 = 2

    v₄ = sin(1)

    v₅ = 2 + sin(1)

    v₆ = (2 + sin(1))²

  3. Initialize adjoints (bars). Since ℓ = v₆:

    v̄₆ = ∂ℓ/∂v₆ = 1

  4. Backprop through v₆ = v₅²:

    ∂v₆/∂v₅ = 2 v₅

    v̄₅ += v̄₆ · 2 v₅ = 1 · 2(2 + sin(1)) = 2(2 + sin(1))

  5. Backprop through v₅ = v₃ + v₄:

    v̄₃ += v̄₅ · 1 = 2(2 + sin(1))

    v̄₄ += v̄₅ · 1 = 2(2 + sin(1))

  6. Backprop through v₄ = sin(v₁):

    ∂v₄/∂v₁ = cos(v₁) = cos(1)

    v̄₁ += v̄₄ · cos(1)

    = 2(2 + sin(1)) cos(1)

  7. Backprop through v₃ = v₁ v₂:

    ∂v₃/∂v₁ = v₂ = 2

    ∂v₃/∂v₂ = v₁ = 1

    v̄₁ += v̄₃ · v₂

    = 2(2 + sin(1)) · 2

    = 4(2 + sin(1))

    v̄₂ += v̄₃ · v₁

    = 2(2 + sin(1)) · 1

    = 2(2 + sin(1))

  8. Now read off gradients:

    ∂ℓ/∂x = v̄₁ = 4(2 + sin(1)) + 2(2 + sin(1))cos(1)

    ∂ℓ/∂y = v̄₂ = 2(2 + sin(1))

Insight: Reverse-mode computes all input partials of a scalar output in one backward sweep by accumulating adjoints. Notice how v̄₁ receives contributions from two downstream paths (through v₃ and v₄), which is why accumulation is essential.

Mode comparison on the same function: full Jacobian costs #

Consider f: ℝ³ → ℝ². You want the full Jacobian J_f ∈ ℝ²ˣ³ at a point. Compare forward-mode and reverse-mode sweep counts conceptually.

  1. Forward-mode computes one JVP J_f(x) u per forward sweep.

    To get all 3 columns of J_f, run 3 sweeps with u = e₁, e₂, e₃.

    So: ~3 forward sweeps.

  2. Reverse-mode computes one VJP wᵀ J_f(x) per backward sweep (after one forward to record the trace).

    To get all 2 rows of J_f, run 2 backward sweeps with w = e₁ and e₂.

    So: ~2 backward sweeps (plus the shared forward for values/tape).

  3. Conclude: if m < n (2 < 3 here), reverse-mode needs fewer sweeps to form the full Jacobian; if n < m, forward-mode would.

Insight: The rule “forward is good when inputs are few; reverse is good when outputs are few” is not folklore—it comes directly from how many basis directions you must seed to recover the full Jacobian.

Key Takeaways #

Common Mistakes #

Practice #

easy

Let f(x) = exp(x) sin(x). Use forward-mode AD (dual-number style) to compute df/dx at x = 0.

Write the primal and tangent updates step-by-step.

Hint: Trace: v₁ = x, v₂ = exp(v₁), v₃ = sin(v₁), v₄ = v₂ v₃. Initialize v̇₁ = 1.

Show solution

Trace and tangents:

v₁ = x, v̇₁ = 1

At x = 0: v₁ = 0

v₂ = exp(v₁) = exp(0) = 1

v̇₂ = exp(v₁) v̇₁ = 1·1 = 1

v₃ = sin(v₁) = sin(0) = 0

v̇₃ = cos(v₁) v̇₁ = cos(0)·1 = 1

v₄ = v₂ v₃ = 1·0 = 0

v̇₄ = v̇₂ v₃ + v₂ v̇₃ = 1·0 + 1·1 = 1

So df/dx at x = 0 equals v̇₄ = 1.

medium

Reverse-mode practice. Define ℓ(x, y, z) = (x y + z) · sin(y).

Compute ∂ℓ/∂x, ∂ℓ/∂y, ∂ℓ/∂z using reverse-mode AD (show the trace, then adjoint propagation).

Hint: A good trace: v₁=x, v₂=y, v₃=z, v₄=v₁ v₂, v₅=v₄+v₃, v₆=sin(v₂), v₇=v₅ v₆ = ℓ.

Show solution

Trace:

v₁ = x

v₂ = y

v₃ = z

v₄ = v₁ v₂

v₅ = v₄ + v₃

v₆ = sin(v₂)

v₇ = v₅ v₆

ℓ = v₇

Adjoints (initialize): v̄₇ = 1

Backprop v₇ = v₅ v₆:

v̄₅ += v̄₇ · v₆ = 1·v₆ = v₆

v̄₆ += v̄₇ · v₅ = 1·v₅ = v₅

Backprop v₆ = sin(v₂):

v̄₂ += v̄₆ · cos(v₂) = v₅ cos(v₂)

Backprop v₅ = v₄ + v₃:

v̄₄ += v̄₅ · 1 = v₆

v̄₃ += v̄₅ · 1 = v₆

Backprop v₄ = v₁ v₂:

v̄₁ += v̄₄ · v₂ = v₆ v₂

v̄₂ += v̄₄ · v₁ = v₆ v₁

Collect:

∂ℓ/∂x = v̄₁ = y sin(y)

∂ℓ/∂z = v̄₃ = sin(y)

∂ℓ/∂y = v̄₂ = (x y + z) cos(y) + x sin(y)

(Using v₅ = x y + z and v₆ = sin(y).)

hard

Let f: ℝⁿ → ℝᵐ. You need the full Jacobian J_f at a point. Suppose one forward evaluation costs C.

  1. Estimate the cost (in units of C) to compute J_f with forward-mode.

  2. Estimate the cost to compute J_f with reverse-mode.

  3. For which regimes of (n, m) is each preferable?

Hint: Forward-mode gives one column per seeded input basis direction; reverse-mode gives one row per seeded output basis direction (after taping).

Show solution

  1. Forward-mode: one sweep gives J_f(x) u. To get all columns, run with u = e₁,…,eₙ. Cost ≈ n·C (up to a small constant factor).

  2. Reverse-mode: one backward sweep gives wᵀ J_f(x). To get all rows, run with w = e₁,…,eₘ. Cost ≈ m·C plus the cost of the forward pass to build/store the trace (often counted once). So ≈ (m+1)·C times constants.

  3. Prefer forward-mode when n ≪ m (few inputs, many outputs). Prefer reverse-mode when m ≪ n (few outputs, many inputs). In ML with scalar loss, m = 1 and n is huge, so reverse-mode is strongly preferred.

Connections #

Next steps and related nodes:

Quality: A (4.5/5)

← back to treebrowse all →