Backpropagation

←Back to Tech Tree

inventorycoverage

Backpropagation #

Machine LearningDifficulty: ★★★★☆Depth: 11Unlocks: 4

Computing gradients efficiently via chain rule through network.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

delta_j := dL/dz_j (node upstream gradient / sensitivity)

Essential Relationships #

Prerequisites (2) #

Neural Networks6 atomsMultivariable Chain Rule6 atoms

Unlocks (1) #

Deep Learninglvl 5

Advanced Learning Details

Graph Position #

147

Depth Cost

4

Fan-Out (ROI)

2

Bottleneck Score

11

Chain Length

Cognitive Load #

5

Atomic Elements

39

Total Elements

L2

Percentile Level

L3

Atomic Level

All Concepts (15) #

Teaching Strategy #

Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.

Training a neural network is mostly one question repeated millions of times: “If I nudge this weight a tiny bit, how does the loss change?” Backpropagation is the algorithmic answer—an efficient way to compute all those derivatives without redoing work for every parameter.

TL;DR:

Backpropagation computes ∂L/∂(parameters) efficiently by (1) viewing the network as a computation graph, (2) running a forward pass to cache intermediate values, then (3) running a backward pass that propagates sensitivities δⱼ := ∂L/∂zⱼ from outputs to inputs using node-local partial derivatives, summing gradient contributions where paths merge.

What Is Backpropagation? #

The goal: gradients for learning #

A neural network learns by adjusting parameters (weights and biases) to reduce a loss L. Gradient-based optimization methods (like SGD, Momentum, Adam) need derivatives such as ∂L/∂W and ∂L/∂b.

If you tried to compute these derivatives “from scratch” for each parameter independently, you would do a huge amount of repeated work. A modern network can have millions or billions of parameters, and the forward computation reuses intermediate activations across many parameters. We want the same kind of reuse when differentiating.

Backpropagation is exactly that: a systematic reuse of intermediate derivative calculations.

The key viewpoint: computation graphs #

Any forward computation can be represented as a directed acyclic graph (DAG) where:

Example (conceptually):

Even if you write this as “layers,” it’s still a composition of functions, which is what the chain rule is about.

Sensitivities (upstream gradients) #

A central symbol in backprop is the sensitivity of the loss to a node’s pre-activation (or any intermediate variable). For a scalar loss L and a node value zⱼ, define:

δⱼ := ∂L/∂zⱼ

Think of δ as answering: “If zⱼ changes a tiny bit, how much does the loss change?”

Two atomic ideas (the whole algorithm) #

Backpropagation is built from two atomic concepts:

  1. Node-local backward rule

A node converts an incoming loss gradient (upstream gradient) into gradients for its inputs using only its local partial derivatives.

If a node computes:

z = f(x)

and you know the upstream gradient ∂L/∂z, then the chain rule gives:

∂L/∂x = ∂L/∂z · ∂z/∂x

This is “local”: it only needs the node’s own derivative ∂z/∂x and the incoming gradient.

  1. Reverse traversal + summation at merges

You apply these node-local rules in reverse topological order (from outputs back to inputs). If multiple forward paths feed into the same variable, their gradient contributions add.

This “add at merges” is not a special trick—it is exactly the multivariable chain rule.

Why it’s efficient #

Suppose your network computes L in O(C) time (C is the cost of the forward pass). Backprop computes all parameter gradients in about O(C) additional time (often a small constant multiple).

In contrast, computing a gradient for each parameter via finite differences would take O(P·C) where P is the number of parameters. That’s infeasible.

What backprop is not #

The big idea to keep in your head: Forward pass computes values. Backward pass computes sensitivities, reusing cached values and local derivatives.

Core Mechanic 1: The Node-Local Backward Rule (Chain Rule in “Message Passing” Form) #

Why start local? #

A neural network is built from small operations: add, multiply, matrix multiply, activation functions, normalization, etc. If we can differentiate each operation locally, we can differentiate the whole network by composition.

This makes backprop feel like message passing:

Single-input case #

Let a node compute:

z = f(x)

Loss is L(z). By the chain rule:

∂L/∂x = ∂L/∂z · ∂z/∂x

Written as a backward rule:

This is the simplest “local rule.”

Multi-input case: the true neural-network setting #

Most nodes have multiple inputs. Suppose:

z = f(x, y)

Then:

∂L/∂x = ∂L/∂z · ∂z/∂x

∂L/∂y = ∂L/∂z · ∂z/∂y

Same upstream gradient ∂L/∂z fans out to each input, multiplied by the appropriate local partial.

Merge case: when multiple paths contribute #

Now suppose a variable x influences L through two different children nodes, z₁ and z₂:

z₁ = f₁(x)

z₂ = f₂(x)

L = g(z₁, z₂)

Then x affects L through both routes. Multivariable chain rule gives:

∂L/∂x = (∂L/∂z₁)(∂z₁/∂x) + (∂L/∂z₂)(∂z₂/∂x)

This is the “sum contributions at merges” rule.

A tiny computation-graph example (scalar) #

Let:

Forward:

Backward (local rules):

  1. L = v² ⇒ ∂L/∂v = 2v

  2. v = u + y ⇒ ∂v/∂u = 1, ∂v/∂y = 1

So ∂L/∂u = ∂L/∂v · 1

And contribution to ∂L/∂y from this path is ∂L/∂v · 1

  1. u = x · y ⇒ ∂u/∂x = y, ∂u/∂y = x

So ∂L/∂x = ∂L/∂u · y

And contribution to ∂L/∂y from this path is ∂L/∂u · x

Finally, add the two contributions to ∂L/∂y:

∂L/∂y = (from v) + (from u)

= (∂L/∂v · 1) + (∂L/∂u · x)

Notice how “sum at merge” naturally appears because y is used in two places.

Vector/matrix shapes: the practical reality #

Neural nets use vector-valued activations and matrix multiplies. The same local-rule logic holds, but you must respect shapes.

A very common pattern:

z = Wx + b

Where:

Given upstream gradient g = ∂L/∂z (same shape as z), the local backward rules are:

∂L/∂b = g

∂L/∂W = g x

∂L/∂x = Wᵀ g

These formulas are not “memorize-only.” They come from componentwise chain rule. For example, for W:

Let zᵢ = ∑ⱼ Wᵢⱼ xⱼ + bᵢ

Then:

∂L/∂Wᵢⱼ = ∑ₖ (∂L/∂zₖ)(∂zₖ/∂Wᵢⱼ)

But zₖ depends on Wᵢⱼ only when k = i, so:

∂zₖ/∂Wᵢⱼ = xⱼ if k = i, else 0

So:

∂L/∂Wᵢⱼ = (∂L/∂zᵢ) xⱼ

In matrix form that is g xᵀ.

What to cache from the forward pass #

Node-local derivatives typically depend on forward values (like x for σ′(x), or x for g****xᵀ).

So the forward pass doesn’t just compute outputs; it stores the needed intermediates (often called “activations” and sometimes “pre-activations”). Backprop then uses them.

A useful mental rule:

This is why training uses more memory than inference.

Summary table: common node-local backward rules #

Forward opForward definitionUpstream gradientBackward outputs
Addz = x + y∂L/∂z∂L/∂x = ∂L/∂z, ∂L/∂y = ∂L/∂z
Multiplyz = x·y∂L/∂z∂L/∂x = (∂L/∂z)·y, ∂L/∂y = (∂L/∂z)·x
Affinez = Wx + bg = ∂L/∂z∂L/∂W = g****xᵀ, ∂L/∂b = g, ∂L/∂x = Wᵀg
ReLUa = max(0, z)∂L/∂a∂L/∂z = (∂L/∂a) ⊙ 𝟙[z>0]
Sigmoida = σ(z)∂L/∂a∂L/∂z = (∂L/∂a) ⊙ a ⊙ (1−a)
Tanha = tanh(z)∂L/∂a∂L/∂z = (∂L/∂a) ⊙ (1−a²)

(⊙ is elementwise product.)

These are “node-local rules”: upstream gradient in, multiply by local derivative, produce gradients out.

Core Mechanic 2: Reverse Traversal of the Computation Graph (and Why We Sum at Merges) #

Why reverse order? #

In the forward pass, you compute prerequisites first: inputs → intermediate values → loss.

In the backward pass, you need ∂L/∂(something). But ∂L/∂(something) depends on how that “something” affects later computations. So you must start at L and propagate backward: loss → outputs → earlier intermediates → parameters.

This is exactly reverse topological order on the computation graph.

A clean statement of the algorithm #

Given a computation graph that produces scalar loss L:

  1. Forward pass
  1. Initialize the backward pass
  1. Reverse traversal
  1. Read off parameter gradients

Why do gradients add at merges? (A derivation) #

Suppose x feeds into two children nodes z₁ and z₂, and both influence L.

Let:

z₁ = f₁(x)

z₂ = f₂(x)

L = g(z₁, z₂)

We can derive ∂L/∂x explicitly:

∂L/∂x = (∂L/∂z₁)(∂z₁/∂x) + (∂L/∂z₂)(∂z₂/∂x)

Steps:

∂L/∂x = ∂g/∂z₁ · ∂z₁/∂x + ∂g/∂z₂ · ∂z₂/∂x

Recognize ∂g/∂z₁ = ∂L/∂z₁ and ∂g/∂z₂ = ∂L/∂z₂.

So if x has multiple outgoing edges in the forward graph, it has multiple incoming gradient contributions in the backward graph, which must be summed.

“δ” notation and layerwise backprop #

In neural networks, it’s common to distinguish:

For layer ℓ:

zˡ = Wˡ aˡ⁻¹ + bˡ

aˡ = φ(zˡ)

Define the layer sensitivity:

δˡ := ∂L/∂zˡ

This is exactly your node-local symbol δⱼ := ∂L/∂zⱼ, but grouped by layer and vectorized.

Now derive the standard backprop identities.

Step 1: Backprop through the activation #

We have aˡ = φ(zˡ). So elementwise:

∂L/∂zˡ = ∂L/∂aˡ ⊙ φ′(zˡ)

So:

δˡ = (∂L/∂aˡ) ⊙ φ′(zˡ)

Step 2: Backprop through the affine transform #

zˡ = Wˡ aˡ⁻¹ + bˡ.

Given δˡ = ∂L/∂zˡ, we get:

∂L/∂Wˡ = δˡ (aˡ⁻¹)ᵀ

∂L/∂bˡ = δˡ

∂L/∂aˡ⁻¹ = (Wˡ)ᵀ δˡ

Now plug Step 2 into Step 1 for the previous layer:

δˡ⁻¹ = (∂L/∂aˡ⁻¹) ⊙ φ′(zˡ⁻¹)

= ((Wˡ)ᵀ δˡ) ⊙ φ′(zˡ⁻¹)

That is the classic recurrence.

Where the loss gradient starts #

The backward pass starts from the loss. Two common cases:

  1. Regression with MSE

If L = ½‖aᴸ − y‖², then:

∂L/∂aᴸ = aᴸ − y

Then δᴸ = (∂L/∂aᴸ) ⊙ φ′(zᴸ).

  1. Classification with softmax + cross-entropy

A very common final layer is:

p = softmax(z)

L = −∑ᵢ yᵢ log pᵢ

A key simplification (worth remembering) is:

∂L/∂z = py

So the starting sensitivity at the logits is just prediction minus label.

Computational story: forward is evaluation, backward is accumulation #

A useful way to imagine backprop in code:

This is reverse-mode automatic differentiation (reverse-mode AD). Backpropagation is reverse-mode AD specialized to neural nets.

Quick comparison: forward-mode vs reverse-mode #

ModeComputesBest whenCost intuition
Forward-mode ADdirectional derivatives (Jacobian-vector products)few inputs, many outputsroughly scales with #inputs
Reverse-mode AD (backprop)gradients of scalar loss (vector-Jacobian products)many parameters, scalar lossroughly scales with #outputs (≈1)

Neural nets have huge input dimension in parameter space (millions of weights) and a single scalar loss, so reverse-mode is ideal.

Application/Connection: Backprop in Real Neural Networks (Batches, Modules, and Practical Gotchas) #

From single example to minibatches #

In practice, we rarely train on one example at a time. With a minibatch of size B:

Suppose X ∈ ℝ^{n×B} and Z = WX + b, where b is broadcast across columns.

Let G = ∂L/∂Z ∈ ℝ^{m×B}.

Then the batched gradients are:

∂L/∂W = G X

∂L/∂b = ∑_{k=1}^{B} G[:, k]

∂L/∂X = Wᵀ G

Notice:

Modular backprop: layers as reusable nodes #

Deep learning frameworks implement layers/modules that each provide:

This is exactly the node-local backward rule, just grouped into a bigger node.

Common modules:

Even when the module is complicated, its backward pass follows the same idea:

Why caching matters (time vs memory) #

Backprop needs forward intermediates. For deep nets, storing every activation can be expensive.

Two practical strategies:

  1. Store everything (standard)
  1. Checkpointing / recomputation

Exploding/vanishing gradients: a backprop perspective #

Backprop multiplies many Jacobians together. For a deep chain:

∂L/∂x = (J₁ᵀ J₂ᵀ … Jₖᵀ) ∂L/∂z

If the norms of these Jacobians are often < 1, gradients shrink (vanish). If often > 1, they grow (explode).

This is why choices like:

matter so much. They change the effective Jacobians that backprop multiplies.

Residual connections and “sum at merges” #

A residual block often looks like:

h = x + F(x)

In backprop:

∂L/∂x = ∂L/∂h · ∂h/∂x

But h depends on x through two paths: identity and F.

So:

∂L/∂x = ∂L/∂h · I + (∂L/∂h) · ∂F/∂x

Meaning:

∂L/∂x = ∂L/∂h + (∂L/∂h)·∂F/∂x

The identity path contributes a clean gradient term, which helps gradients flow even if ∂F/∂x becomes small.

Backprop as the engine behind “Deep Learning” #

Deep architectures (CNNs, RNNs, Transformers) are essentially large computation graphs with repeated motifs.

Backprop is what makes them trainable at scale:

Once you are comfortable with δ propagation and reverse traversal, you’re ready to reason about:

That is the bridge to the next node: deeper architectures and the systems that train them.

Worked Examples (3) #

Example 1: Scalar computation graph with a merge (explicit δ propagation) #

Let x and y be scalars. Define:

  1. u = x·y

  2. v = u + y

  3. L = v²

Compute ∂L/∂x and ∂L/∂y using backprop (node-local rules + reverse traversal).

  1. Forward pass:

    • •u = x·y
    • •v = u + y
    • •L = v²
  2. Initialize backward:

    • •∂L/∂L = 1

    Work from L backward.

  3. Backprop through L = v²:

    • •∂L/∂v = 2v
  4. Backprop through v = u + y:

    Local partials: ∂v/∂u = 1, ∂v/∂y = 1

    So:

    • •∂L/∂u = ∂L/∂v · 1 = 2v
    • •Contribution to ∂L/∂y (path via v) is: (∂L/∂v)·1 = 2v
  5. Backprop through u = x·y:

    Local partials: ∂u/∂x = y, ∂u/∂y = x

    So:

    • •∂L/∂x = ∂L/∂u · y = (2v)·y
    • •Contribution to ∂L/∂y (path via u) is: ∂L/∂u · x = (2v)·x
  6. Sum contributions at the merge for y:

    • •∂L/∂y = (2v) + (2v·x)

    = 2v(1 + x)

    And ∂L/∂x = 2v·y, where v = u + y = xy + y = y(x+1).

Insight: The only global coordination is the reverse traversal and the “add at merges” rule. Everything else is local: multiply the upstream gradient by a local derivative.

Example 2: One hidden-layer network (vectorized) and the classic δ recurrence #

Consider a 2-layer (one hidden layer) network for a single example:

Loss: L = ½‖a² − y‖²

Derive ∂L/∂W², ∂L/∂b², ∂L/∂W¹, ∂L/∂b¹ using δ notation.

  1. Start at the loss:

    L = ½‖a² − y‖²

    So:

    ∂L/∂a² = a² − y

  2. Output is identity: a² = z²

    Therefore:

    δ² := ∂L/∂z² = ∂L/∂a² = a² − y

  3. Backprop through z² = W²a¹ + b² using the affine rules with upstream gradient δ²:

    ∂L/∂W² = δ² (a¹)ᵀ

    ∂L/∂b² = δ²

    ∂L/∂a¹ = (W²)ᵀ δ²

  4. Backprop through a¹ = φ(z¹):

    δ¹ := ∂L/∂z¹ = (∂L/∂a¹) ⊙ φ′(z¹)

    So:

    δ¹ = ((W²)ᵀ δ²) ⊙ φ′(z¹)

  5. Backprop through z¹ = W¹x + b¹:

    ∂L/∂W¹ = δ¹ x

    ∂L/∂b¹ = δ¹

    And (if needed) ∂L/∂x = (W¹)ᵀ δ¹

Insight: Backprop for layered networks is just repeated application of two templates: (1) affine backward, (2) activation backward. The δ vectors store exactly the sensitivities you need to reuse.

Example 3: Softmax + cross-entropy gives δ = **p** − **y** #

Let logits z ∈ ℝ^K. Softmax probabilities: pᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ).

Cross-entropy loss with one-hot y: L = −∑ᵢ yᵢ log pᵢ.

Show that ∂L/∂z = py.

  1. Rewrite the loss in a convenient form.

    Since log pᵢ = zᵢ − log(∑ⱼ exp(zⱼ)), we have:

    L = −∑ᵢ yᵢ[zᵢ − log(∑ⱼ exp(zⱼ))]

    = −∑ᵢ yᵢ zᵢ + (∑ᵢ yᵢ) log(∑ⱼ exp(zⱼ))

  2. For one-hot (or any distribution label), ∑ᵢ yᵢ = 1, so:

    L = −∑ᵢ yᵢ zᵢ + log(∑ⱼ exp(zⱼ))

  3. Differentiate w.r.t. z_k:

    ∂/∂z_k [−∑ᵢ yᵢ zᵢ] = −y_k

  4. Differentiate the log-sum-exp term:

    Let S = ∑ⱼ exp(zⱼ)

    Then:

    ∂/∂z_k [log S] = (1/S) · ∂S/∂z_k

    And:

    ∂S/∂z_k = exp(z_k)

    So:

    ∂/∂z_k [log(∑ⱼ exp(zⱼ))] = exp(z_k) / ∑ⱼ exp(zⱼ) = p_k

  5. Combine terms:

    ∂L/∂z_k = p_k − y_k

    Stacking all k gives:

    ∂L/∂z = py

Insight: This simplification is why softmax + cross-entropy is so popular: it produces a clean starting sensitivity at the logits, avoiding an extra Jacobian factor.

Key Takeaways #

Common Mistakes #

Practice #

easy

Scalar backprop practice: Let a = x + y, b = x·a, L = b². Compute ∂L/∂x and ∂L/∂y in terms of x and y.

Hint: Work backward: L→b→(x,a) and a→(x,y). Remember that x influences b both directly and through a, so you must add contributions for ∂L/∂x.

Show solution

Forward:

a = x + y

b = x·a = x(x+y)

L = b²

Backward:

∂L/∂b = 2b

For b = x·a:

∂b/∂x = a, ∂b/∂a = x

So contributions:

∂L/∂x (via b direct) = (2b)·a

∂L/∂a = (2b)·x

For a = x + y:

∂a/∂x = 1, ∂a/∂y = 1

So:

Contribution to ∂L/∂x (via a) = ∂L/∂a · 1 = (2b)x

∂L/∂y = ∂L/∂a · 1 = (2b)x

Add for x:

∂L/∂x = 2b·a + 2b·x = 2b(a + x)

Substitute a = x+y and b = x(x+y):

∂L/∂x = 2x(x+y)[(x+y)+x] = 2x(x+y)(2x+y)

∂L/∂y = 2x·b = 2x·x(x+y) = 2x²(x+y)

medium

Affine + ReLU layer: z = Wx + b, a = ReLU(z). Suppose the upstream gradient is g = ∂L/∂a. Derive expressions for ∂L/∂W, ∂L/∂b, and ∂L/∂x using g and cached z.

Hint: First convert g into δ = ∂L/∂z by multiplying by the ReLU mask 𝟙[z>0]. Then use affine backward rules.

Show solution

ReLU backward (elementwise):

δ := ∂L/∂z = g ⊙ 𝟙[z>0]

Affine backward with upstream δ:

∂L/∂W = δ x

∂L/∂b = δ

∂L/∂x = Wᵀ δ

hard

Show the merge-sum rule in vector form: Let h = x + F(x) where F: ℝ^n → ℝ^n is differentiable. Given upstream gradient u = ∂L/∂h, express ∂L/∂x in terms of u and the Jacobian J_F = ∂F/∂x.

Hint: Treat h as two paths from x: identity and F. The derivative of identity is I. Use vector-Jacobian products appropriate for reverse-mode.

Show solution

We have h(x) = x + F(x).

The Jacobian is:

h/∂x = I + J_F

Reverse-mode uses a vector-Jacobian product:

∂L/∂x = (∂h/∂x)ᵀ u

= (I + J_F)ᵀ u

= Iᵀ u + J_Fᵀ u

= u + J_Fᵀ u

This is exactly “sum contributions at the merge”: the identity path contributes u, and the F path contributes J_Fᵀ u.

Connections #

Unlocks: Deep Learning

Related prerequisite refreshers:

Next good companions (often adjacent in a tech tree):

Quality: A (4.5/5)

← back to treebrowse all →