Multivariable Chain Rule

←Back to Tech Tree

inventorycoverage

Multivariable Chain Rule #

CalculusDifficulty: ★★★☆☆Depth: 6Unlocks: 5

Derivatives of composed functions with multiple variables.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

D f (the Jacobian matrix / derivative of f)o (function composition)

Essential Relationships #

Prerequisites (2) #

Gradients5 atomsDerivative Rules5 atoms

Unlocks (1) #

Backpropagationlvl 4

Advanced Learning Details

Graph Position #

56

Depth Cost

5

Fan-Out (ROI)

2

Bottleneck Score

6

Chain Length

Cognitive Load #

6

Atomic Elements

24

Total Elements

L0

Percentile Level

L4

Atomic Level

All Concepts (9) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

In single-variable calculus, the chain rule is a one-line formula. In multivariable calculus, it’s the same idea—“derivatives multiply along a composition”—but the objects are linear maps, so the multiplication becomes matrix multiplication. Once you truly see that, backpropagation stops feeling like magic.

TL;DR:

For a composition h=f∘gh = f \circ gh=f∘g where g:Rn→Rmg: \mathbb{R}^n \to \mathbb{R}^mg:Rn→Rm and f:Rm→Rpf: \mathbb{R}^m \to \mathbb{R}^pf:Rm→Rp, the multivariable chain rule is

D(f∘g)(x)=Df(g(x)) Dg(x).D(f \circ g)(\mathbf{x}) = Df(g(\mathbf{x})), Dg(\mathbf{x}).D(f∘g)(x)=Df(g(x))Dg(x).

Interpretation: a small input perturbation Δx\Delta \mathbf{x}Δx is pushed forward by DgDgDg into Δu\Delta \mathbf{u}Δu, then pushed forward by DfDfDf into Δy\Delta \mathbf{y}Δy. For scalar output p=1p=1p=1, gradients pull back via transpose: ∇h(x)=Dg(x)⊤∇f(g(x))\nabla h(\mathbf{x}) = Dg(\mathbf{x})^\top \nabla f(g(\mathbf{x}))∇h(x)=Dg(x)⊤∇f(g(x)).

What Is the Multivariable Chain Rule? #

Why you need a new-looking chain rule #

In 1D, composition looks like h(x)=f(g(x))h(x) = f(g(x))h(x)=f(g(x)) and the chain rule says h′(x)=f′(g(x)) g′(x)h'(x) = f'(g(x)), g'(x)h′(x)=f′(g(x))g′(x).

In multiple dimensions, you still compose functions, but now each function can take multiple inputs and produce multiple outputs. The “derivative” is no longer a single number; it’s a linear map that best approximates the function near a point. Linear maps are represented by matrices, so “multiply the derivatives” becomes “multiply the Jacobian matrices.”

This is the conceptual core:

The composition structure (track shapes) #

The cleanest multivariable chain rule is written with a clear inner/outer structure:

Let:

A quick “shape table” you should get used to:

ObjectMeaningShape
xinputn×1
u = g(x)intermediatem×1
y = f(u)outputp×1
Dg(x)Dg(\mathbf{x})Dg(x)Jacobian of g at xm×n
Df(u)Df(\mathbf{u})Df(u)Jacobian of f at up×m
D(f∘g)(x)D(f\circ g)(\mathbf{x})D(f∘g)(x)Jacobian of compositionp×n

Notice the only multiplication that makes sense dimensionally is:

(p×m) (m×n)=(p×n).(p\times m),(m\times n) = (p\times n).(p×m)(m×n)=(p×n).

That is already a big part of the multivariable chain rule: the shapes force the correct order.

The chain rule (matrix form) #

The multivariable chain rule states:

D(f∘g)(x)=Df(g(x)) Dg(x).D(f \circ g)(\mathbf{x}) = Df(g(\mathbf{x})), Dg(\mathbf{x}).D(f∘g)(x)=Df(g(x))Dg(x).

Read it as: “first apply the derivative of ggg at x, then apply the derivative of fff at the resulting point g(x)g(\mathbf{x})g(x).”

What it means geometrically (tiny perturbations) #

If you perturb the input by a small vector Δx\Delta \mathbf{x}Δx, then:

  1. Push forward through g:

Δu≈Dg(x) Δx.\Delta \mathbf{u} \approx Dg(\mathbf{x}), \Delta \mathbf{x}.Δu≈Dg(x)Δx.

  1. Push forward through f:

Δy≈Df(u) Δu.\Delta \mathbf{y} \approx Df(\mathbf{u}), \Delta \mathbf{u}.Δy≈Df(u)Δu.

Combine them:

Δy≈Df(g(x)) Dg(x) Δx.\Delta \mathbf{y} \approx Df(g(\mathbf{x})), Dg(\mathbf{x}), \Delta \mathbf{x}.Δy≈Df(g(x))Dg(x)Δx.

So the Jacobian of the composition is the matrix that takes Δx\Delta \mathbf{x}Δx directly to Δy\Delta \mathbf{y}Δy. That matrix is the product Df DgDf,DgDfDg.

A note on notation: Df vs ∇f #

You said you know gradients already. The key is to keep these distinct:

When p=1p=1p=1, the Jacobian of fff is a 1×m row vector; the gradient is usually written as an m×1 column vector. They are transposes of each other (depending on convention):

Df(u) is 1×m,∇f(u) is m×1.Df(\mathbf{u}) \text{ is } 1\times m, \qquad \nabla f(\mathbf{u}) \text{ is } m\times 1.Df(u) is 1×m,∇f(u) is m×1.

This transpose issue matters a lot in backprop, so we’ll be explicit about it later.

Core Mechanic 1: Derivative as Best Linear Approximation (and the Jacobian) #

Why this viewpoint #

If you treat multivariable derivatives as “a bunch of partial derivatives,” you can still compute things, but it’s easy to lose track of structure.

If you treat the derivative as “the best linear map near a point,” everything becomes systematic:

The linear approximation definition #

Let g:Rn→Rmg: \mathbb{R}^n \to \mathbb{R}^mg:Rn→Rm. We say ggg is differentiable at x if there exists a linear map L:Rn→RmL: \mathbb{R}^n \to \mathbb{R}^mL:Rn→Rm such that

g(x+Δx)≈g(x)+L Δxg(\mathbf{x} + \Delta \mathbf{x}) \approx g(\mathbf{x}) + L,\Delta \mathbf{x}g(x+Δx)≈g(x)+LΔx

with an error that becomes negligible compared to ∥Δx∥|\Delta \mathbf{x}|∥Δx∥ as Δx→0\Delta \mathbf{x} \to 0Δx→0.

That linear map is the derivative Dg(x)Dg(\mathbf{x})Dg(x).

Jacobian matrix: how the linear map is represented #

Write g(x)=[g1(x)⋮gm(x)]g(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \ \vdots \ g_m(\mathbf{x}) \end{bmatrix}g(x)=​g1​(x)⋮gm​(x)​​.

Then the Jacobian Dg(x)Dg(\mathbf{x})Dg(x) is the m×n matrix:

Dg(x)=[∂g1∂x1⋯∂g1∂xn⋮⋱⋮∂gm∂x1⋯∂gm∂xn].Dg(\mathbf{x}) = \begin{bmatrix}
\frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \
\vdots & \ddots & \vdots \
\frac{\partial g_m}{\partial x_1} & \cdots & \frac{\partial g_m}{\partial x_n}
\end{bmatrix}.Dg(x)=​∂x1​∂g1​​⋮∂x1​∂gm​​​⋯⋱⋯​∂xn​∂g1​​⋮∂xn​∂gm​​​​.

This is exactly the matrix that maps a small input perturbation to an approximate output perturbation:

Δu≈Dg(x) Δx.\Delta \mathbf{u} \approx Dg(\mathbf{x}), \Delta \mathbf{x}.Δu≈Dg(x)Δx.

Interactive-canvas mental model (shape-tracking + arrows) #

Imagine an “interactive canvas” with three boxes:

  1. x-box: x ∈ ℝⁿ

  2. u-box: u = g(x) ∈ ℝᵐ

  3. y-box: y = f(u) ∈ ℝᵖ

Now add two kinds of arrows:

Arrow type A: pushforward of perturbations (forward mode) #

You draw a little arrow Δx\Delta \mathbf{x}Δx at the input.

So perturbations flow with the function direction.

Arrow type B: pullback of sensitivities / gradients (reverse mode) #

If the final output is scalar (p=1), you draw a gradient arrow at the output: ∇yℓ\nabla_{\mathbf{y}} \ell∇y​ℓ (often just 1 if the scalar is the loss itself).

Then gradients flow backwards through transposes:

This is the heart of backprop. In this lesson, we’re building the “transpose reflex”:

A small but crucial convention check #

There are two common conventions:

  1. Jacobian is m×n (outputs by inputs). Gradients are column vectors.

  2. Jacobian is n×m (inputs by outputs). Gradients are row vectors.

We’ll use the most common ML convention:

With that convention, the chain rule for scalar output becomes a clean transpose pullback (we’ll derive it soon).

Core Mechanic 2: Chain Rule = Composition of Linear Maps (Matrix Multiplication) #

Why matrix multiplication appears #

The multivariable chain rule is not a new rule you memorize. It is a consequence of one fact:

If you approximate each function by a linear map near the relevant point, then approximating the composition means composing those linear maps.

And linear maps compose by matrix multiplication.

Derivation (showing the work) #

Let g:Rn→Rmg: \mathbb{R}^n \to \mathbb{R}^mg:Rn→Rm and f:Rm→Rpf: \mathbb{R}^m \to \mathbb{R}^pf:Rm→Rp.

Define:

Start with a small perturbation Δx\Delta \mathbf{x}Δx.

Step 1: Linearize g at x

g(x+Δx)≈g(x)+Dg(x) Δx.g(\mathbf{x}+\Delta\mathbf{x}) \approx g(\mathbf{x}) + Dg(\mathbf{x}),\Delta\mathbf{x}.g(x+Δx)≈g(x)+Dg(x)Δx.

Let Δu=Dg(x) Δx\Delta \mathbf{u} = Dg(\mathbf{x}),\Delta\mathbf{x}Δu=Dg(x)Δx, so

g(x+Δx)≈u+Δu.g(\mathbf{x}+\Delta\mathbf{x}) \approx \mathbf{u} + \Delta\mathbf{u}.g(x+Δx)≈u+Δu.

Step 2: Linearize f at u

f(u+Δu)≈f(u)+Df(u) Δu.f(\mathbf{u}+\Delta\mathbf{u}) \approx f(\mathbf{u}) + Df(\mathbf{u}),\Delta\mathbf{u}.f(u+Δu)≈f(u)+Df(u)Δu.

Substitute Δu=Dg(x) Δx\Delta \mathbf{u} = Dg(\mathbf{x}),\Delta\mathbf{x}Δu=Dg(x)Δx:

f(u+Δu)≈f(u)+Df(u) Dg(x) Δx.f(\mathbf{u}+\Delta\mathbf{u}) \approx f(\mathbf{u}) + Df(\mathbf{u}),Dg(\mathbf{x}),\Delta\mathbf{x}.f(u+Δu)≈f(u)+Df(u)Dg(x)Δx.

But the left side is approximately

f(g(x+Δx))=(f∘g)(x+Δx).f(g(\mathbf{x}+\Delta\mathbf{x})) = (f\circ g)(\mathbf{x}+\Delta\mathbf{x}).f(g(x+Δx))=(f∘g)(x+Δx).

So we have the linear approximation:

(f∘g)(x+Δx)≈(f∘g)(x)+(Df(g(x)) Dg(x)) Δx.(f\circ g)(\mathbf{x}+\Delta\mathbf{x}) \approx (f\circ g)(\mathbf{x}) + \bigl(Df(g(\mathbf{x})),Dg(\mathbf{x})\bigr),\Delta\mathbf{x}.(f∘g)(x+Δx)≈(f∘g)(x)+(Df(g(x))Dg(x))Δx.

By uniqueness of the best linear approximation, the derivative must be:

D(f∘g)(x)=Df(g(x)) Dg(x).D(f \circ g)(\mathbf{x}) = Df(g(\mathbf{x})),Dg(\mathbf{x}).D(f∘g)(x)=Df(g(x))Dg(x).

That’s the multivariable chain rule.

Element-wise chain rule (path-tracing through indices) #

Sometimes you want the version that looks like “sum over paths.”

Let h=f∘gh = f \circ gh=f∘g with components:

Then for each output component i and input component j:

∂hi∂xj(x)=∑k=1m∂fi∂uk(u) ∂gk∂xj(x),u=g(x).\frac{\partial h_i}{\partial x_j}(\mathbf{x}) = \sum_{k=1}^{m} \frac{\partial f_i}{\partial u_k}(\mathbf{u}),\frac{\partial g_k}{\partial x_j}(\mathbf{x}),\quad \mathbf{u}=g(\mathbf{x}).∂xj​∂hi​​(x)=k=1∑m​∂uk​∂fi​​(u)∂xj​∂gk​​(x),u=g(x).

This is literally matrix multiplication in coordinates.

Path-tracing interpretation:

“Computational graph” view (interactive canvas) #

Think of a small graph:

For our two-layer composition:

x →(g)→ u →(f)→ y

On an interactive canvas, you can show two overlays:

Overlay 1: Jacobians on edges #

To get total derivative xy, multiply along the path:

Dxy=Df Dg.D_{\mathbf{x}}\mathbf{y} = Df,Dg.Dx​y=DfDg.

Overlay 2: live perturbations and gradients #

Pick a concrete point x₀.

∇xℓ=Dg(x)⊤ Df(u)⊤ ∇yℓ.\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top,Df(\mathbf{u})^\top,\nabla_{\mathbf{y}}\ell.∇x​ℓ=Dg(x)⊤Df(u)⊤∇y​ℓ.

This “two-way animation” is the visualization you want to internalize:

Scalar output special case (gradient form) #

Now suppose f:Rm→Rf: \mathbb{R}^m \to \mathbb{R}f:Rm→R is scalar. Then Df(u)Df(\mathbf{u})Df(u) is 1×m.

From the Jacobian chain rule:

D(f∘g)(x)=Df(u) Dg(x).D(f\circ g)(\mathbf{x}) = Df(\mathbf{u}),Dg(\mathbf{x}).D(f∘g)(x)=Df(u)Dg(x).

This left side is a 1×n row vector. If we want the gradient as an n×1 column vector, transpose:

\begin{align*}

\nabla (f\circ g)(\mathbf{x})

&= D(f\circ g)(\mathbf{x})^\top \

&= \bigl(Df(\mathbf{u}),Dg(\mathbf{x})\bigr)^\top \

&= Dg(\mathbf{x})^\top,Df(\mathbf{u})^\top \

&= Dg(\mathbf{x})^\top,\nabla f(\mathbf{u}).

\end{align*}

That last line is the standard “gradient chain rule” used everywhere in ML.

One more composition (three layers) #

If you have x→gu→fv→ry\mathbf{x} \xrightarrow{g} \mathbf{u} \xrightarrow{f} \mathbf{v} \xrightarrow{r} \mathbf{y}xg​uf​vr​y, then:

D(r∘f∘g)(x)=Dr(v) Df(u) Dg(x).D(r\circ f\circ g)(\mathbf{x}) = Dr(\mathbf{v}),Df(\mathbf{u}),Dg(\mathbf{x}).D(r∘f∘g)(x)=Dr(v)Df(u)Dg(x).

Forward-mode perturbations multiply left-to-right in the same order as the functions apply (inner to outer).

Reverse-mode gradients multiply by transposes right-to-left:

∇xℓ=Dg(x)⊤ Df(u)⊤ Dr(v)⊤ ∇yℓ.\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top,Df(\mathbf{u})^\top,Dr(\mathbf{v})^\top,\nabla_{\mathbf{y}}\ell.∇x​ℓ=Dg(x)⊤Df(u)⊤Dr(v)⊤∇y​ℓ.

This is backprop in one line—just applied repeatedly.

Application/Connection: From Multivariable Chain Rule to Backprop Intuition #

Why this matters for ML #

Neural networks are compositions of many vector-valued functions:

x→h(1)→h(2)→⋯→y^→ℓ.\mathbf{x} \to \mathbf{h}^{(1)} \to \mathbf{h}^{(2)} \to \cdots \to \hat{\mathbf{y}} \to \ell.x→h(1)→h(2)→⋯→y^​→ℓ.

Training requires ∇θℓ\nabla_{\theta}\ell∇θ​ℓ, gradients with respect to parameters. The only tool you need conceptually is the multivariable chain rule, but applied efficiently.

The hard part is not the calculus; it’s bookkeeping:

A concrete computational graph (with explicit shapes) #

Let’s build a tiny “network” with one hidden layer and a scalar loss. Define:

Forward computation:

  1. a = Wx+bW\mathbf{x} + \mathbf{b}Wx+b (so a ∈ ℝ³)

  2. h = σ(a)\sigma(\mathbf{a})σ(a) elementwise (so h ∈ ℝ³)

  3. ℓ=c⊤h\ell = \mathbf{c}^\top \mathbf{h}ℓ=c⊤h (so ℓ\ellℓ ∈ ℝ)

This is a composition:

x →(affine)→ a →(nonlinearity)→ h →(dot)→ ℓ

On an interactive canvas, you can attach:

EdgeLocal derivativeShape
xaDxa=WD_{\mathbf{x}}\mathbf{a} = WDx​a=W3×2
ahDah=Diag⁡(σ′(a))D_{\mathbf{a}}\mathbf{h} = \operatorname{Diag}(\sigma'(\mathbf{a}))Da​h=Diag(σ′(a))3×3
h→ℓDhℓ=c⊤D_{\mathbf{h}}\ell = \mathbf{c}^\topDh​ℓ=c⊤1×3

Forward-mode (push a perturbation) #

A perturbation Δx\Delta \mathbf{x}Δx produces:

\begin{align*}

\Delta \mathbf{a} &\approx W,\Delta \mathbf{x} \

\Delta \mathbf{h} &\approx \operatorname{Diag}(\sigma'(\mathbf{a})),\Delta \mathbf{a} \

\Delta \ell &\approx \mathbf{c}^\top,\Delta \mathbf{h}.

\end{align*}

Combine:

Δℓ≈c⊤ Diag⁡(σ′(a)) W Δx.\Delta \ell \approx \mathbf{c}^\top,\operatorname{Diag}(\sigma'(\mathbf{a})),W,\Delta \mathbf{x}.Δℓ≈c⊤Diag(σ′(a))WΔx.

So the total Jacobian (1×2 row vector) is:

Dxℓ=c⊤ Diag⁡(σ′(a)) W.D_{\mathbf{x}}\ell = \mathbf{c}^\top,\operatorname{Diag}(\sigma'(\mathbf{a})),W.Dx​ℓ=c⊤Diag(σ′(a))W.

Reverse-mode (pull back a gradient) #

Because ℓ is scalar, we typically want ∇xℓ\nabla_{\mathbf{x}}\ell∇x​ℓ as a 2×1 column vector.

Start with ∇ℓℓ=1\nabla_{\ell}\ell = 1∇ℓ​ℓ=1.

∇hℓ=c.\nabla_{\mathbf{h}}\ell = \mathbf{c}.∇h​ℓ=c.

∇aℓ=Diag⁡(σ′(a)) ∇hℓ=Diag⁡(σ′(a)) c.\nabla_{\mathbf{a}}\ell = \operatorname{Diag}(\sigma'(\mathbf{a})),\nabla_{\mathbf{h}}\ell = \operatorname{Diag}(\sigma'(\mathbf{a})),\mathbf{c}.∇a​ℓ=Diag(σ′(a))∇h​ℓ=Diag(σ′(a))c.

∇xℓ=W⊤ ∇aℓ=W⊤ Diag⁡(σ′(a)) c.\nabla_{\mathbf{x}}\ell = W^\top,\nabla_{\mathbf{a}}\ell = W^\top,\operatorname{Diag}(\sigma'(\mathbf{a})),\mathbf{c}.∇x​ℓ=W⊤∇a​ℓ=W⊤Diag(σ′(a))c.

Compare with the forward-mode Jacobian expression above:

They are transposes, consistent with ∇ℓ=(Dℓ)⊤\nabla \ell = (D\ell)^\top∇ℓ=(Dℓ)⊤.

Visual intuition: pushforward vs pullback #

To address visualization explicitly, here’s the picture you should rehearse:

  1. Pick a point x₀.

  2. Draw a tiny arrow Δx\Delta \mathbf{x}Δx at x.

  3. Multiply by local Jacobians to watch the arrow morph:

Now reverse:

  1. Draw a gradient arrow at h: it points in the direction that increases ℓ fastest in h-space.

  2. Pull it back to a using the transpose of the local Jacobian: Diag⁡(σ′)\operatorname{Diag}(\sigma')Diag(σ′) (symmetric here, so transpose doesn’t change it).

  3. Pull it back to x using W⊤W^\topW⊤.

This is not two unrelated processes. It’s the same linear maps viewed from two dual perspectives:

You don’t need the formal differential-geometry language to use it correctly, but you do need the operational rule:

If forward uses JJJ, backward uses J⊤J^\topJ⊤.

Connection you’ll use next #

Backpropagation is essentially the repeated application of:

∇inputℓ=J⊤ ∇outputℓ.\nabla_{\text{input}}\ell = J^\top,\nabla_{\text{output}}\ell.∇input​ℓ=J⊤∇output​ℓ.

where JJJ is the Jacobian of a local block in the computational graph.

Once you’re comfortable multiplying Jacobians (forward) and multiplying by transposes (backward), you’re ready to study backprop as an algorithmic optimization: reuse intermediate results so you don’t form huge Jacobian matrices explicitly.

Worked Examples (3) #

Matrix-form chain rule with a 2→2→1 composition (shape-check + gradient) #

Let g:R2→R2g: \mathbb{R}^2 \to \mathbb{R}^2g:R2→R2 and f:R2→Rf: \mathbb{R}^2 \to \mathbb{R}f:R2→R be

g(x1,x2)=[u1u2]=[x12+x2x1x2],f(u1,u2)=u1+u22.g(x_1,x_2) = \begin{bmatrix} u_1 \ u_2 \end{bmatrix} = \begin{bmatrix} x_1^2 + x_2 \ x_1x_2 \end{bmatrix}, \qquad f(u_1,u_2)= u_1 + u_2^2.g(x1​,x2​)=[u1​u2​​]=[x12​+x2​x1​x2​​],f(u1​,u2​)=u1​+u22​.

Define h=f∘gh = f\circ gh=f∘g. Compute Dh(x)Dh(\mathbf{x})Dh(x) and ∇h(x)\nabla h(\mathbf{x})∇h(x) at a general point, then evaluate at (x1,x2)=(1,2)(x_1,x_2)=(1,2)(x1​,x2​)=(1,2).

  1. Step 1: Compute Dg(x)Dg(\mathbf{x})Dg(x) (2×2).

    We have

    • •u1=x12+x2u_1 = x_1^2 + x_2u1​=x12​+x2​ so ∂u1/∂x1=2x1\partial u_1/\partial x_1 = 2x_1∂u1​/∂x1​=2x1​, ∂u1/∂x2=1\partial u_1/\partial x_2 = 1∂u1​/∂x2​=1.
    • •u2=x1x2u_2 = x_1x_2u2​=x1​x2​ so ∂u2/∂x1=x2\partial u_2/\partial x_1 = x_2∂u2​/∂x1​=x2​, ∂u2/∂x2=x1\partial u_2/\partial x_2 = x_1∂u2​/∂x2​=x1​.

    Thus

    Dg(x)=[2x11x2x1].Dg(\mathbf{x}) = \begin{bmatrix} 2x_1 & 1 \ x_2 & x_1 \end{bmatrix}.Dg(x)=[2x1​x2​​1x1​​].

  2. Step 2: Compute Df(u)Df(\mathbf{u})Df(u) (1×2).

    f(u1,u2)=u1+u22f(u_1,u_2)=u_1 + u_2^2f(u1​,u2​)=u1​+u22​ so

    Df(u)=[∂f/∂u1∂f/∂u2]=[12u2].Df(\mathbf{u}) = \begin{bmatrix} \partial f/\partial u_1 & \partial f/\partial u_2 \end{bmatrix} = \begin{bmatrix} 1 & 2u_2 \end{bmatrix}.Df(u)=[∂f/∂u1​​∂f/∂u2​​]=[1​2u2​​].

  3. Step 3: Apply the Jacobian chain rule.

    Dh(x)=Df(g(x)) Dg(x).Dh(\mathbf{x}) = Df(g(\mathbf{x})),Dg(\mathbf{x}).Dh(x)=Df(g(x))Dg(x).

    Substitute u2=x1x2u_2 = x_1x_2u2​=x1​x2​:

    Df(g(x))=[12x1x2].Df(g(\mathbf{x})) = \begin{bmatrix} 1 & 2x_1x_2 \end{bmatrix}.Df(g(x))=[1​2x1​x2​​].

    Now multiply:

    \begin{align*}

    Dh(\mathbf{x})

    &= \begin{bmatrix} 1 & 2x_1x_2 \end{bmatrix}

    \begin{bmatrix} 2x_1 & 1 \ x_2 & x_1 \end{bmatrix} \

    &= \begin{bmatrix}

    1\cdot 2x_1 + (2x_1x_2)\cdot x_2 ; , ; 1\cdot 1 + (2x_1x_2)\cdot x_1

    \end{bmatrix} \

    &= \begin{bmatrix}

    2x_1 + 2x_1x_2^2 ; , ; 1 + 2x_1^2x_2

    \end{bmatrix}.

    \end{align*}

  4. Step 4: Convert to gradient (column vector).

    Since hhh is scalar, DhDhDh is 1×2 and

    ∇h(x)=Dh(x)⊤=[2x1+2x1x221+2x12x2].\nabla h(\mathbf{x}) = Dh(\mathbf{x})^\top = \begin{bmatrix} 2x_1 + 2x_1x_2^2 \ 1 + 2x_1^2x_2 \end{bmatrix}.∇h(x)=Dh(x)⊤=[2x1​+2x1​x22​1+2x12​x2​​].

  5. Step 5: Evaluate at (1,2).

    ∇h(1,2)=[2⋅1+2⋅1⋅221+2⋅12⋅2]=[2+81+4]=[105].\nabla h(1,2)=\begin{bmatrix}2\cdot 1 + 2\cdot 1\cdot 2^2 \ 1 + 2\cdot 1^2\cdot 2\end{bmatrix}=
    \begin{bmatrix}2+8\1+4\end{bmatrix}=
    \begin{bmatrix}10\5\end{bmatrix}.∇h(1,2)=[2⋅1+2⋅1⋅221+2⋅12⋅2​]=[2+81+4​]=[105​].

Insight: The computation stayed organized because we never mixed ‘partial derivative rules’ randomly. We computed two local Jacobians with clear shapes (2×2 and 1×2), multiplied them in the only shape-consistent order, then transposed to get the gradient vector.

Pushforward vs pullback on a small computational graph (explicit J and Jᵀ) #

Let x ∈ ℝ². Define

  1. u = g(x) where g(x1,x2)=[u1u2]=[x1+2x2x1−x2]g(x_1,x_2) = \begin{bmatrix} u_1 \ u_2 \end{bmatrix} = \begin{bmatrix} x_1 + 2x_2 \ x_1 - x_2 \end{bmatrix}g(x1​,x2​)=[u1​u2​​]=[x1​+2x2​x1​−x2​​]

  2. scalar output ℓ=f(u)\ell = f(\mathbf{u})ℓ=f(u) where f(u1,u2)=u12+3u2f(u_1,u_2)= u_1^2 + 3u_2f(u1​,u2​)=u12​+3u2​

At x₀ = (1,1), compute:

  1. Step 1: Compute the forward values at x₀.

    At (1,1):

    • •u=g(1,1)=[1+2⋅11−1]=[30]\mathbf{u} = g(1,1) = \begin{bmatrix} 1+2\cdot 1 \ 1-1 \end{bmatrix} = \begin{bmatrix} 3 \ 0 \end{bmatrix}u=g(1,1)=[1+2⋅11−1​]=[30​].
    • •ℓ=f(3,0)=32+3⋅0=9\ell = f(3,0)= 3^2 + 3\cdot 0 = 9ℓ=f(3,0)=32+3⋅0=9.
  2. Step 2: Pushforward (perturbations use Jacobians).

    Compute Dg(x)Dg(\mathbf{x})Dg(x) (2×2). Since g is linear,

    Dg=[∂u1/∂x1∂u1/∂x2∂u2/∂x1∂u2/∂x2]=[121−1].Dg = \begin{bmatrix} \partial u_1/\partial x_1 & \partial u_1/\partial x_2 \ \partial u_2/\partial x_1 & \partial u_2/\partial x_2 \end{bmatrix} = \begin{bmatrix} 1 & 2 \ 1 & -1 \end{bmatrix}.Dg=[∂u1​/∂x1​∂u2​/∂x1​​∂u1​/∂x2​∂u2​/∂x2​​]=[11​2−1​].

    Compute Df(u)Df(\mathbf{u})Df(u) (1×2):

    Df(u)=[2u13].Df(\mathbf{u})=\begin{bmatrix}2u_1 & 3\end{bmatrix}.Df(u)=[2u1​​3​].

    At u = (3,0):

    Df(u)=[63].Df(\mathbf{u}) = \begin{bmatrix} 6 & 3 \end{bmatrix}.Df(u)=[6​3​].

    Now push Δx\Delta \mathbf{x}Δx forward:

    Δu≈Dg Δx=[121−1][0.01−0.02]=[0.01−0.040.01+0.02]=[−0.030.03].\Delta \mathbf{u} \approx Dg,\Delta \mathbf{x} = \begin{bmatrix} 1 & 2 \ 1 & -1 \end{bmatrix}\begin{bmatrix} 0.01 \ -0.02 \end{bmatrix} = \begin{bmatrix} 0.01-0.04 \ 0.01+0.02 \end{bmatrix} = \begin{bmatrix} -0.03 \ 0.03 \end{bmatrix}.Δu≈DgΔx=[11​2−1​][0.01−0.02​]=[0.01−0.040.01+0.02​]=[−0.030.03​].

    Then

    Δℓ≈Df Δu=[63][−0.030.03]=6(−0.03)+3(0.03)=−0.18+0.09=−0.09.\Delta \ell \approx Df,\Delta \mathbf{u} = \begin{bmatrix}6 & 3\end{bmatrix}\begin{bmatrix}-0.03\0.03\end{bmatrix} = 6(-0.03)+3(0.03) = -0.18+0.09 = -0.09.Δℓ≈DfΔu=[6​3​][−0.030.03​]=6(−0.03)+3(0.03)=−0.18+0.09=−0.09.

  3. Step 3: Pullback (gradients use transposes).

    Because ℓ\ellℓ is scalar, start with ∇uℓ\nabla_{\mathbf{u}}\ell∇u​ℓ:

    ∇uℓ=[∂ℓ/∂u1∂ℓ/∂u2]=[2u13].\nabla_{\mathbf{u}}\ell = \begin{bmatrix} \partial \ell/\partial u_1 \ \partial \ell/\partial u_2 \end{bmatrix} = \begin{bmatrix}2u_1\3\end{bmatrix}.∇u​ℓ=[∂ℓ/∂u1​∂ℓ/∂u2​​]=[2u1​3​].

    At u=(3,0):

    ∇uℓ=[63].\nabla_{\mathbf{u}}\ell = \begin{bmatrix}6\3\end{bmatrix}.∇u​ℓ=[63​].

    Now pull back through g using Dg⊤Dg^\topDg⊤:

    ∇xℓ=Dg⊤ ∇uℓ=[112−1][63]=[912−3]=[99].\nabla_{\mathbf{x}}\ell = Dg^\top,\nabla_{\mathbf{u}}\ell = \begin{bmatrix} 1 & 1 \ 2 & -1 \end{bmatrix}\begin{bmatrix}6\3\end{bmatrix} = \begin{bmatrix}9\12-3\end{bmatrix} = \begin{bmatrix}9\9\end{bmatrix}.∇x​ℓ=Dg⊤∇u​ℓ=[12​1−1​][63​]=[912−3​]=[99​].

  4. Step 4: Consistency check via dot product.

    The linear prediction should satisfy

    Δℓ≈∇xℓ⋅Δx.\Delta \ell \approx \nabla_{\mathbf{x}}\ell \cdot \Delta \mathbf{x}.Δℓ≈∇x​ℓ⋅Δx.

    Compute:

    [99]⋅[0.01−0.02]=9(0.01)+9(−0.02)=0.09−0.18=−0.09,\begin{bmatrix}9\9\end{bmatrix}\cdot\begin{bmatrix}0.01\-0.02\end{bmatrix} = 9(0.01)+9(-0.02)=0.09-0.18=-0.09,[99​]⋅[0.01−0.02​]=9(0.01)+9(−0.02)=0.09−0.18=−0.09,

    matching the pushforward result.

Insight: This example makes the duality visible: forward-mode computes the effect of a small move Δx\Delta \mathbf{x}Δx; reverse-mode computes the gradient that, when dotted with Δx\Delta \mathbf{x}Δx, predicts the same Δℓ\Delta \ellΔℓ. The transpose is exactly what makes those two views consistent.

Element-wise chain rule as “sum over intermediate coordinates” #

Let g:R3→R2g: \mathbb{R}^3\to\mathbb{R}^2g:R3→R2 and f:R2→R2f: \mathbb{R}^2\to\mathbb{R}^2f:R2→R2 be

g(x1,x2,x3)=[u1u2]=[x1x2x2+x32],f(u1,u2)=[y1y2]=[u1+u2u1u2].g(x_1,x_2,x_3)=\begin{bmatrix}u_1\u_2\end{bmatrix}=\begin{bmatrix}x_1x_2\x_2+x_3^2\end{bmatrix},\qquad f(u_1,u_2)=\begin{bmatrix}y_1\y_2\end{bmatrix}=\begin{bmatrix}u_1+u_2\u_1u_2\end{bmatrix}.g(x1​,x2​,x3​)=[u1​u2​​]=[x1​x2​x2​+x32​​],f(u1​,u2​)=[y1​y2​​]=[u1​+u2​u1​u2​​].

Compute ∂y2∂x3\frac{\partial y_2}{\partial x_3}∂x3​∂y2​​ at a general point using path-tracing.

  1. Step 1: Identify the dependency paths.

    We want y2=u1u2y_2 = u_1u_2y2​=u1​u2​.

    • •u1=x1x2u_1 = x_1x_2u1​=x1​x2​ does NOT depend on x3x_3x3​.
    • •u2=x2+x32u_2 = x_2 + x_3^2u2​=x2​+x32​ DOES depend on x3x_3x3​.

    So only paths through u2u_2u2​ matter for ∂y2/∂x3\partial y_2/\partial x_3∂y2​/∂x3​.

  2. Step 2: Apply the coordinate chain rule.

    Use

    ∂y2∂x3=∑k=12∂y2∂uk∂uk∂x3.\frac{\partial y_2}{\partial x_3} = \sum_{k=1}^{2}\frac{\partial y_2}{\partial u_k}\frac{\partial u_k}{\partial x_3}.∂x3​∂y2​​=k=1∑2​∂uk​∂y2​​∂x3​∂uk​​.

    Compute each factor:

    • •∂y2∂u1=∂(u1u2)∂u1=u2\frac{\partial y_2}{\partial u_1} = \frac{\partial (u_1u_2)}{\partial u_1} = u_2∂u1​∂y2​​=∂u1​∂(u1​u2​)​=u2​
    • •∂y2∂u2=∂(u1u2)∂u2=u1\frac{\partial y_2}{\partial u_2} = \frac{\partial (u_1u_2)}{\partial u_2} = u_1∂u2​∂y2​​=∂u2​∂(u1​u2​)​=u1​
    • •∂u1∂x3=0\frac{\partial u_1}{\partial x_3} = 0∂x3​∂u1​​=0
    • •∂u2∂x3=2x3\frac{\partial u_2}{\partial x_3} = 2x_3∂x3​∂u2​​=2x3​
  3. Step 3: Sum over k.

    \begin{align*}

    \frac{\partial y_2}{\partial x_3}

    &= \left(\frac{\partial y_2}{\partial u_1}\right)\left(\frac{\partial u_1}{\partial x_3}\right) + \left(\frac{\partial y_2}{\partial u_2}\right)\left(\frac{\partial u_2}{\partial x_3}\right) \

    &= (u_2)(0) + (u_1)(2x_3) \

    &= 2x_3,u_1 \

    &= 2x_3,(x_1x_2).

    \end{align*}

Insight: The summation form is ‘matrix multiplication with indices.’ It forces you to enumerate intermediate coordinates uku_kuk​ and add their contributions—exactly like summing over all paths in a computational graph.

Key Takeaways #

Common Mistakes #

Practice #

easy

Let g:R2→R3g: \mathbb{R}^2\to\mathbb{R}^3g:R2→R3 be g(x1,x2)=(x1,  x1x2,  x22)g(x_1,x_2)=(x_1,; x_1x_2,; x_2^2)g(x1​,x2​)=(x1​,x1​x2​,x22​) and let f:R3→Rf: \mathbb{R}^3\to\mathbb{R}f:R3→R be f(u1,u2,u3)=2u1−u2+u3f(u_1,u_2,u_3)=2u_1-u_2+u_3f(u1​,u2​,u3​)=2u1​−u2​+u3​. Compute ∇(f∘g)(x)\nabla (f\circ g)(\mathbf{x})∇(f∘g)(x).

Hint: Compute Dg(x)Dg(\mathbf{x})Dg(x) (3×2) and ∇f(u)\nabla f(\mathbf{u})∇f(u) (3×1). Then use ∇(f∘g)=Dg⊤∇f\nabla (f\circ g)=Dg^\top\nabla f∇(f∘g)=Dg⊤∇f.

Show solution

We have ∇f(u)=[2−11]\nabla f(\mathbf{u}) = \begin{bmatrix}2\-1\1\end{bmatrix}∇f(u)=​2−11​​.

Compute

Dg(x)=[∂u1/∂x1∂u1/∂x2∂u2/∂x1∂u2/∂x2∂u3/∂x1∂u3/∂x2]=[10x2x102x2].Dg(\mathbf{x})=\begin{bmatrix}
\partial u_1/\partial x_1 & \partial u_1/\partial x_2 \
\partial u_2/\partial x_1 & \partial u_2/\partial x_2 \
\partial u_3/\partial x_1 & \partial u_3/\partial x_2
\end{bmatrix} #

\begin{bmatrix}
1 & 0 \
x_2 & x_1 \
0 & 2x_2
\end{bmatrix}.Dg(x)=​∂u1​/∂x1​∂u2​/∂x1​∂u3​/∂x1​​∂u1​/∂x2​∂u2​/∂x2​∂u3​/∂x2​​​=​1x2​0​0x1​2x2​​​.

Then

∇(f∘g)(x)=Dg(x)⊤ ∇f=[1x200x12x2][2−11]=[2−x2−x1+2x2].\nabla (f\circ g)(\mathbf{x}) = Dg(\mathbf{x})^\top,\nabla f = \begin{bmatrix}1 & x_2 & 0\0 & x_1 & 2x_2\end{bmatrix}\begin{bmatrix}2\-1\1\end{bmatrix} = \begin{bmatrix}2-x_2\-x_1+2x_2\end{bmatrix}.∇(f∘g)(x)=Dg(x)⊤∇f=[10​x2​x1​​02x2​​]​2−11​​=[2−x2​−x1​+2x2​​].

medium

Suppose g:Rn→Rmg: \mathbb{R}^n\to\mathbb{R}^mg:Rn→Rm and f:Rm→Rpf: \mathbb{R}^m\to\mathbb{R}^pf:Rm→Rp. You are told Dg(x)Dg(\mathbf{x})Dg(x) is m×n and Df(u)Df(\mathbf{u})Df(u) is p×m. What is the shape of D(f∘g)(x)D(f\circ g)(\mathbf{x})D(f∘g)(x)? Also, if p=1p=1p=1, what is the shape of ∇(f∘g)(x)\nabla (f\circ g)(\mathbf{x})∇(f∘g)(x) (as a column vector)?

Hint: Use matrix multiplication shapes. For p=1p=1p=1, the Jacobian is 1×n and the gradient is its transpose.

Show solution

D(f∘g)(x)=Df(g(x)) Dg(x)D(f\circ g)(\mathbf{x}) = Df(g(\mathbf{x})),Dg(\mathbf{x})D(f∘g)(x)=Df(g(x))Dg(x) has shape (p×m)(m×n) = p×n.

If p=1p=1p=1, then D(f∘g)D(f\circ g)D(f∘g) is 1×n, so the gradient ∇(f∘g)\nabla (f\circ g)∇(f∘g) (column) has shape n×1.

hard

Let u = g(x) with g:R2→R2g: \mathbb{R}^2\to\mathbb{R}^2g:R2→R2 given by u1=x12u_1=x_1^2u1​=x12​, u2=sin⁡(x2)u_2=\sin(x_2)u2​=sin(x2​). Let scalar ℓ=f(u)\ell=f(\mathbf{u})ℓ=f(u) with f(u1,u2)=u1u2f(u_1,u_2)=u_1u_2f(u1​,u2​)=u1​u2​. Compute ∇xℓ\nabla_{\mathbf{x}}\ell∇x​ℓ.

Hint: Compute ∇uℓ\nabla_{\mathbf{u}}\ell∇u​ℓ first, then multiply by Dg(x)⊤Dg(\mathbf{x})^\topDg(x)⊤. Remember DgDgDg is 2×2 and diagonal here.

Show solution

First, ℓ=u1u2\ell=u_1u_2ℓ=u1​u2​ so

∇uℓ=[∂ℓ/∂u1∂ℓ/∂u2]=[u2u1].\nabla_{\mathbf{u}}\ell = \begin{bmatrix}\partial \ell/\partial u_1\\partial \ell/\partial u_2\end{bmatrix} = \begin{bmatrix}u_2\u_1\end{bmatrix}.∇u​ℓ=[∂ℓ/∂u1​∂ℓ/∂u2​​]=[u2​u1​​].

Next, Dg(x)Dg(\mathbf{x})Dg(x):

Thus

Dg(x)=[2x100cos⁡(x2)].Dg(\mathbf{x})=\begin{bmatrix}2x_1 & 0\0 & \cos(x_2)\end{bmatrix}.Dg(x)=[2x1​0​0cos(x2​)​].

So

∇xℓ=Dg(x)⊤ ∇uℓ=[2x100cos⁡(x2)][u2u1]=[2x1sin⁡(x2)x12cos⁡(x2)].\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top,\nabla_{\mathbf{u}}\ell = \begin{bmatrix}2x_1 & 0\0 & \cos(x_2)\end{bmatrix}\begin{bmatrix}u_2\u_1\end{bmatrix} = \begin{bmatrix}2x_1\sin(x_2)\x_1^2\cos(x_2)\end{bmatrix}.∇x​ℓ=Dg(x)⊤∇u​ℓ=[2x1​0​0cos(x2​)​][u2​u1​​]=[2x1​sin(x2​)x12​cos(x2​)​].

Connections #

Next up: apply this rule repeatedly and efficiently in neural networks.

Related supporting nodes you may also want:

Quality: A (4.5/5)

← back to treebrowse all →