Computational Graphs

←Back to Tech Tree

inventorycoverage

Computational Graphs #

Machine LearningDifficulty: ★★★☆☆Depth: 0Unlocks: 6

A representation of mathematical expressions as directed graphs where nodes are operations and edges are tensors, used to structure and execute forward and backward passes in neural network computation. They make it clear how values flow and where gradients need to be propagated for training.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

f(...) (operation/node)d/dx (gradient/derivative operator)

Essential Relationships #

Unlocks (3) #

Deep Learninglvl 5Automatic Differentiationlvl 4Numerical Stability and Conditioninglvl 4

Advanced Learning Details

Graph Position #

6

Depth Cost

6

Fan-Out (ROI)

2

Bottleneck Score

0

Chain Length

Cognitive Load #

6

Atomic Elements

37

Total Elements

L2

Percentile Level

L4

Atomic Level

All Concepts (15) #

Teaching Strategy #

Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.

Neural networks look like “layers”, but training them is really about one thing: computing lots of derivatives efficiently and correctly. Computational graphs are the picture (and data structure) that makes both the forward computation and the backward (gradient) computation explicit—so we can automate learning.

TL;DR:

A computational graph represents a function as a directed graph: nodes are operations f(…), edges carry tensors (values). A forward pass evaluates node outputs in dependency order. A backward pass (reverse-mode autodiff / backprop) propagates gradients from outputs to inputs using the chain rule, accumulating contributions when a value fans out to multiple consumers.

What Is a Computational Graph? #

A computational graph is a directed graph that represents how a mathematical expression (or program) computes its outputs from its inputs.

Why we need this picture at all:

A computational graph makes two things explicit:

  1. Data flow (forward computation): how numbers/tensors are produced.

  2. Dependency structure (for derivatives): which intermediate values influence which outputs.

The two core roles: nodes and edges #

This is the key mental model:

A tiny example (scalar) #

Suppose we define

z = (x + y) · y

We can break this into intermediate values:

Graphically (conceptually): x and y flow into “+” producing a, and then a and y flow into “×” producing z.

Even at this size, the graph is useful because it tells us:

Tensors, not just scalars #

In ML, edges typically carry vectors/matrices/tensors. We’ll use v for vectors and uppercase like W for matrices.

The graph doesn’t care whether values are scalars or tensors; it’s the same idea. The only difference is that “derivatives” become vector-Jacobian and Jacobian-vector products under the hood.

Why graphs are central in deep learning frameworks #

Computational graphs are the backbone of systems like PyTorch, JAX, and TensorFlow.

They enable:

You can think of a computational graph as the “assembly language” of differentiable programs: explicit enough to analyze, optimize, and differentiate.

Core Mechanic 1: Forward Pass — Values Flow Along Edges #

The forward pass is simply: evaluate every node in an order that respects dependencies.

Why forward order matters #

If node B uses the output of node A, then A must run before B. This is a partial order induced by the directed edges.

Most computational graphs used for standard feed-forward neural networks are acyclic (a DAG: directed acyclic graph). For a DAG, a topological ordering exists.

Example: a small DAG #

Let

u = x · y

w = sin(ν)

z = w + y

Dependencies:

A valid forward order is: compute ν → w → z.

What is stored during forward? #

Here is a subtle but important point: for backprop, you typically need some intermediate values.

So most autodiff systems either:

This leads to a core engineering trade-off:

StrategyWhat you storeMemoryExtra computeCommon use
Store intermediatesν, w, …higherlowdefault training
Recompute (checkpoint)fewer valueslowerhighervery deep nets

Multiple consumers (fan-out) #

A single node output may feed many later nodes. That means:

Example:

Here a is used in two places. During backprop, the gradient ∂L/∂a has two contributions:

You add them because the total derivative is linear in contributions from different paths.

Shapes are part of the contract #

Edges carry tensors with shapes. Each node must obey shape rules.

Computational graphs make shape mismatches easier to catch because every edge has a well-defined produced tensor.

Takeaway #

Forward mode is “just computation,” but the graph clarifies:

That clarity becomes crucial when we reverse direction and compute gradients.

Core Mechanic 2: Backward Pass — Reverse-Mode Autodiff (Backprop) on the Graph #

Training needs gradients: if L is a loss and θ are parameters, we want ∂L/∂θ.

Why reverse-mode? #

In deep learning, we usually have:

Reverse-mode autodiff computes all ∂L/∂θ in about the cost of a small constant multiple of the forward pass. That’s why it’s the default for neural nets.

Local derivatives + chain rule #

Each node performs an operation. Backprop works by combining:

  1. An upstream gradient: how the loss changes with respect to the node’s output.

  2. A local derivative: how the node’s output changes with respect to its inputs.

Then it produces downstream gradients for each input.

If a node computes

u = f(a)

then by the chain rule:

dL/da = (dL/dν) · (dν/da)

In graph terms:

A concrete scalar derivation #

Let

a = x + y

z = a · y

Assume L = z (so L = (x + y) · y).

Forward:

Backward:

So:

So:

Now combine contributions to y (because y influenced L via two paths):

dL/dy = a + y = (x + y) + y = x + 2y

This “sum the contributions at fan-out” rule is one of the most important operational facts about backprop.

Gradient accumulation at merges #

Whenever an edge’s value is used in multiple places, its gradient is the sum of gradients coming back from each usage.

Operationally, frameworks maintain an accumulator for each node output:

Vector/tensor intuition (without heavy Jacobians) #

For tensors, the same idea holds but with appropriate linear algebra.

Example: y = W x

If we have upstream gradient g = dL/dy (a vector in ℝᵐ), then:

You don’t need to explicitly form Jacobians; backprop uses efficient vector-Jacobian products.

Reverse traversal order #

To compute gradients correctly, you need to process nodes in reverse topological order (from outputs back to inputs). In a DAG, this is well-defined.

What about branches and shared subgraphs? #

Branches are fine; they just introduce multiple gradient contributions that must sum.

Shared subgraphs (reuse) are also fine; they are exactly “fan-out.”

What about cycles (RNNs)? #

Some models have logical cycles (recurrent connections). Training still uses backprop by unrolling the cycle over time steps, yielding a larger DAG.

Takeaway #

Reverse-mode autodiff is “the chain rule on a graph”:

Application/Connection: How Computational Graphs Power Deep Learning Practice #

Computational graphs aren’t just a conceptual tool; they determine how training runs on real hardware.

1) Defining models as graphs #

A neural net layer is a subgraph pattern:

Stacking layers composes graphs.

A whole training step includes:

  1. Forward pass to compute predictions and loss L

  2. Backward pass to compute gradients ∂L/∂θ

  3. Parameter update (e.g., θ ← θ − η ∂L/∂θ)

That’s “forward graph” + “backward graph” + “optimizer graph.”

2) Debugging: you can localize problems #

If gradients are NaN or 0, a graph helps you ask:

This connects directly to numerical stability: many stability tricks (like log-sum-exp) are specific subgraph rewrites.

3) Performance: graphs enable optimizations #

Once computation is explicit, systems can:

The graph is also used for device placement:

4) Static vs dynamic graphs (conceptual comparison) #

Different frameworks expose graphs differently.

StyleHow graph is builtProsConsExamples
Static graphdefine graph, then runglobal optimization, easier deploymentless flexible debugging, control flow harderTF 1.x, XLA-compiled graphs
Dynamic graphgraph created as code runsvery flexible, Pythonic debuggingfewer global optimizations unless compiledPyTorch eager, JAX tracing (hybrid)

Modern systems blur the line (e.g., torch.compile, JAX jit): write dynamic code, then trace/compile into a more static graph for speed.

5) Connecting forward/backward to learning #

A central reason computational graphs matter is that learning signals (gradients) follow the same dependency paths as values—but reversed.

That provides a practical “sanity check” when designing architectures.

6) Where this node unlocks next #

Computational graphs are the map; autodiff is the navigation algorithm; deep learning is the city you can now traverse.

Worked Examples (3) #

Example 1: Backprop on a Small Scalar Graph with a Branch (Fan-out) #

Define a computational graph with

u = x + y

z = u · y

w = u²

L = z + w

Goal: compute dL/dx and dL/dy using the graph + chain rule.

  1. Forward compute intermediate values:

    • •u = x + y
    • •z = u · y
    • •w = u²
    • •L = z + w
  2. Initialize gradient at the output:

    • •dL/dL = 1

    Because L = z + w, local derivatives are:

    • •dL/dz = 1
    • •dL/dw = 1
  3. Backprop through w = u²:

    • •dw/du = 2u

    So contribution to u from this path:

    • •dL/du |_from_w = dL/dw · dw/du = 1 · 2u = 2u
  4. Backprop through z = u · y:

    Local derivatives:

    • •dz/du = y
    • •dz/dy = u

    So:

    • •dL/du |_from_z = dL/dz · dz/du = 1 · y = y
    • •dL/dy |_from_z = dL/dz · dz/dy = 1 · u = u
  5. Accumulate gradients at u (fan-out merge):

    • •dL/du = (dL/du |_from_w) + (dL/du |_from_z)
    • •dL/du = 2u + y
  6. Backprop through u = x + y:

    Local derivatives:

    • •du/dx = 1
    • •du/dy = 1

    So:

    • •dL/dx = dL/du · du/dx = (2u + y) · 1 = 2u + y
    • •dL/dy |_from_u = dL/du · du/dy = (2u + y) · 1 = 2u + y
  7. Accumulate total gradient for y (two paths):

    • •dL/dy = (dL/dy |_from_z) + (dL/dy |_from_u)
    • •dL/dy = u + (2u + y) = 3u + y
  8. Substitute u = x + y:

    • •dL/dx = 2(x + y) + y = 2x + 3y
    • •dL/dy = 3(x + y) + y = 3x + 4y

Insight: Whenever a value is used by multiple later operations (fan-out), its gradient is the sum of gradients flowing back along each outgoing path. This is the graph version of “collect all chain rule paths and add them.”

Example 2: Vector Computational Graph for a Linear Layer + Squared Error #

Let x ∈ ℝⁿ, W ∈ ℝᵐˣⁿ, b ∈ ℝᵐ, target t ∈ ℝᵐ.

Define:

Compute gradients dL/dy, dL/dW, dL/db, dL/dx.

  1. Rewrite loss with an intermediate:

    Let e = yt.

    Then L = 1/2 ‖e‖² = 1/2 ∑ᵢ eᵢ².

  2. Compute gradient w.r.t. e:

    For each component, ∂L/∂eᵢ = 1/2 · 2eᵢ = eᵢ.

    So in vector form:

    • •dL/de = e.
  3. Backprop through e = yt:

    • •∂e/∂y is identity, and t is constant.

    So:

    • •dL/dy = dL/de = e = (yt).
  4. Backprop through y = W x + b:

    Split into two nodes: u = W x, then y = u + b.

    Upstream gradient is g = dL/dy.

    For y = u + b:

    • •dL/du = g
    • •dL/db = g

    For u = W x:

    Using matrix calculus identities:

    • •dL/dW = g x
    • •dL/dx = Wᵀ g
  5. Optionally substitute g = (yt) and y = W x + b:

    • •dL/db = W x + bt
    • •dL/dW = (W x + bt) x
    • •dL/dx = Wᵀ (W x + bt)

Insight: Computational graphs scale because each node only needs a local backward rule (like for + or matmul). Reverse-mode AD stitches those local rules into full gradients without ever building a giant Jacobian explicitly.

Example 3: A Numerically Sensitive Subgraph and a Stable Rewrite (Softmax + Log) #

Consider probabilities from logits s ∈ ℝᵏ via softmax:

If you then compute log-prob for class c:

Show how the computational graph suggests a stable rewrite using log-sum-exp.

  1. Start from the definition:

    softmax(s)꜀ = exp(s꜀) / ∑ⱼ exp(sⱼ)

    So:

    L = −log(exp(s꜀) / ∑ⱼ exp(sⱼ)).

  2. Use log rules:

    L = −(log(exp(s꜀)) − log(∑ⱼ exp(sⱼ)))

    = −(s꜀ − log(∑ⱼ exp(sⱼ)))

    = log(∑ⱼ exp(sⱼ)) − s꜀.

  3. Identify numerical issue in the original graph:

    The naive path computes exp(sⱼ) which can overflow if some sⱼ is large.

    Graph insight: the problematic node is exp(·) applied to raw logits.

  4. Apply the log-sum-exp trick by subtracting m = maxⱼ sⱼ:

    ∑ⱼ exp(sⱼ) = ∑ⱼ exp((sⱼ − m) + m)

    = exp(m) ∑ⱼ exp(sⱼ − m)

    So:

    log(∑ⱼ exp(sⱼ)) = log(exp(m) ∑ⱼ exp(sⱼ − m))

    = m + log(∑ⱼ exp(sⱼ − m)).

  5. Final stable expression:

    Let m = maxⱼ sⱼ.

    Then:

    L = (m + log(∑ⱼ exp(sⱼ − m))) − s꜀.

    All exponentials now see inputs (sⱼ − m) ≤ 0, reducing overflow risk.

Insight: Computational graphs aren’t just for gradients; they help you spot unstable subgraphs (like exp feeding into sums and logs) and replace them with equivalent but stable subgraphs.

Key Takeaways #

Common Mistakes #

Practice #

easy

Let a = x · y, b = a + y, L = b². Compute dL/dx and dL/dy.

Hint: Do forward intermediates a, b. Backprop from L to b to (a, y) and then to (x, y). Remember y affects b both directly and via a.

Show solution

Forward:

a = x y

b = a + y = x y + y = y(x + 1)

L = b²

Backward:

dL/db = 2b

For b = a + y:

For a = x y:

Total:

dL/dx = 2by

dL/dy = 2b + 2bx = 2b(1 + x)

Substitute b = y(x + 1):

dL/dx = 2 y(x + 1) · y = 2 y² (x + 1)

dL/dy = 2 y(x + 1) (1 + x) = 2 y (x + 1)²

medium

Consider y = ReLU(W x + b) and loss L = ∑ᵢ yᵢ (sum of outputs). Express dL/dW in terms of x and a mask derived from the pre-activation h = W x + b.

Hint: Use dL/dy = 1 (vector of ones). ReLU’(h) is 1 where hᵢ > 0 else 0. Then dL/dW looks like an outer product.

Show solution

Let h = W x + b, y = ReLU(h).

Given L = ∑ᵢ yᵢ, we have dL/dy = 1 (all ones).

Backprop through ReLU:

Define mask m where mᵢ = 1 if hᵢ > 0 else 0.

Then dL/dh = dL/dym = 1m = m.

Backprop through h = W x + b:

For matrix multiply, dL/dW = (dL/dh) xᵀ = m xᵀ.

So dL/dW is an outer product of the ReLU mask with the input x.

hard

A value u feeds into two branches: v = u² and w = 3u. The final output is L = v · w. Compute dL/du by treating it as a computational graph and showing the gradient accumulation clearly.

Hint: Compute dL/dv and dL/dw first from L = v w. Then push back to u through each branch and add.

Show solution

Forward:

v = u²

w = 3u

L = v w

Backward:

From L = v w:

Backprop through v = u²:

So contribution:

dL/du |_via_v = dL/dv · dv/du = w · 2u

Backprop through w = 3u:

So contribution:

dL/du |_via_w = dL/dw · dw/du = v · 3

Accumulate at u:

dL/du = (w · 2u) + (v · 3)

Substitute v = u² and w = 3u:

dL/du = (3u · 2u) + (u² · 3)

= 6u² + 3u²

= 9u².

Connections #

Next nodes you can study:

Related conceptual building blocks:

Quality: A (4.6/5)

← back to treebrowse all →