Residual (Skip) Connections

←Back to Tech Tree

inventorycoverage

Residual (Skip) Connections #

Machine LearningDifficulty: ★★★☆☆Depth: 0Unlocks: 1

Architecture pattern that adds the input of a layer to its output to enable training of very deep networks and ease gradient flow; understand the motivation, basic implementation, and effect on optimization. Residual connections are a structural backbone of Transformer layers.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

x, F(x), and + (elementwise addition)

Essential Relationships #

Unlocks (1) #

Transformerslvl 5

Advanced Learning Details

Graph Position #

5

Depth Cost

1

Fan-Out (ROI)

1

Bottleneck Score

0

Chain Length

Cognitive Load #

5

Atomic Elements

32

Total Elements

L1

Percentile Level

L3

Atomic Level

All Concepts (13) #

Teaching Strategy #

Self-serve tutorial - low prerequisites, straightforward concepts.

Deep networks can be powerful, but they’re often hard to train: as you stack more layers, optimization can stall or become unstable. Residual (skip) connections fix a surprisingly big part of this with a tiny structural change: let information (and gradients) flow around a layer by adding the input back to the output.

TL;DR:

A residual block outputs y = x + F(x). Instead of forcing a stack of layers to learn a full mapping H(x), you let it learn a residual F(x) = H(x) − x. This makes the identity mapping easy to represent, improves gradient flow, and enables training much deeper networks. Residual connections are core to modern architectures, including Transformers (via Add & Norm).

What Is a Residual (Skip) Connection? #

The problem it addresses #

When you build a deep neural network, you typically compose many transformations:

In principle, deeper networks can represent more complex functions. In practice, very deep stacks can become hard to optimize: training loss stops improving, gradients become unhelpful, and the model struggles to learn even “simple” behaviors.

Residual (skip) connections are a structural pattern designed to make deep stacks trainable.

The core definition #

A residual connection means a layer (or block) computes:

y = x + F(x)

where:

This is why it’s also called a skip connection: x “skips” around the transform F and is added back in.

Intuition: “don’t overwrite—edit” #

Without a residual connection, a block must output H(x) directly:

With residual connections, the block outputs:

So the block isn’t forced to generate the whole signal from scratch. It can instead learn an edit to the input.

If the best thing to do is “do nothing,” the residual block can set F(x) ≈ 0, giving y ≈ x. That “identity solution” turns out to be extremely helpful for optimization.

Residual mapping viewpoint #

You can rewrite the residual form as:

The representational power is not reduced—H can still be learned—but the parameterization makes certain useful functions (especially identity) much easier to reach during training.

Where you’ll see it #

Residual connections show up in many places:

In Transformers, you’ll repeatedly see patterns like:

That single plus sign is a big part of why very deep Transformer stacks can train at all.

Core Mechanic 1: Identity Shortcuts and Gradient Flow #

Why gradient flow matters #

Training uses gradient-based optimization. If the gradient of the loss with respect to early layers becomes extremely small or chaotic, learning stalls.

Residual connections help by providing a direct path from later layers back to earlier layers.

A minimal gradient derivation (scalar intuition) #

To see the key idea without getting lost in tensor notation, consider a single residual block with scalar input x:

Let L be the loss depending on y. By the chain rule:

∂L/∂x = ∂L/∂y · ∂y/∂x

Now compute ∂y/∂x:

y = x + F(x)

∂y/∂x = 1 + ∂F/∂x

So:

∂L/∂x = ∂L/∂y · (1 + ∂F/∂x)

The critical term is the “1” coming from the skip connection.

This doesn’t magically eliminate vanishing/exploding gradients in every setting, but it adds a stable component to gradient propagation.

Vector/tensor form (same idea) #

Let x ∈ ℝᵈ, and F maps ℝᵈ → ℝᵈ. Then:

y = x + F(x)

The Jacobian is:

y/∂x = I + ∂F/∂x

So the gradient to earlier layers includes an identity term I. Across many stacked blocks, this means there exists a path where the gradient multiplies by (approximately) identity instead of repeatedly multiplying by potentially small/large Jacobians.

Composition across many layers #

Consider a deep stack of residual blocks:

x₀ = input

For k = 0…N−1:

xₖ₊₁ = xₖ + Fₖ(xₖ)

If all Fₖ are near 0 early in training, then:

xₖ₊₁ ≈ xₖ

So the entire deep network initially behaves close to the identity mapping—meaning signals don’t get destroyed just because you made the network deep.

This is one of the most important practical effects:

You can add more layers to gain capacity, and training remains feasible because the network can default to “almost identity” until each block learns useful residual corrections.

Residual connections as an optimization “safety net” #

A helpful mental model:

This shifts the optimization landscape toward something easier to navigate.

A quick comparison table #

PropertyPlain stack (no skip)Residual stack
Block formy = F(x)y = x + F(x)
Identity mappingMust be learned by parametersAchieved by F(x) = 0
Gradient termF/∂xI + ∂F/∂x
Deep behavior early in trainingCan distort signalsOften close to identity
Practical depthLimitedMuch larger

Residual connections don’t replace good initialization, normalization, and learning-rate choices—but they make deep learning dramatically more robust.

Core Mechanic 2: Residual Mapping F(x) Learns “the Difference” #

Why learn a difference instead of the whole function? #

Suppose the function you ultimately want is H(x). In many deep models, especially when stacking many similar blocks, H(x) may be close to x much of the time.

Examples of “close to identity” needs:

If H(x) ≈ x, then the difference H(x) − x is small.

Residual blocks make the model explicitly represent that difference:

F(x) = H(x) − x

This can be easier to learn because:

  1. 1)Small corrections can be represented with small weights.
  2. 2)Many blocks can each contribute a modest update.
  3. 3)If a block is not needed, it can “opt out” by learning F(x) ≈ 0.

The “easy identity” argument, with a bit more structure #

Imagine a block F is implemented as a small network, like:

F(x) = W₂ σ(W₁ x)

If you want the overall mapping to be identity (y = x), then in a plain network you would need:

W₂ σ(W₁ x) = x

That’s a strong constraint; depending on σ, it might be awkward or require special parameter values.

In a residual network, identity is just:

y = x + W₂ σ(W₁ x) = x

which is achieved whenever:

W₂ σ(W₁ x) = 0

A sufficient condition is simply:

W₂ = 0 (or very small)

This shows why “do nothing” is naturally available to the model.

Residual blocks as incremental updates (an optimization lens) #

Rearrange the residual update:

xₖ₊₁ = xₖ + Fₖ(xₖ)

This looks like an iterative refinement method:

If you squint, it resembles a discretized dynamical system:

xₖ₊₁ − xₖ = Fₖ(xₖ)

This perspective helps explain why very deep residual networks behave like many small steps rather than a few huge jumps.

Implementation details that matter #

Residual connections are conceptually simple, but practical implementations need two details:

1) Shape matching #

Elementwise addition requires the same shape:

If shapes differ, common strategies include:

y = P(x) + F(x)

2) Where normalization goes (important in Transformers) #

In Transformer literature you’ll see:

Both use residual addition; they differ in optimization behavior and stability for very deep stacks. The key idea for this node is the residual path itself: the “+ x” that preserves a direct route for information.

Residual connections are not (just) “more capacity” #

It’s tempting to think residual connections add representational power. But the main win is reparameterization:

That’s why residual connections are discussed primarily as an optimization technique—even though they also interact with generalization and training dynamics in interesting ways.

Application/Connection: Residuals in Transformer Layers (Add & Norm) #

Why Transformers lean on residual connections #

A Transformer layer has multiple complex subcomponents:

Training would be much harder if each layer had to fully rewrite token representations. Residual connections make each sublayer behave like a controlled update to the current representation.

Canonical Transformer sublayer pattern #

For a token representation vector x (really a matrix of token vectors), a typical sublayer wrapper is:

y = x + Sublayer(x)

Then normalization is applied either before or after depending on design.

A simplified (pre-norm) Transformer block looks like:

  1. Attention sublayer

x₁ = x + Attention(LayerNorm(x))

  1. MLP sublayer

x₂ = x₁ + MLP(LayerNorm(x₁))

Each residual addition means:

Interpreting attention as a residual update #

Self-attention produces context-aware mixtures of token information. But residual addition ensures:

This is especially important early in training, when attention weights may be noisy.

Practical effects you can observe #

In real training runs, residual connections often enable:

Residuals are not sufficient alone—Transformers also rely heavily on normalization, learning-rate schedules, and careful initialization. But the skip connection is a structural backbone.

When residual connections can be tricky #

Residuals are usually beneficial, but you still must manage:

A common engineering tactic is to ensure the residual branch starts small (e.g., initialization choices), so early training behaves close to identity and gradually learns stronger updates.

Big picture connection #

Residual connections are a pattern you’ll see across architectures because they solve a general problem:

Transformers are one of the most important modern examples, but the core math is the same: y = x + F(x).

Worked Examples (3) #

Example 1: Residual vs plain block—compute gradients (scalar case) #

Let F(x) = a·x (a is a scalar parameter). Compare a plain block y = F(x) to a residual block y = x + F(x). Suppose the loss is L = ½·(y − t)² for target t. Compute ∂L/∂x for both cases.

  1. Plain block:

    y = F(x) = a·x

    L = ½·(a·x − t)²

  2. Compute ∂L/∂y:

    ∂L/∂y = (y − t)

  3. Compute ∂y/∂x:

    y = a·x ⇒ ∂y/∂x = a

  4. Chain rule:

    ∂L/∂x = ∂L/∂y · ∂y/∂x

    = (y − t) · a

    = (a·x − t) · a

  5. Residual block:

    y = x + F(x) = x + a·x = (1 + a)·x

    L = ½·((1 + a)·x − t)²

  6. Compute ∂y/∂x:

    y = (1 + a)·x ⇒ ∂y/∂x = (1 + a)

    (Equivalently: y = x + a·x ⇒ ∂y/∂x = 1 + a.)

  7. Chain rule:

    ∂L/∂x = (y − t) · (1 + a)

    = (((1 + a)·x) − t) · (1 + a)

Insight: In the plain case, the backprop multiplier is a. If a is small, gradients shrink. In the residual case, the multiplier is (1 + a), which includes a built-in 1 from the skip path—making it harder for gradients to vanish purely because the learned transform is weak early in training.

Example 2: Residual mapping view—convert between H(x) and F(x) #

Assume the desired function is H(x) = 3x + 2 (scalar). If we implement a residual block y = x + F(x), what residual function F(x) must be learned? Verify that the residual formulation reproduces H(x).

  1. Start from the relationship:

    H(x) = x + F(x)

    So:

    F(x) = H(x) − x

  2. Plug in H(x) = 3x + 2:

    F(x) = (3x + 2) − x

    = 2x + 2

  3. Verify by substitution:

    y = x + F(x)

    = x + (2x + 2)

    = 3x + 2

    = H(x)

Insight: Residual learning doesn’t restrict what you can represent; it changes what the block is asked to model. If H(x) is close to x, then F(x) is small—often easier to learn robustly with gradient-based optimization.

Example 3: Shape matching—when you need a projection shortcut #

Let x ∈ ℝ³, but suppose your transform produces F(x) ∈ ℝ² (e.g., you changed hidden size). You cannot add x + F(x) directly. Show how a projection P fixes this, and write the new residual form.

  1. Problem: elementwise addition requires matching shapes.

    x ∈ ℝ³

    F(x) ∈ ℝ²

    So x + F(x) is undefined.

  2. Introduce a learned projection P: ℝ³ → ℝ², often linear:

    P(x) = W x + b

    where W is 2×3 and b ∈ ℝ².

  3. Define the residual output:

    y = P(x) + F(x)

    Now both terms are in ℝ², so the sum is valid.

  4. Interpretation:

    • •the skip path still exists, but it now passes through a simple projection
    • •the main branch learns a correction in the new space

Insight: Residual connections fundamentally require an additive merge. If dimensions change, you can keep the residual idea by making the shortcut path match dimensions via a cheap projection.

Key Takeaways #

Common Mistakes #

Practice #

easy

Let x ∈ ℝᵈ and a residual block be y = x + F(x). Show (using Jacobians) that ∂y/∂x = I + ∂F/∂x.

Hint: Differentiate y componentwise: yᵢ = xᵢ + Fᵢ(x). What is ∂xᵢ/∂xⱼ?

Show solution

Write yᵢ = xᵢ + Fᵢ(x).

Then for each i, j:

∂yᵢ/∂xⱼ = ∂xᵢ/∂xⱼ + ∂Fᵢ/∂xⱼ

But ∂xᵢ/∂xⱼ = 1 if i = j, else 0, which is exactly the identity matrix I.

So the Jacobian is:

y/∂x = I + ∂F/∂x.

medium

Suppose a deep network is a stack of N residual blocks: xₖ₊₁ = xₖ + Fₖ(xₖ). If all Fₖ are exactly 0, what function does the whole network compute? What does this imply about adding more layers?

Hint: If Fₖ = 0, then xₖ₊₁ = xₖ. Unroll the recurrence.

Show solution

If Fₖ ≡ 0 for every k, then:

x₁ = x₀ + 0 = x₀

x₂ = x₁ + 0 = x₁ = x₀

...

x_N = x₀

So the whole network computes the identity function.

Implication: adding layers does not force the network to change behavior; the model can default to identity and only learn nonzero residuals where useful. This makes very deep architectures easier to optimize.

hard

You have a residual block where x ∈ ℝ⁶, but the transform produces F(x) ∈ ℝ⁴. Propose two ways to make a valid residual connection, and write the resulting equations.

Hint: Either change the transform to output ℝ⁶, or project the skip path into ℝ⁴ (or project the transform into ℝ⁶).

Show solution

Two valid approaches:

  1. Keep the skip path unchanged and make the transform match ℝ⁶:
  1. Use a projection shortcut from ℝ⁶ → ℝ⁴:

(Variants: you could also project F(x) up to ℝ⁶ and add to x, but typically you project the shortcut when changing widths.)

Connections #

Transformers

Related next nodes you’ll often want after this:

Quality: A (4.4/5)

← back to treebrowse all →