Transformers


Transformers #

Machine Learning · Difficulty: ★★★★★ · Depth: 14 · Unlocks: 0

Attention-based architecture. Multi-head attention, positional encoding.


Core Concepts #

Key Symbols & Notation #

d_model (model hidden dimension)

h (number of attention heads)

d_k (per-head key/query dimension)

Essential Relationships #

Prerequisites (7) #

Attention Mechanisms (6 atoms)

Layer Normalization (6 atoms)

Positional Encoding (6 atoms)

Token Embeddings (6 atoms)

Softmax and Logits (5 atoms)

Residual (Skip) Connections (5 atoms)

Sequence Masking (causal and padding masks) (5 atoms)

Advanced Learning Details

Graph Position #

Depth Cost: 290

Fan-Out (ROI): 0

Bottleneck Score: 0

Chain Length: 14

Cognitive Load #

Atomic Elements: 9

Total Elements: 45

Percentile Level: L3

Atomic Level: L4

All Concepts (15) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Transformers are the first widely successful neural architecture where the main “engine” is not recurrence or convolution, but a learned, content-based routing system: attention. Once you understand the exact mechanics of scaled dot-product attention, multi-head attention, and the Transformer block (attention + feed-forward + residual + layer norm), most modern language and vision models become variations on a single theme.

TL;DR:

A Transformer processes a sequence by projecting token representations into queries, keys, and values, computing attention weights via softmax(QKᵀ/√d_k) (with masks as needed), mixing values with those weights, and then applying a position-wise MLP—each sublayer wrapped with residual connections and layer normalization. Multi-head attention repeats attention in parallel with different projections, concatenates head outputs, and mixes them back to d_model. Stacking these blocks yields encoders and decoders; decoders add causal masking and cross-attention to an encoder output.

What Is a Transformer? #

Why Transformers exist (motivation) #

Before Transformers, sequence modeling was dominated by RNNs/LSTMs/GRUs and CNN-based sequence models. Those families have two persistent pain points:

  1. Long-range dependencies are hard. Even with gating, recurrent models struggle to move information across hundreds or thousands of steps.

  2. Parallelism is limited. Recurrence is inherently sequential: to compute step t, you need step t−1. That slows training.

Transformers address both by making the core operation all-to-all token interaction in a single layer: every token can “look at” every other token (subject to masking). This interaction is differentiable, learnable, and highly parallelizable on GPUs/TPUs.

The Transformer idea in one sentence #

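A Transformer layer repeatedly does:

  1. Token mixing: multi-head attention lets each position gather information from other positions.

  2. Per-token processing: a position-wise feed-forward network (MLP) transforms each position independently.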

Crucially, both steps are wrapped in residual connections and layer normalization for stable optimization.

What a token representation looks like #

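Assume a sequence length L and model width d_model.

Each token is embedded as a vector eᵢ ∈ ℝ^{d_model}; stacking the L embeddings row by row gives E ∈ ℝ^{L×d_model}.

Because attention has no inherent sense of order, we add positional information:

X₀ = E + P, where P ∈ ℝ^{L×d_model} holds one positional vector per position.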

The three canonical Transformer families #

Family | Primary use | Blocks | Key masking
Encoder-only (e.g., BERT-style) | Understanding, classification, bidirectional context | self-attn + FFN | Padding mask only (no causal mask)
Decoder-only (e.g., GPT-style) | Autoregressive generation | masked self-attn + FFN | Causal + padding masks
Encoder–decoder (original) | Seq2seq (translation, summarization) | encoder: self-attn + FFN; decoder: masked self-attn + cross-attn + FFN | Decoder uses causal; cross-attn uses padding mask on encoder side

This lesson focuses on the core mechanics that all of these share: scaled dot-product attention, multi-head attention, and the Transformer layer structure.

Core Mechanic 1: Scaled Dot-Product Self-Attention (Q, K, V) #

Why attention is “routing” #

Suppose token i needs information from token j (e.g., a pronoun needs its antecedent). Attention lets token i compute a weighted mixture of other tokens’ information.

The key design choice is: weights depend on content (learned similarity), not just distance.

Queries, keys, values (the roles) #

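Each token representation xᵢ is linearly projected into three vectors:

qᵢ = xᵢW_Q (query: what this token is looking for)

kᵢ = xᵢW_K (key: what this token offers for matching)

vᵢ = xᵢW_V (value: the content this token contributes)

In matrix form for a whole sequence X ∈ ℝ^{L×d_model}:

Q = XW_Q, K = XW_K, V = XW_V

where W_Q, W_K, W_V are learned matrices.

Typically:

W_Q, W_K ∈ ℝ^{d_model×d_k} and W_V ∈ ℝ^{d_model×d_v}, so Q, K ∈ ℝ^{L×d_k} and V ∈ ℝ^{L×d_v}.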

In most standard implementations, d_v = d_k = d_model / h per head.

Attention scores and why we scale by √d_k #

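The raw similarity between token i and token j is the dot product:

sᵢⱼ = qᵢ · kⱼ

In matrix form, the score matrix is:

S = QKᵀ ∈ ℝ^{L×L}

If q and k components are independent with zero mean and roughly unit variance, then the variance of the dot product grows with dimension:

Var(qᵢ · kⱼ) ≈ d_k, so typical score magnitudes grow like √d_k.

Large magnitudes push softmax into saturation, giving tiny gradients. To keep logits in a reasonable range, we scale:

Ŝ = QKᵀ / √d_k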

This is “scaled dot-product attention.”
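A quick numerical check can make the scaling argument concrete. The sketch below assumes query and key components are independent standard normal draws (an idealization, not a property of trained models) and uses a hypothetical helper, `dot_variance`, to estimate the variance of q · k with and without the 1/√d_k factor.

```python
import random

def dot_variance(d_k, n_samples=2000, scale=False, seed=0):
    """Estimate Var(q . k) for random q, k with i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        # Optionally apply the 1/sqrt(d_k) scaling used in attention.
        dots.append(s / d_k ** 0.5 if scale else s)
    mean = sum(dots) / n_samples
    return sum((s - mean) ** 2 for s in dots) / n_samples

for d_k in (4, 64):
    unscaled = dot_variance(d_k)
    scaled = dot_variance(d_k, scale=True)
    print(f"d_k={d_k}: unscaled variance ~ {unscaled:.1f}, scaled ~ {scaled:.2f}")
```

Under these assumptions the unscaled variance should land near d_k, while the scaled variance should stay near 1 regardless of d_k.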

Softmax to turn scores into weights #

For each query position i, we take a softmax over j:

Aᵢⱼ = exp(Ŝᵢⱼ) / ∑ₜ exp(Ŝᵢₜ), where Ŝ = QKᵀ/√d_k is the scaled score matrix.

So Aᵢⱼ ≥ 0 and ∑ⱼ Aᵢⱼ = 1.

Interpretation: row i of A is a probability distribution over which tokens i will attend to.

Masking (padding and causal) #

Masks are incorporated by adding −∞ (or a very negative number) to disallowed positions before softmax.

Let M ∈ ℝ^{L×L} where:

Mᵢⱼ = 0 if position i may attend to position j, and Mᵢⱼ = −∞ otherwise.

Then:

A = softmax(QKᵀ/√d_k + M)

Two common masks:

  1. Padding mask: disallow attending to padding tokens.

  2. Causal mask (decoder): disallow attending to future tokens, i.e., j > i.

Weighted sum of values #

Finally, the output at each position i is a weighted sum of values:

oᵢ = ∑ⱼ Aᵢⱼ vⱼ

In matrix form:

O = AV

Where O ∈ ℝ^{L×d_v}.

Putting it together (single-head self-attention) #

The full formula:

Attention(Q, K, V) = softmax(QKᵀ/√d_k + M) V
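The single-head computation described in this section (scores, scaling, optional additive mask, row-wise softmax, weighted sum of values) can be sketched in plain Python. This is a minimal illustration on lists of lists, not an efficient implementation; the helper names `softmax`, `matmul`, and `attention` are chosen here for illustration, and the inputs are the small matrices from Worked Example 1.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(Q, K, V, M=None):
    """Return (O, A) with A = softmax(Q K^T / sqrt(d_k) + M) and O = A V."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scaled = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_t)]
    if M is not None:
        # Additive mask: 0 keeps a position, -inf removes it.
        scaled = [[s + m for s, m in zip(s_row, m_row)]
                  for s_row, m_row in zip(scaled, M)]
    A = [softmax(row) for row in scaled]
    return matmul(A, V), A

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

O, A = attention(Q, K, V)                      # no mask
causal = [[0.0, -math.inf], [0.0, 0.0]]        # token 1 may not see token 2
O_causal, A_causal = attention(Q, K, V, M=causal)
print(O)
print(O_causal)
```

With the causal mask, the first row of the attention weights puts all its mass on the first token, so the first output row equals the first value vector.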

A geometric intuition #

Dot products measure alignment. If qᵢ points in a similar direction to kⱼ, token i will attend more to token j. But because W_Q and W_K are learned, the model can invent similarity notions that match the task (syntax, coreference, topic, etc.).

Complexity note (why long context is expensive) #

Self-attention builds an L×L matrix of scores. That’s:

O(L² · d_k) time to compute the scores and O(L²) memory to store them, per head and per layer.

This quadratic dependence is why long-context Transformers require approximations or architectural tricks—but the core mechanism remains the same.

Core Mechanic 2: Multi-Head Attention (MHA) #

Why one attention “view” isn’t enough #

A single attention map must decide one set of weights per token. But language often requires tracking multiple relationships at once, such as syntax, coreference, and topic.

Multi-head attention lets the model compute multiple attention distributions in parallel, each in its own learned subspace.

The shape story: d_model, h, d_k #

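Let:

d_model = model hidden dimension

h = number of attention heads

d_k = per-head key/query dimension (d_v is the per-head value dimension)

Commonly:

d_k = d_v = d_model / h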

Example: d_model = 768, h = 12 ⇒ d_k = 64.

Per-head projections #

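For head r ∈ {1,…,h}, we have separate projection matrices:

W_Q^{(r)}, W_K^{(r)} ∈ ℝ^{d_model×d_k} and W_V^{(r)} ∈ ℝ^{d_model×d_v}

Compute:

Q^{(r)} = XW_Q^{(r)}, K^{(r)} = XW_K^{(r)}, V^{(r)} = XW_V^{(r)}

Then each head output:

O^{(r)} = softmax(Q^{(r)} (K^{(r)})ᵀ / √d_k + M) V^{(r)}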

Concatenate and mix #

Concatenate head outputs along the feature dimension:

O_concat = [O^{(1)} | O^{(2)} | … | O^{(h)}]

If each O^{(r)} ∈ ℝ^{L×d_v} and d_v = d_k, then O_concat ∈ ℝ^{L×(h d_k)} = ℝ^{L×d_model}.

Then apply a final learned output projection:

MHA(X) = O_concat W_O, with W_O ∈ ℝ^{d_model×d_model}

This final mixing matters: it lets the model combine information across heads.

Self-attention vs cross-attention inside MHA #

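Multi-head attention is a pattern, and it can be used in different places:

Self-attention: queries, keys, and values all come from the same sequence.

Cross-attention: queries come from the decoder, while keys and values come from the encoder output.

Cross-attention formula (X_dec is the decoder state, X_enc is the encoder output):

Q = X_dec W_Q, K = X_enc W_K, V = X_enc W_V, then Attention(Q, K, V) as before.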

Here the decoder learns to retrieve information from the encoded source sequence.
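To illustrate how cross-attention differs from self-attention only in where Q, K, and V come from, here is a minimal single-head sketch. It assumes the projections have already been applied, so `Q_dec`, `K_enc`, and `V_enc` are hypothetical, hand-picked matrices; note that the decoder and encoder lengths differ.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q_dec, K_enc, V_enc):
    """Each decoder query attends over all encoder positions."""
    d_k = len(K_enc[0])
    outputs = []
    for q in Q_dec:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K_enc]
        weights = softmax(logits)  # one weight per encoder position
        outputs.append([sum(w * v[c] for w, v in zip(weights, V_enc))
                        for c in range(len(V_enc[0]))])
    return outputs

# 2 decoder positions, 3 encoder positions, d_k = 2, d_v = 3 (illustrative values).
Q_dec = [[1.0, 0.0], [0.0, 2.0]]
K_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

out = cross_attention(Q_dec, K_enc, V_enc)
print(len(out), len(out[0]))  # decoder length by d_v
```

The output has one row per decoder position, while the attention weights in each row span the encoder positions.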

Practical interpretation of heads #

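A helpful mental model:

Queries and keys decide where to look: they determine the attention weights.

Values (together with W_O) decide what to bring back: they determine the content that gets mixed into the output.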

So attention isn’t only where to look; it’s also what to bring back.

A note on head dimension choice #

Holding d_model fixed, increasing h decreases d_k. There is a trade-off:

Choice | Benefit | Cost
More heads (higher h) | More parallel subspaces | Smaller d_k per head (less capacity per head), overhead
Fewer heads | More capacity per head | Fewer distinct attention patterns

Empirically, standard settings (8–32 heads depending on width) work well, but variants exist (multi-query, grouped-query attention) to reduce memory/compute during decoding.

Core Mechanic 3: The Transformer Layer (Sublayers, Residuals, LayerNorm, FFN) #

Why the Transformer is a stack of simple blocks #

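A single attention layer can mix tokens once, but deep language understanding requires multiple rounds of:

  1. Mixing information across tokens (attention).

  2. Transforming the mixed features within each token (feed-forward).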

So Transformers stack identical blocks. The stability of deep stacking relies on residual connections and layer normalization.

The canonical encoder layer #

An encoder layer has two sublayers:

  1. Multi-head self-attention (MHA)

  2. Position-wise feed-forward network (FFN)

Each sublayer is wrapped by residual + layer norm. There are two common normalization conventions:

Convention | Pattern | Notes
Post-LN (original 2017) | X + Sublayer(X) → LN | Can be harder to optimize at great depth
Pre-LN (common today) | X → LN → Sublayer → X + … | Typically more stable for deep stacks

We’ll write pre-LN, since it’s widely used.

Pre-LN encoder layer equations #

Let X ∈ ℝ^{L×d_model} be the layer input.

(1) Attention sublayer

H = X + MHA(LN(X))

(2) Feed-forward sublayer

Y = H + FFN(LN(H))

Output is Y.
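The pre-LN wiring can be sketched independently of what the sublayers compute. In the minimal example below, `layer_norm` omits the learned gain and bias, and `mean_mix` and `zero_sublayer` are hypothetical stand-ins for real attention and feed-forward sublayers; only the residual and normalization structure is meant to be representative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean, unit variance (no gain/bias)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def pre_ln_layer(X, attn, ffn):
    """H = X + attn(LN(X)); Y = H + ffn(LN(H))."""
    attn_out = attn([layer_norm(row) for row in X])
    H = [[a + b for a, b in zip(x_row, s_row)] for x_row, s_row in zip(X, attn_out)]
    ffn_out = ffn([layer_norm(row) for row in H])
    return [[a + b for a, b in zip(h_row, f_row)] for h_row, f_row in zip(H, ffn_out)]

def zero_sublayer(X):
    """A sublayer that contributes nothing, so the residual path carries X through."""
    return [[0.0] * len(row) for row in X]

def mean_mix(X):
    """Toy token mixer: every position receives the average of all positions."""
    n = len(X)
    avg = [sum(row[c] for row in X) / n for c in range(len(X[0]))]
    return [list(avg) for _ in X]

X = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 2.0]]
Y_identity = pre_ln_layer(X, zero_sublayer, zero_sublayer)  # equals X
Y = pre_ln_layer(X, mean_mix, zero_sublayer)
print(Y_identity == X)
```

Because the normalization sits inside the residual branch, a sublayer that outputs zeros leaves the input exactly unchanged.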

What “position-wise FFN” means #

The FFN is the same MLP applied independently to each position.

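Typical form:

FFN(x) = σ(xW₁ + b₁)W₂ + b₂

Where:

W₁ ∈ ℝ^{d_model×d_ff} and W₂ ∈ ℝ^{d_ff×d_model}, with d_ff the hidden width (often 4·d_model).

σ is a nonlinearity such as ReLU or GELU.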

Even though FFN doesn’t mix tokens, it adds substantial capacity: it can reshape and recombine features within each token vector.
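A small sketch can show what "position-wise" means in practice: the same two-layer MLP is applied to every row, so reordering the rows only reorders the outputs. The weights below are arbitrary illustrative values with d_model = 2 and d_ff = 4, and ReLU is assumed as the nonlinearity.

```python
def relu(v):
    return max(0.0, v)

def ffn_token(x, W1, b1, W2, b2):
    """Apply relu(x W1 + b1) W2 + b2 to a single token vector x."""
    hidden = [relu(sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hj * w for hj, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def ffn(X, W1, b1, W2, b2):
    """Apply the same MLP independently to every position (row) of X."""
    return [ffn_token(x, W1, b1, W2, b2) for x in X]

# Illustrative weights: W1 is d_model x d_ff, W2 is d_ff x d_model.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.0, 0.0]

X = [[1.0, 2.0], [-1.0, 0.5], [0.0, 3.0]]
Y = ffn(X, W1, b1, W2, b2)
Y_reversed = ffn(X[::-1], W1, b1, W2, b2)
print(Y[0])                   # [3.0, -1.0]
print(Y_reversed == Y[::-1])  # True: no information moves between positions
```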

The decoder layer #

A decoder layer adds one more attention sublayer:

  1. Masked multi-head self-attention (causal)

  2. Cross-attention over encoder outputs (encoder–decoder models only; absent in decoder-only models)

  3. Feed-forward network

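Pre-LN decoder (encoder–decoder) sketch, with X the layer input and X_enc the encoder output:

H₁ = X + MaskedMHA(LN(X))

H₂ = H₁ + CrossMHA(LN(H₁), X_enc)

Y = H₂ + FFN(LN(H₂))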

If it’s decoder-only (GPT-style), you omit the cross-attention term.

Why residual connections matter (conceptual) #

Residuals let the model learn modifications rather than complete rewrites.

If a sublayer initially does something unhelpful, the residual path preserves the input:

Y = X + Sublayer(X), so Y ≈ X whenever Sublayer(X) ≈ 0.

This makes gradients flow more directly through many layers. In deep Transformers (dozens to hundreds of layers), residual pathways are essential.

Why layer normalization is placed around sublayers #

Layer norm stabilizes the scale of activations, making training less sensitive to initialization and learning rate.

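Layer norm operates per token vector xᵢ:

LN(xᵢ) = γ ⊙ (xᵢ − μᵢ) / √(σᵢ² + ε) + β

where μᵢ and σᵢ² are the mean and variance over the d_model features of xᵢ, and γ, β are learned.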

(You already know LN; here it matters because attention logits and FFN activations can drift in magnitude as depth increases.)

Where positional encoding enters #

Positional information is usually added once at the bottom:

X₀ = E + P (token embeddings plus positional encodings)

Then all layers operate on X₀, X₁, …, X_N.
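As one concrete instance of adding positions at the bottom of the stack, the sketch below uses the fixed sinusoidal encoding from the original Transformer paper. That choice is an assumption for illustration (learned position embeddings are equally common), and `sinusoidal_positions` and `add_positions` are hypothetical helper names.

```python
import math

def sinusoidal_positions(L, d_model):
    """P[pos][2i] = sin(pos / 10000^(2i/d_model)), P[pos][2i+1] = cos(same angle)."""
    P = []
    for pos in range(L):
        row = []
        for c in range(d_model):
            angle = pos / (10000 ** ((2 * (c // 2)) / d_model))
            row.append(math.sin(angle) if c % 2 == 0 else math.cos(angle))
        P.append(row)
    return P

def add_positions(E, P):
    """X0 = E + P, element-wise."""
    return [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(E, P)]

# Three copies of the same token embedding: identical before positions are added.
E = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
P = sinusoidal_positions(3, 4)
X0 = add_positions(E, P)
print(P[0])            # [0.0, 1.0, 0.0, 1.0]
print(X0[0] != X0[1])  # True: the same token at different positions now differs
```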

There are alternatives (relative position bias, rotary embeddings) that inject position inside each attention layer instead, but the principle remains: attention needs a way to distinguish positions.

Application/Connection: How Transformers Are Used (Encoder, Decoder, Training Objectives, and Practical Concerns) #

Encoder-only Transformers (bidirectional) #

Use case: classification, retrieval, tagging, masked language modeling.

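Mechanics:

Every layer uses self-attention with a padding mask only, so each token sees the whole sequence in both directions.

To classify, you might pool the final token vectors into one sequence vector, for example by taking the output at a dedicated classification token ([CLS] in BERT-style models) or by averaging over non-padding positions.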

Decoder-only Transformers (autoregressive) #

Use case: text generation, code generation, next-token prediction.

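Mechanics:

Self-attention uses a causal mask (plus a padding mask for batched sequences), so position i sees only positions j ≤ i.

The output at position i is projected to vocabulary logits that predict token i+1.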

Training objective (next-token):

If logits at position i are zᵢ ∈ ℝ^{|Vocab|}, then:

p(x_{i+1} | x_{≤i}) = softmax(zᵢ)

Loss is cross-entropy summed across positions.
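The next-token objective amounts to shifting the targets by one position and summing the negative log-probabilities. The sketch below assumes `logits[i]` scores the token at position i+1 and uses a hypothetical helper name, `next_token_loss`; real training code usually averages over tokens and batches as well.

```python
import math

def log_softmax(z):
    """Log of softmax, computed stably via the log-sum-exp trick."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def next_token_loss(logits, tokens):
    """Sum of -log p(tokens[i+1] | tokens[:i+1]) over all predictable positions.

    logits[i] is the score vector produced at position i; the logits at the
    final position have no target inside the sequence and are ignored.
    """
    total = 0.0
    for i in range(len(tokens) - 1):
        total -= log_softmax(logits[i])[tokens[i + 1]]
    return total

vocab_size = 4
tokens = [1, 3, 2]
uniform_logits = [[0.0] * vocab_size for _ in tokens]
loss = next_token_loss(uniform_logits, tokens)
print(loss, 2 * math.log(vocab_size))  # a uniform model pays log|Vocab| per prediction
```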

Encoder–decoder Transformers (seq2seq) #

Use case: translation, summarization, speech-to-text.

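Mechanics:

The encoder processes the source sequence with bidirectional self-attention (padding mask only).

The decoder generates the target with causal self-attention and cross-attends to the encoder output at every layer.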

Practical issues that shape real implementations #

1) KV caching during decoding #

Autoregressive decoding generates one token at a time. Recomputing attention over the whole prefix is expensive. Instead, store previous keys/values.

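At time step t:

  1. Compute q_t, k_t, v_t for the new token only.

  2. Append k_t and v_t to the cached keys and values from steps 1…t−1.

  3. Attend with q_t over all t cached keys and values to produce the output for position t.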

This reduces the per-step attention score cost from O(t²) (recomputing scores for the whole prefix) to roughly O(t) (one new query against t cached keys). The cost of each step still grows linearly with context length.
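To see that caching changes the cost but not the result, the sketch below compares step-by-step decoding with a key/value cache against full causal attention over the whole sequence. It is a single-head toy with hand-picked, already-projected vectors; `decode_with_cache` and `full_causal` are hypothetical names.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def decode_with_cache(qs, ks, vs):
    """Process one token per step, attending over the cached keys and values."""
    k_cache, v_cache, outputs = [], [], []
    for q_t, k_t, v_t in zip(qs, ks, vs):
        k_cache.append(k_t)  # only the new key/value is added at each step
        v_cache.append(v_t)
        logits = [sum(a * b for a, b in zip(q_t, k)) / math.sqrt(len(q_t))
                  for k in k_cache]
        w = softmax(logits)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, v_cache))
                        for c in range(len(v_t))])
    return outputs

def full_causal(qs, ks, vs):
    """Reference: full L x L scores with -inf above the diagonal."""
    L, d_k = len(qs), len(qs[0])
    outputs = []
    for i in range(L):
        logits = [sum(a * b for a, b in zip(qs[i], ks[j])) / math.sqrt(d_k)
                  if j <= i else -math.inf for j in range(L)]
        w = softmax(logits)
        outputs.append([sum(w[j] * vs[j][c] for j in range(L))
                        for c in range(len(vs[0]))])
    return outputs

qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
ks = [[1.0, 1.0], [0.0, 1.0], [2.0, -1.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cached = decode_with_cache(qs, ks, vs)
reference = full_causal(qs, ks, vs)
print(all(abs(a - b) < 1e-12 for r1, r2 in zip(cached, reference)
          for a, b in zip(r1, r2)))
```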

2) Attention masks are not optional #

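If you forget masking:

Without a padding mask, real tokens attend to padding positions and their outputs are contaminated by meaningless vectors.

Without a causal mask, a decoder sees future tokens during training and learns to rely on information it will not have at inference.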

Mechanically, masks must be added before softmax.

3) Initialization and normalization choices #

Deep Transformers are sensitive to the scale of weight initialization and to where layer normalization is placed (pre-LN vs post-LN).

These details often decide whether training is stable.

4) Why positional encoding is central #

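Self-attention is permutation-equivariant without positional signals (reordering the inputs only reorders the outputs):

SelfAttention(ΠX) = Π SelfAttention(X) for any permutation matrix Π.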

If you permute tokens, QKᵀ permutes correspondingly, producing the same pattern up to permutation. Positional encoding breaks this symmetry, enabling order-sensitive tasks.
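The symmetry argument can be checked numerically. The sketch below runs single-head self-attention with identity projections (Q = K = V = X, an assumption made only to keep the example short) on a sequence and on a shuffled copy, then confirms that the outputs are the same rows in shuffled order.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Self-attention with Q = K = V = X and no positional information."""
    d_k = len(X[0])
    outputs = []
    for q in X:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X])
        outputs.append([sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d_k)])
    return outputs

X = [[1.0, 0.0], [0.5, -1.0], [2.0, 0.3]]
perm = [2, 0, 1]                  # new position i holds old token perm[i]
X_perm = [X[p] for p in perm]

out = self_attention(X)
out_perm = self_attention(X_perm)

# Output at new position i should match the output the same token had before.
same = all(abs(a - b) < 1e-9
           for i, p in enumerate(perm)
           for a, b in zip(out_perm[i], out[p]))
print(same)
```

Without positional signals, the model therefore cannot tell the shuffled sequence from the original one.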

Connecting the mechanics to behavior #

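When you inspect trained models, attention heads often develop recognizable patterns, for example attending to the previous or next token, attending to delimiter or separator tokens, or tracking syntactic relations and coreference.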

Not every head is interpretable, and attention weights are not the whole story (FFN and residual streams matter), but the routing intuition remains useful.

Summary of the full forward pass (encoder-only) #

Given tokens → embeddings + positions:

X₀ = E + P

For ℓ = 1…N layers (pre-LN):

H_ℓ = X_{ℓ−1} + MHA(LN(X_{ℓ−1}))

X_ℓ = H_ℓ + FFN(LN(H_ℓ))

Final outputs X_N feed task heads (classification, MLM head, etc.).

For decoder-only, the same structure holds, but with causal masking and an output projection to vocabulary logits.

Worked Examples (3) #

Worked Example 1: Compute single-head scaled dot-product attention by hand (tiny numbers) #

We will compute attention for a sequence of L = 2 tokens with d_k = d_v = 2. Use no mask. Let

Q = [[1, 0], [0, 1]]

K = [[1, 0], [1, 1]]

V = [[1, 2], [3, 4]]

All matrices are in row-major form: each row corresponds to a token position.

  1. Step 1: Compute raw score matrix S = QKᵀ.

    Kᵀ = [[1, 1], [0, 1]]

    S = QKᵀ = [[1, 0], [0, 1]] [[1, 1], [0, 1]]

    = [[1⋅1 + 0⋅0, 1⋅1 + 0⋅1], [0⋅1 + 1⋅0, 0⋅1 + 1⋅1]]

    = [[1, 1], [0, 1]]

  2. Step 2: Scale by √d_k. Here d_k = 2, so √d_k = √2.

    Ŝ = S / √2 = [[1/√2, 1/√2], [0/√2, 1/√2]]

    ≈ [[0.7071, 0.7071], [0, 0.7071]]

  3. Step 3: Apply softmax row-wise to get attention weights A.

    Row 1: softmax([0.7071, 0.7071]) = [0.5, 0.5]

    Row 2: softmax([0, 0.7071])

    Compute exp values:

    exp(0) = 1

    exp(0.7071) ≈ 2.028

    Sum ≈ 3.028

    So row 2 ≈ [1/3.028, 2.028/3.028] ≈ [0.330, 0.670]

    Thus

    A ≈ [[0.5, 0.5], [0.33, 0.67]]

  4. Step 4: Compute output O = AV.

    O₁ = 0.5⋅[1,2] + 0.5⋅[3,4] = [2,3]

    O₂ = 0.33⋅[1,2] + 0.67⋅[3,4]

    = [0.33 + 2.01, 0.66 + 2.68]

    = [2.34, 3.34] (approx)

    So

    O ≈ [[2.00, 3.00], [2.34, 3.34]]

Insight: Each output token becomes a convex combination of value vectors. Token 1 averaged both tokens equally; token 2 leaned more heavily on token 2 because q₂ aligned better with k₂ than k₁.

Worked Example 2: Shapes and parameterization of multi-head attention (sanity-checking dimensions) #

Let d_model = 8, number of heads h = 2. Then per-head dimension d_k = d_v = d_model / h = 4. Let sequence length L = 3. We will track tensor shapes through MHA self-attention and the output projection.

  1. Step 1: Start with input X ∈ ℝ^{L×d_model} = ℝ^{3×8}.

  2. Step 2: Project into per-head Q, K, V.

    For each head r:

    W_Q^{(r)} ∈ ℝ^{8×4}, W_K^{(r)} ∈ ℝ^{8×4}, W_V^{(r)} ∈ ℝ^{8×4}.

    Thus:

    Q^{(r)} = XW_Q^{(r)} ∈ ℝ^{3×4}

    K^{(r)} = XW_K^{(r)} ∈ ℝ^{3×4}

    V^{(r)} = XW_V^{(r)} ∈ ℝ^{3×4}

  3. Step 3: Compute attention scores per head.

    For head r:

    S^{(r)} = Q^{(r)} (K^{(r)})ᵀ.

    Shapes: (3×4)(4×3) = 3×3.

    Scaling by √d_k keeps shape 3×3.

    Softmax row-wise yields A^{(r)} ∈ ℝ^{3×3}.

  4. Step 4: Mix values.

    O^{(r)} = A^{(r)} V^{(r)}.

    Shapes: (3×3)(3×4) = 3×4.

    So each head returns 3×4.

  5. Step 5: Concatenate heads.

    O_concat = [O^{(1)} | O^{(2)}] ∈ ℝ^{3×8}.

    Because concatenation along features gives 4 + 4 = 8.

  6. Step 6: Output projection.

    W_O ∈ ℝ^{8×8}.

    Y = O_concat W_O ∈ ℝ^{3×8}.

    So MHA maps ℝ^{L×d_model} → ℝ^{L×d_model}, enabling residual addition X + Y.

Insight: Most Transformer components are designed so inputs and outputs share the same shape (L×d_model). That single design choice makes deep stacking with residual connections straightforward.
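The shape bookkeeping in this example can be reproduced with a small plain-Python sketch. The weights are random placeholders drawn with a fixed seed, purely to exercise the shapes (d_model = 8, h = 2, L = 3); `random_matrix` and `multi_head_attention` are hypothetical helper names, and no mask is applied.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    head_outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)   # each L x d_k
        d_k = len(K[0])
        scores = matmul(Q, [list(col) for col in zip(*K)])         # L x L
        A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
        head_outputs.append(matmul(A, V))                          # L x d_k
    # Concatenate per-head outputs along the feature dimension: L x (h * d_k).
    concat = [sum((O[i] for O in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)                                     # L x d_model

rng = random.Random(0)
L, d_model, h = 3, 8, 2
d_k = d_model // h
X = random_matrix(L, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(d_model, d_model, rng)

Y = multi_head_attention(X, heads, W_O)
residual = [[x + y for x, y in zip(x_row, y_row)] for x_row, y_row in zip(X, Y)]
print(len(Y), len(Y[0]))  # 3 8
```

Because Y has the same shape as X, the residual sum X + Y is well defined without any extra projection.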

Worked Example 3: Building a causal mask for a 4-token decoder self-attention #

We want a mask M for L = 4 such that position i can attend only to positions j ≤ i. We will express M as 0 for allowed and −∞ for disallowed entries, to be added to logits before softmax.

  1. Step 1: Write the allowed pattern.

    Row i shows which columns j are visible.

    i=1: can see [1]

    i=2: can see [1,2]

    i=3: can see [1,2,3]

    i=4: can see [1,2,3,4]

  2. Step 2: Create the matrix with −∞ above the diagonal.

    M =
    [[ 0, −∞, −∞, −∞],
     [ 0,  0, −∞, −∞],
     [ 0,  0,  0, −∞],
     [ 0,  0,  0,  0]]

  3. Step 3: Use it in attention.

    A = softmax( (QKᵀ)/√d_k + M )

    Because exp(−∞) = 0, all forbidden future positions get exactly zero probability after softmax.

  4. Step 4: Add padding masking if needed.

    If token 4 were padding, you would also set the entire column j=4 to −∞ so that no position attends to it.

    In practice, frameworks combine causal and padding masks by adding the additive masks together (or, equivalently, by AND-ing boolean "allowed" masks).

Insight: Masking is mathematically simple—just add −∞ before softmax—but conceptually essential: it encodes the information constraints that define the task (bidirectional understanding vs left-to-right generation).
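The mask construction from this example can be written out directly. The sketch below builds additive masks (0 for allowed, −∞ for disallowed) for L = 4 and combines a causal mask with a padding mask by addition; the function names are hypothetical, and the last token is treated as padding for illustration.

```python
import math

def causal_mask(L):
    """0 where j <= i (visible), -inf where j > i (future)."""
    return [[0.0 if j <= i else -math.inf for j in range(L)] for i in range(L)]

def padding_mask(is_pad):
    """-inf in every column whose key position is padding."""
    L = len(is_pad)
    return [[-math.inf if is_pad[j] else 0.0 for j in range(L)] for _ in range(L)]

def combine(M1, M2):
    """Additive masks combine by element-wise addition."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(M1, M2)]

M_causal = causal_mask(4)
M = combine(M_causal, padding_mask([False, False, False, True]))

for row in M:
    print(row)
```

Every row of the combined mask keeps at least one finite entry here, which matters in practice: a row that is entirely −∞ would make the softmax undefined.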

Key Takeaways #

Common Mistakes #

Practice #

easy

Given L = 3 and d_k = 1, suppose Q = [[2],[0],[1]], K = [[1],[3],[−1]], V = [[10],[20],[30]]. Compute S = QKᵀ, then A = softmax(S) row-wise (no scaling needed because √d_k = 1), then O = AV.

Hint: Sᵢⱼ = qᵢ kⱼ. Compute each row’s softmax separately. Keep results as exact exponentials if you prefer: softmax([a,b,c]) = [eᵃ, eᵇ, eᶜ]/(eᵃ+eᵇ+eᶜ).

Solution:

S = QKᵀ gives a 3×3 matrix.

Row 1 (q₁=2): [2⋅1, 2⋅3, 2⋅(−1)] = [2, 6, −2]

Row 2 (q₂=0): [0, 0, 0]

Row 3 (q₃=1): [1, 3, −1]

So S = [[2,6,−2],[0,0,0],[1,3,−1]].

A row-wise:

Row 1: softmax([2,6,−2]) = [e², e⁶, e^{−2}] / (e² + e⁶ + e^{−2}).

Row 2: softmax([0,0,0]) = [1/3,1/3,1/3].

Row 3: softmax([1,3,−1]) = [e¹, e³, e^{−1}] / (e¹ + e³ + e^{−1}).

O = AV where V = [10,20,30]ᵀ applied as weighted sum per row:

O₁ = 10A₁₁ + 20A₁₂ + 30A₁₃ ≈ 19.82 (row 1 is dominated by A₁₂ ≈ 0.98)

O₂ = (10+20+30)/3 = 20

O₃ = 10A₃₁ + 20A₃₂ + 30A₃₃ ≈ 18.99 (A₃ ≈ [0.117, 0.867, 0.016])

medium

You have a Transformer with d_model = 1024 and h = 16 heads. (a) What is d_k if you split evenly? (b) What are the shapes of W_Q, W_K, W_V per head? (c) What is the shape of the attention score matrix for a sequence length L = 128 in a single head?

Hint: Use d_k = d_model / h. Remember score matrix is QKᵀ with Q ∈ ℝ^{L×d_k} and K ∈ ℝ^{L×d_k}.

Solution:

(a) d_k = 1024 / 16 = 64.

(b) Per head, W_Q^{(r)} ∈ ℝ^{1024×64}, W_K^{(r)} ∈ ℝ^{1024×64}, W_V^{(r)} ∈ ℝ^{1024×64} (assuming d_v = d_k).

(c) For L = 128, Q and K are 128×64, so QKᵀ is (128×64)(64×128) = 128×128.

hard

Consider a decoder-only Transformer generating tokens left-to-right. Explain, using the attention formula, exactly where and how the causal mask changes the computation. Then describe what failure mode occurs if you omit the causal mask during training but keep it during inference.

Hint: Point to the logits matrix QKᵀ/√d_k and the additive mask M. Think about the model seeing future tokens during teacher forcing training.

Solution:

Causal masking modifies attention logits before softmax:

A = softmax( QKᵀ/√d_k + M_causal ).

M_causal has entries Mᵢⱼ = −∞ for j > i and 0 otherwise, forcing Aᵢⱼ = 0 for all future positions.

If you omit M_causal during training with teacher forcing, the model can attend to future ground-truth tokens to predict the next token. It learns a “cheating” solution that relies on information that will not be available at inference.

At inference, when you reintroduce the causal mask, those information paths disappear, causing a sharp performance drop: perplexity increases and generation quality degrades because the model’s learned dependencies are misaligned with the constraints at test time.

Connections #

Attention Mechanisms

Positional Encoding

Layer Normalization

Residual (Skip) Connections

Sequence Masking (causal and padding masks)

Softmax and Logits

Token Embeddings

Quality: A (4.3/5)
