Attention Mechanisms


Attention Mechanisms #

Machine Learning · Difficulty: ★★★★★ · Depth: 13 · Unlocks: 1

Weighted focus on input elements. Self-attention, cross-attention.


Core Concepts #

Key Symbols & Notation #

Q, K, V (query, key, value matrices or vectors)

Essential Relationships #

Prerequisites (8) #

Deep Learning (6 atoms), Matrix Calculus (6 atoms), Softmax Function (6 atoms), Cosine Similarity (6 atoms), Vector Embeddings (5 atoms), Sequence-to-Sequence Modeling (5 atoms), Affine Transformations (Linear Layers) (5 atoms), Embeddings (Dense Representations) (6 atoms)

Unlocks (1) #

Transformers (lvl 5)

Advanced Learning Details

Graph Position #

Depth Cost: 248

Fan-Out (ROI): 1

Bottleneck Score: 1

Chain Length: 13

Cognitive Load #

Atomic Elements: 6

Total Elements: 41

Percentile Level: L3

Atomic Level: L4

All Concepts (16) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

You’re building a machine translation system. The input is: “The animal didn’t cross the road because it was tired.” When generating “it”, the model must decide: does “it” refer to “animal” or “road”? In classic seq2seq, that decision is buried in a single hidden state bottleneck. Attention fixes this by letting the decoder look back and place a weighted focus over the relevant input tokens.

Attention layers can also fail in surprisingly silent ways. Two common failures: (1) applying softmax along the wrong axis (the model still trains, but it attends across the batch or feature dimension), and (2) mask leakage (future tokens “peek” through because of broadcasting or dtype mistakes). This lesson makes the mechanism precise enough that you can derive the shapes, verify the axes, and catch these bugs quickly.

TL;DR:

Attention computes relevance between a query and many keys, converts relevance scores into weights (softmax), and uses those weights to blend the corresponding value vectors. Self-attention uses Q,K,V from the same sequence; cross-attention uses queries from one sequence (e.g., decoder) and keys/values from another (e.g., encoder). The core formula is: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V, with masking and batching details crucial in practice.

What Are Attention Mechanisms? #

Why attention exists (the bottleneck story) #

In sequence-to-sequence modeling, we often want an output sequence to depend on different parts of the input at different times. Translation, summarization, speech recognition, and program synthesis all have this kind of alignment structure.

Older encoder–decoder RNNs forced the entire input sequence into one fixed-size vector (or a narrow channel through the final hidden state). This creates an information bottleneck: long sequences degrade because the decoder can’t selectively retrieve what it needs.

Attention removes the bottleneck by turning “memory” into a set of vectors (one per input element) and letting the model compute a weighted combination of those vectors each time it needs context.

The three roles: Query, Key, Value #

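Attention is easiest to understand by analogy to retrieval: a query describes what you are looking for, each key describes what a stored item offers, and each value is the content that item returns.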

The algorithm:

  1. Score how similar each key is to the query (relevance).
  2. Convert scores into a probability distribution (weights).
  3. Use weights to compute a weighted sum of values.

This is not just a metaphor; it’s literally what the math does.

A minimal single-query definition #

Suppose we have one query vector q ∈ ℝᵈ, and n keys/values {(kᵢ, vᵢ)} for i=1..n.

  1. Similarity scoring (dot-product attention):

$$s_i = \mathbf{q}^\top \mathbf{k}_i$$

  2. Score-to-weight via softmax:

$$\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^n \exp(s_j)}$$

  3. Aggregate values:

$$\mathbf{o} = \sum_{i=1}^n \alpha_i \, \mathbf{v}_i$$

Here o is the attention output (sometimes called the “context vector”).
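
The three steps map directly onto a few lines of code. The following is a minimal sketch in plain Python; the function name `single_query_attention` and the toy vectors are illustrative assumptions, not part of the lesson.

```python
import math

def single_query_attention(q, keys, values):
    """Dot-product attention for one query over n (key, value) pairs."""
    # 1. Scores: s_i = q . k_i
    scores = [sum(qc * kc for qc, kc in zip(q, k)) for k in keys]
    # 2. Softmax: alpha_i = exp(s_i) / sum_j exp(s_j)
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # 3. Output: o = sum_i alpha_i * v_i
    dim_v = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim_v)]
    return weights, output

# Hypothetical toy data: the query matches the first key most strongly.
weights, output = single_query_attention(
    q=[2.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
print(weights, output)
```

The weights always sum to 1, so the output is a convex combination of the value vectors and lies between them.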

Why Q, K, V are usually learned projections #

In neural networks, the input tokens already have embeddings xᵢ. We project them into Q/K/V spaces with learned affine transformations; in matrix form, Q = XW_Q, K = XW_K, V = XW_V (optionally with bias terms).

This matters because:

Self-attention vs cross-attention (source origin distinction) #

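This node emphasizes a crucial distinction: in self-attention, Q, K, and V all come from the same sequence; in cross-attention, the queries come from one sequence and the keys/values come from another.

In translation terms: self-attention relates the tokens of one sentence to each other, while cross-attention lets each decoder (target) position query the encoder's representations of the source sentence.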

Preview: the axis and mask pitfalls #

Attention is easy to write but easy to implement incorrectly.

We’ll keep returning to shapes and axes so you can debug these confidently.

Core Mechanic 1: Similarity Scoring (Q·Kᵀ and why scaling matters) #

Why scoring is the heart of attention #

If attention is “weighted focus,” then the score function decides what counts as relevant. The score is computed between a query and each key.

In practice, the most common scoring rule is dot-product similarity because it is fast on GPUs and works well with learned projections.

From one query to many queries: matrix form #

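Assume we have:

  • Q ∈ ℝ^(m×dₖ): m query vectors
  • K ∈ ℝ^(n×dₖ): n key vectors
  • V ∈ ℝ^(n×dᵥ): n value vectors

Compute all pairwise query–key scores:

$$S = QK^\top$$

Shapes: Q is (m×dₖ) and Kᵀ is (dₖ×n), so S is (m×n).

Interpretation: S_{ij} is the score between query i and key j, so row i holds the scores of query i against every key.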

This “score matrix” is the object you will mask, normalize, and use to weight V.

Why the scaling factor 1/√dₖ exists #

In Transformers, the standard formula is scaled dot-product attention:

$$S = \frac{QK^\top}{\sqrt{d_k}}$$

Motivation: dot products grow in magnitude with dimension.

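A rough variance argument: if the components of q and k are independent with mean 0 and variance 1, then q·k is a sum of dₖ independent terms, each with mean 0 and variance 1, so the dot product has mean 0 and variance dₖ.

So typical score magnitudes scale like √dₖ. Large magnitudes push softmax into saturation: one weight approaches 1, the others approach 0, and the gradients through the softmax become very small.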

Dividing by √dₖ keeps the score distribution more stable as dₖ changes.
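
The variance claim is easy to check empirically. The sketch below assumes query and key components are i.i.d. standard normal, which is an idealization that trained models need not satisfy; `dot_product_variance` is a hypothetical helper name.

```python
import random

def dot_product_variance(d_k, num_samples=2000, seed=0):
    """Empirical variance of q . k when all components are i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(num_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / num_samples
    return sum((x - mean) ** 2 for x in dots) / num_samples

for d_k in (4, 16, 64):
    raw = dot_product_variance(d_k)
    # Dividing scores by sqrt(d_k) divides their variance by d_k.
    scaled = raw / d_k
    print(f"d_k={d_k:3d}  var(q.k) ~ {raw:6.2f}  var(q.k / sqrt(d_k)) ~ {scaled:.2f}")
```

Under these assumptions the raw variance should track dₖ, while the scaled variance stays near 1 regardless of dimension.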

Alternative similarity scoring functions #

Dot product is not the only option. Historically, early attention used additive scoring.

| Scoring type | Formula (single pair) | Pros | Cons |
| --- | --- | --- | --- |
| Dot-product | s = qᵀk | Fast, simple, GPU-friendly | Can grow with dₖ (needs scaling) |
| Cosine similarity | s = (qᵀk) / (‖q‖ ‖k‖) | | |
| Additive (Bahdanau) | s = wᵀ tanh(W_q q + W_k k) | Flexible, can work well with smaller dims | Slower; less parallel-friendly |

Because you already know cosine similarity: note that dot-product attention can learn to behave like cosine similarity if the model learns to normalize representations (or learns norm control via layer norm / projection matrices). But in standard Transformers, the scaling is the main explicit normalization.

Shape discipline (the first line of defense against bugs) #

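When implementing scoring, always write down the shape of Q, the shape of K, and the shape you expect for S.

In a batched setting, Q is (B×m×dₖ), K is (B×n×dₖ), and S = QKᵀ (transposing only the last two axes of K) is (B×m×n).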

A common silent bug: transposing the wrong axes so you compute (B×dₖ×dₖ) or normalize over the wrong dimension.

Causal structure and “who can look at whom” #

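The score matrix S encodes potential connections: entry S_{ij} links query i to key j. In autoregressive decoding, query i must not be connected to keys at later positions j > i.

This restriction is applied not at the Q/K/V level but by masking the score matrix before softmax.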

We’ll treat masking carefully in the next mechanic because it interacts directly with the probability distribution.

Core Mechanic 2: Score-to-Weight (Softmax), Masking, and Weighted Aggregation #

Why we need a distribution, not raw scores #

Raw scores S_{ij} are unbounded real numbers. To create a “focus,” we need nonnegative weights that sum to 1 across keys for each query.

Softmax does exactly this, turning each query’s score row into a categorical distribution over keys.

The core formula (matrix form) #

Given scores:

$$S = \frac{QK^\top}{\sqrt{d_k}} \quad\in \mathbb{R}^{m\times n}$$

Compute attention weights:

$$A = \operatorname{softmax}(S)$$

Important: softmax is applied row-wise over the key dimension (size n). That means:

$$A_{ij} = \frac{\exp(S_{ij})}{\sum_{t=1}^{n} \exp(S_{it})}$$

Then aggregate values:

$$O = AV$$

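Shapes: A is (m×n) and V is (n×dᵥ), so O is (m×dᵥ).

Interpretation: row i of A is query i's distribution over the n keys, and row i of O is the corresponding weighted average of the value vectors.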
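
Put together, the matrix form is only a few lines. This is a minimal NumPy sketch for the unbatched, unmasked case; the function name and the sizes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q (m, d_k), K (n, d_k), V (n, d_v)."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                # (m, n) scores
    S = S - S.max(axis=-1, keepdims=True)     # stabilise; softmax is unchanged
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)     # row-wise: normalise over keys
    O = A @ V                                 # (m, d_v)
    return A, O

rng = np.random.default_rng(0)
m, n, d_k, d_v = 3, 5, 4, 2                   # hypothetical sizes
Q = rng.standard_normal((m, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
A, O = scaled_dot_product_attention(Q, K, V)
print(A.shape, O.shape)  # (3, 5) (3, 2)
```

Note that the number of queries m and the number of keys n are independent; only dₖ must match between Q and K, and only n must match between K and V.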

Masking: forbidding attention to certain positions #

Masking modifies S before softmax so forbidden positions get probability ≈ 0.

Two common masks:

  1. Padding mask (ignore pad tokens)
  2. Causal mask (prevent “future” access in autoregressive decoding)

Mechanically, we add a large negative number to masked scores:

$$S' = S + M$$

where M_{ij} = 0 if attention is allowed and M_{ij} = −∞ (or a large negative constant such as −10⁹) if it is disallowed.

Then:

$$A = \operatorname{softmax}(S')$$

Because exp(-∞) → 0, masked entries get weight 0.
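
Both mask types can be built from a boolean "allowed" array. The sketch below uses NumPy with uniform toy scores so the effect of each mask is easy to read off; `additive_mask` and `masked_softmax` are hypothetical helper names.

```python
import numpy as np

def additive_mask(allowed):
    """Boolean 'allowed' array -> additive mask: 0 where allowed, -inf elsewhere."""
    return np.where(allowed, 0.0, -np.inf)

def masked_softmax(S, M):
    S = S + M
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)                            # exp(-inf) is exactly 0.0
    return E / E.sum(axis=-1, keepdims=True)

m, n = 3, 4
S = np.zeros((m, n))                         # hypothetical uniform scores

# Padding mask: the last key is a pad token, so no query may attend to it.
key_is_real = np.array([True, True, True, False])
pad_mask = additive_mask(np.broadcast_to(key_is_real, (m, n)))

# Causal mask: query i may attend only to keys j <= i.
causal_mask = additive_mask(np.tril(np.ones((m, n), dtype=bool)))

A_pad = masked_softmax(S, pad_mask)
A_causal = masked_softmax(S, causal_mask)
print(A_pad)
print(A_causal)
```

One caveat with this formulation: a row in which every key is masked becomes 0/0 and yields NaN, so each query should keep at least one allowed key.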

The surprising failure mode: mask leakage via broadcasting #

Masks are often stored with shape (B×1×1×n) or (B×1×m×n) depending on implementation (especially with multi-head attention).

A common bug pattern:

Result: you mask the wrong dimension or the wrong positions. The model may still train but exhibits “cheating” (decoder sees future) or ignores padding improperly.

Practical discipline:
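
One way to apply that discipline is to assert on the mask shape before adding it to the scores. The NumPy sketch below shows how a mask on the wrong axis broadcasts without any error when m equals n; `check_mask` is a hypothetical helper, and its rules are one possible convention rather than a standard API.

```python
import numpy as np

B, h, m, n = 2, 3, 4, 4                      # m == n, so a wrong axis broadcasts silently
S = np.zeros((B, h, m, n))
key_is_real = np.array([[True, True, True, False],
                        [True, True, False, False]])            # (B, n)

good = np.where(key_is_real[:, None, None, :], 0.0, -np.inf)    # (B, 1, 1, n): masks keys
bad = np.where(key_is_real[:, None, :, None], 0.0, -np.inf)     # (B, 1, n, 1): masks queries

def check_mask(mask, scores):
    """Fail loudly unless the mask lines up with the key axis of the scores."""
    assert mask.ndim == scores.ndim, "mask must have the same rank as the scores"
    assert mask.shape[-1] == scores.shape[-1], "last mask axis must be the key axis"
    assert np.broadcast(mask, scores).shape == scores.shape

check_mask(good, S)                          # passes
print((S + good).shape, (S + bad).shape)     # both (2, 3, 4, 4): shape alone hides the bug
```

Here `check_mask(bad, S)` raises an `AssertionError`, whereas the addition `S + bad` succeeds and silently blanks out whole query rows instead of padded key columns.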

Softmax axis mistake (the other silent bug) #

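Given S of shape (B×m×n), the correct softmax axis is the last one (size n, the keys). Normalizing over the query axis (size m) or the batch axis (size B) produces a tensor of the same shape with a different meaning.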

If you accidentally normalize over queries, you enforce that each key distributes probability over queries, which is not the retrieval interpretation.

A quick invariant check: for every batch item and every query, the weights should sum to 1 along the key axis.
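
The row-sum invariant distinguishes the two cases even though the output shapes are identical. A minimal NumPy sketch with hypothetical sizes:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
B, m, n = 2, 3, 5
S = rng.standard_normal((B, m, n))

A_right = softmax(S, axis=-1)    # over keys
A_wrong = softmax(S, axis=1)     # over queries: same shape, different meaning

print(A_right.shape == A_wrong.shape)             # True
print(np.allclose(A_right.sum(axis=-1), 1.0))     # True
print(np.allclose(A_wrong.sum(axis=-1), 1.0))     # False
```

The last check must fail whenever m ≠ n: with the wrong axis each column sums to 1, so all entries of a batch item sum to n, which cannot equal the m that row sums of 1 would require.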

Numerical stability: subtract max #

Softmax can overflow if scores are large. The standard stable computation:

For each row i:

$$A_{ij} = \frac{\exp(S_{ij} - \max_t S_{it})}{\sum_{t=1}^{n} \exp(S_{it} - \max_u S_{iu})}$$

This doesn’t change results because subtracting a constant from all logits preserves softmax.
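
The difference shows up as soon as scores get large. The sketch below uses plain Python floats, where `math.exp` raises `OverflowError` for large arguments; array libraries typically return `inf` and then NaN instead. The function names are illustrative.

```python
import math

def softmax_naive(row):
    exps = [math.exp(s) for s in row]              # overflows for large scores
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(row):
    row_max = max(row)
    exps = [math.exp(s - row_max) for s in row]    # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]                   # same gaps, shifted by 999

# Shifting all logits by a constant leaves the weights unchanged.
print(softmax_stable(small))
print(softmax_stable(large))

try:
    softmax_naive(large)
except OverflowError as err:
    print("naive softmax failed:", err)
```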

Temperature and sharpness #

Sometimes you’ll see a temperature τ:

$$A = \operatorname{softmax}(S/\tau)$$

The Transformer’s √dₖ scaling can be interpreted as a kind of dimension-dependent temperature.
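
The effect of τ is easiest to see numerically: τ < 1 sharpens the distribution toward the largest score, and τ > 1 flattens it toward uniform. A small sketch with hypothetical scores:

```python
import math

def softmax_with_temperature(scores, tau=1.0):
    scaled = [s / tau for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.0]
for tau in (0.5, 1.0, 4.0):
    weights = softmax_with_temperature(scores, tau)
    print(tau, [round(w, 3) for w in weights])
```

In this reading, dividing raw scores QKᵀ by √dₖ corresponds to choosing τ = √dₖ.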

Weighted sum as linear algebra (and why it’s differentiable) #

Once you have A, the output is:

$$O = AV$$

This is a linear combination of V with coefficients from A.

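Differentiability: matrix multiplication and softmax are both differentiable, so gradients flow from O back to V directly and back to Q and K through A.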

To see the dependency explicitly for a single query i:

$$\mathbf{o}_i = \sum_{j=1}^n A_{ij} \, \mathbf{v}_j$$

If A_{ij} increases, oᵢ moves toward vⱼ.

A useful mental model: attention is “content-addressable memory” #

Keys provide an address space, queries pick addresses, values store content. The softmax makes it a soft (continuous) lookup rather than a hard index.

This is why attention can represent alignment: it’s literally learning a soft alignment matrix A.

At this point, you have the atomic concepts: similarity scoring, score-to-weight conversion (softmax with masking), and weighted aggregation.

Next we connect that to the self vs cross distinction in full architectural context.

Application/Connection: Self-Attention vs Cross-Attention (and how this becomes Transformers) #

Why the “origin of Q,K,V” matters #

Attention is a general operator: it maps (Q,K,V) to O. The difference between self- and cross-attention is simply where these tensors come from.

This origin choice encodes a modeling decision:

Self-attention: mixing information inside one sequence #

Let X ∈ ℝ^(L×d_model) be a sequence of L token embeddings (after adding positional information).

We compute:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Where W_Q, W_K ∈ ℝ^(d_model×dₖ) and W_V ∈ ℝ^(d_model×dᵥ) are learned projection matrices.

Then:

$$\operatorname{SelfAttn}(X) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

Mask M depends on the setting: encoder self-attention typically needs only a padding mask, while decoder self-attention adds a causal mask on top of any padding mask.

Interpretation: each token representation is updated by blending information from other tokens.

A key property: self-attention can connect tokens at arbitrary distance in one step (unlike RNNs where information must travel sequentially).
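
As a concrete reference, here is a minimal single-head self-attention sketch in NumPy with an optional causal mask. The random weights stand in for learned parameters, and the sizes are hypothetical.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V, causal=False):
    """Single-head self-attention over one sequence X of shape (L, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # all derived from the same X
    S = Q @ K.T / np.sqrt(Q.shape[-1])           # (L, L)
    if causal:
        L = X.shape[0]
        S = S + np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)
    S = S - S.max(axis=-1, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
L, d_model, d_k, d_v = 5, 8, 4, 4                # hypothetical sizes
X = rng.standard_normal((L, d_model))
W_Q = rng.standard_normal((d_model, d_k))        # stand-ins for learned weights
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

O, A = self_attention(X, W_Q, W_K, W_V, causal=True)
print(O.shape, A.shape)  # (5, 4) (5, 5)
```

Every position gets its output from a single matrix product over all allowed positions, which is the "arbitrary distance in one step" property in code form.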

Cross-attention: querying one sequence with another #

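In an encoder–decoder setup, the encoder produces source representations H ∈ ℝ^(L_src×d_model) and the decoder holds target-side representations Y ∈ ℝ^(L_tgt×d_model).

Cross-attention uses:

$$Q = YW_Q, \quad K = HW_K, \quad V = HW_V$$

So each decoder position forms a query based on what it has generated so far, and retrieves relevant source information.

Shape intuition: Q is (L_tgt×dₖ) and K is (L_src×dₖ), so the weight matrix A is (L_tgt×L_src).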

This matrix is literally an alignment between target positions and source positions.

Where multi-head attention fits (high-level, because Transformers unlock next) #

This node unlocks Transformers, where attention is typically multi-head.

Multi-head attention repeats the attention computation h times with different learned projections.

Head r has its own projection matrices W_Q^{(r)}, W_K^{(r)}, W_V^{(r)}, which produce Q_r, K_r, V_r, and it computes O_r = Attention(Q_r, K_r, V_r).

Each head produces O_r, then we concatenate and project:

$$O = \operatorname{Concat}(O_1, \dots, O_h) W_O$$

Why multiple heads help:

But the atomic mechanism remains exactly what you learned: score, softmax, weighted sum.
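
A common way to implement this is to compute one large projection and reshape it into heads rather than looping. The NumPy sketch below assumes dₖ = dᵥ = d_model / h and square projection matrices, which is a frequent convention but not the only one; all names and sizes are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X_q, X_kv, W_Q, W_K, W_V, W_O, h):
    """X_q: (B, L_q, d_model); X_kv: (B, L_k, d_model); weights: (d_model, d_model)."""
    B, L_q, d_model = X_q.shape
    L_k = X_kv.shape[1]
    d_head = d_model // h                                 # assumes h divides d_model

    def split_heads(T, L):
        # (B, L, d_model) -> (B, h, L, d_head)
        return T.reshape(B, L, h, d_head).transpose(0, 2, 1, 3)

    Q = split_heads(X_q @ W_Q, L_q)
    K = split_heads(X_kv @ W_K, L_k)
    V = split_heads(X_kv @ W_V, L_k)

    S = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)     # (B, h, L_q, L_k)
    A = softmax(S)                                        # normalise over L_k
    O = A @ V                                             # (B, h, L_q, d_head)
    O = O.transpose(0, 2, 1, 3).reshape(B, L_q, d_model)  # concatenate heads
    return O @ W_O, A

rng = np.random.default_rng(0)
B, h, L_q, L_k, d_model = 2, 2, 3, 4, 8                   # hypothetical sizes
X_q = rng.standard_normal((B, L_q, d_model))
X_kv = rng.standard_normal((B, L_k, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out, A = multi_head_attention(X_q, X_kv, W_Q, W_K, W_V, W_O, h)
print(out.shape, A.shape)  # (2, 3, 8) (2, 2, 3, 4)
```

Passing the same tensor as `X_q` and `X_kv` gives multi-head self-attention; passing decoder states and encoder outputs gives multi-head cross-attention.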

Deriving the batched, multi-head shapes (to prevent axis errors) #

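Let B be the batch size, h the number of heads, L_q the number of query positions, L_k the number of key positions, and dₖ, dᵥ the per-head key and value dimensions.

Typical shapes:

  • Q: (B×h×L_q×dₖ)
  • K: (B×h×L_k×dₖ)
  • V: (B×h×L_k×dᵥ)

Scores:

$$S = \frac{QK^\top}{\sqrt{d_k}} \quad \Rightarrow \quad (B\times h\times L_q\times L_k)$$

Softmax over L_k: A has shape (B×h×L_q×L_k), and the weights sum to 1 along the last axis.

Output: O = AV has shape (B×h×L_q×dᵥ).

This is where the earlier failure modes live: with four axes, it is easy to apply softmax over the wrong one, or to build a mask that broadcasts along L_q instead of L_k.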

A concrete debugging checklist (practical connection) #

When attention behaves oddly, check invariants:

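  1. Row sum invariant (per query): the weights over keys sum to 1.
  2. Mask invariant: masked positions receive weight 0.
  3. Causality invariant (decoder self-attention): changing a future token leaves the outputs at earlier positions unchanged.
  4. Sanity input test: feed a tiny input whose attention weights you can work out by hand.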
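
These checks are cheap to automate. The sketch below runs them against a stand-in layer, `causal_attention`, which is a hypothetical single-head causal self-attention with Q = K = V = X; the specific sanity input (identical tokens, which should give uniform weights over the allowed keys) is one possible choice, not one prescribed by the lesson.

```python
import numpy as np

def causal_attention(X):
    """Hypothetical stand-in layer: causal self-attention with Q = K = V = X."""
    L, d = X.shape
    S = X @ X.T / np.sqrt(d)
    S = S + np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)
    S = S - S.max(axis=-1, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ X, A

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
O, A = causal_attention(X)

# 1. Row sum invariant: every query's weights sum to 1.
assert np.allclose(A.sum(axis=-1), 1.0)
# 2. Mask invariant: forbidden (future) positions carry exactly zero weight.
assert np.all(np.triu(A, k=1) == 0.0)
# 3. Causality invariant: changing later tokens must not change earlier outputs.
X_perturbed = X.copy()
X_perturbed[4:] += 10.0
O_perturbed, _ = causal_attention(X_perturbed)
assert np.allclose(O[:4], O_perturbed[:4])
# 4. Sanity input: identical tokens give uniform weights over the allowed keys.
_, A_same = causal_attention(np.ones((6, 4)))
assert np.allclose(A_same[2, :3], 1.0 / 3.0)
print("all invariants hold")
```

The causality check is the most valuable of the four in practice, because it tests the layer's behavior end to end rather than inspecting the mask itself.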

How attention connects to the next node (Transformers) #

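Transformers stack attention layers with residual connections, layer normalization, position-wise feed-forward networks, and positional encodings.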

But none of those change what attention is. They make it trainable, stable, and expressive at scale.

If you can derive the score matrix shape and explain why softmax is row-wise, you’re ready to understand multi-head attention, positional encoding, and full Transformer blocks.

Worked Examples (3) #

Worked Example 1: Compute attention weights and output by hand (single query) #

We have 1 query and 3 key/value pairs. Use unscaled dot-product attention for simplicity.

Let q = [1, 0].

Keys: k₁ = [1, 0], k₂ = [0, 1], k₃ = [1, 1].

Values: v₁ = [10, 0], v₂ = [0, 10], v₃ = [5, 5].

Compute scores sᵢ = qᵀkᵢ, weights α via softmax, and output o = ∑ αᵢ vᵢ.

  1. Step 1: Compute dot-product scores

    s₁ = [1,0]·[1,0] = 1

    s₂ = [1,0]·[0,1] = 0

    s₃ = [1,0]·[1,1] = 1

  2. Step 2: Softmax normalization

    Compute exp scores:

    exp(s₁)=e¹,

    exp(s₂)=e⁰=1,

    exp(s₃)=e¹

    Sum = e + 1 + e = 2e + 1

    So:

    α₁ = e/(2e+1)

    α₂ = 1/(2e+1)

    α₃ = e/(2e+1)

  3. Step 3: Weighted sum of values

    o = α₁v₁ + α₂v₂ + α₃v₃

    = α₁[10,0] + α₂[0,10] + α₃[5,5]

    First component:

    o₁ = 10α₁ + 0α₂ + 5α₃ = 10α₁ + 5α₃

    Second component:

    o₂ = 0α₁ + 10α₂ + 5α₃ = 10α₂ + 5α₃

  4. Step 4: Substitute α values

    Because α₁ = α₃ = e/(2e+1):

    o₁ = 10·e/(2e+1) + 5·e/(2e+1) = 15e/(2e+1)

    o₂ = 10·1/(2e+1) + 5·e/(2e+1) = (10 + 5e)/(2e+1)

Insight: Even though k₁ and k₃ tie on relevance, the output is not just “pick one”: it blends v₁ and v₃ heavily, with a smaller contribution from v₂. Attention is a soft retrieval mechanism; ties and near-ties naturally produce mixtures.
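
The hand computation can be checked in a few lines of plain Python. The script compares the numeric result with the closed forms derived in Step 4.

```python
import math

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

scores = [sum(a * b for a, b in zip(q, k)) for k in keys]      # [1, 0, 1]
exps = [math.exp(s) for s in scores]
alphas = [x / sum(exps) for x in exps]
o = [sum(a * v[c] for a, v in zip(alphas, values)) for c in range(2)]

e = math.e
assert math.isclose(o[0], 15 * e / (2 * e + 1))
assert math.isclose(o[1], (10 + 5 * e) / (2 * e + 1))
# Approximately alphas = [0.4223, 0.1554, 0.4223] and o = [6.3348, 3.6652].
print([round(a, 4) for a in alphas], [round(x, 4) for x in o])
```

The two output components sum to 10 because every value vector's components sum to 10 and the weights sum to 1.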

Worked Example 2: Self-attention vs cross-attention shapes (and where softmax must apply) #

You have an encoder–decoder model.

Encoder sequence length L_src = 4, decoder length L_tgt = 3.

Model dimension d_model = 8.

Single-head attention with dₖ = dᵥ = 8.

Batch size B = 2.

Encoder outputs H have shape (B×L_src×d_model) = (2×4×8).

Decoder representations Y have shape (B×L_tgt×d_model) = (2×3×8).

Construct Q,K,V and determine the score matrix shape for:

  1. encoder self-attention

  2. decoder self-attention

  3. decoder cross-attention

  1. Part A: Encoder self-attention

    Q = HW_Q, K = HW_K, V = HW_V

    So Q,K,V each have shape (2×4×8).

    Scores S = QKᵀ:

    • Q is (2×4×8)
    • Kᵀ (over last two dims) is (2×8×4)

    So S is (2×4×4).

    Softmax must be over the last dimension (keys), so over size 4.

  2. Part B: Decoder self-attention

    Q,K,V come from Y, so each is (2×3×8).

    Scores S is (2×3×3).

    Softmax over the last dimension (keys) so each of 3 query positions has a distribution over 3 key positions.

    Additionally, apply a causal mask so query position t cannot attend to keys > t.

  3. Part C: Decoder cross-attention

    Q comes from Y: Q is (2×3×8).

    K,V come from H: K,V are (2×4×8).

    Scores S = QKᵀ gives shape (2×3×4).

    Softmax must be over the last dimension (keys), so over size 4 (the source positions).

    Padding mask applies to the encoder keys (length 4), not to decoder positions.

Insight: The single biggest implementation detail is: softmax normalizes across keys for each query. In cross-attention the key axis is L_src, not L_tgt. If you normalize over the wrong length, the model no longer expresses “which source tokens explain this target token?”
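
The three shape results can be confirmed with random tensors. This NumPy sketch reuses one pair of random projection matrices for all three cases purely to check shapes (real layers have separate learned weights) and omits the causal and padding masks; `attention_weights` is a hypothetical helper.

```python
import numpy as np

def attention_weights(query_source, key_source, W_Q, W_K):
    """softmax(Q K^T / sqrt(d_k)) for batched inputs of shape (B, L, d_model)."""
    Q = query_source @ W_Q                                  # (B, L_q, d_k)
    K = key_source @ W_K                                    # (B, L_k, d_k)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])     # (B, L_q, L_k)
    S = S - S.max(axis=-1, keepdims=True)
    A = np.exp(S)
    return A / A.sum(axis=-1, keepdims=True)                # softmax over keys

rng = np.random.default_rng(0)
B, L_src, L_tgt, d_model = 2, 4, 3, 8
H = rng.standard_normal((B, L_src, d_model))    # encoder outputs
Y = rng.standard_normal((B, L_tgt, d_model))    # decoder representations
W_Q = rng.standard_normal((d_model, 8))         # d_k = 8
W_K = rng.standard_normal((d_model, 8))

print(attention_weights(H, H, W_Q, W_K).shape)  # encoder self-attention: (2, 4, 4)
print(attention_weights(Y, Y, W_Q, W_K).shape)  # decoder self-attention: (2, 3, 3)
print(attention_weights(Y, H, W_Q, W_K).shape)  # decoder cross-attention: (2, 3, 4)
```

In every case the first argument determines the query axis and the second determines the key axis, which is the axis softmax normalizes over.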

Worked Example 3: Causal masking prevents future leakage (tiny matrix demonstration) #

Consider decoder self-attention with L = 3 tokens. We want token 1 (0-indexed) to attend only to keys 0..1.

Suppose scaled scores (already divided by √dₖ) for a single head and single batch item are:

S =
[ [2, 1, 0],
  [0, 3, 4],
  [1, 1, 1] ]

Apply a causal mask and compute the masked softmax weights for row 1 (the second query).

  1. Step 1: Write the causal mask M (0 allowed, -∞ forbidden)

    For L=3:

    M =
    [ [0, -∞, -∞],
      [0, 0, -∞],
      [0, 0, 0] ]

  2. Step 2: Mask the scores S' = S + M

    Row 1 (second query) originally: [0, 3, 4]

    After masking (disallow key 2): [0, 3, -∞]

  3. Step 3: Softmax row 1 stably

    Compute max = 3

    Subtract max: [0-3, 3-3, -∞] = [-3, 0, -∞]

    Exponentiate: [e^-3, 1, 0]

    Normalize: sum = e^-3 + 1

    So weights are:

    A = [ e^-3/(1+e^-3), 1/(1+e^-3), 0 ]

Insight: Without the mask, key 2 would dominate because score 4 is largest. With the mask, its probability is forced to 0. This illustrates why mask correctness is a security property for autoregressive models: a single broadcasting error can re-enable that last entry.
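
The same calculation in plain Python, as a check on the hand-derived weights. It relies on `math.exp(float("-inf"))` returning exactly 0.0; `masked_softmax_row` is an illustrative helper name.

```python
import math

NEG_INF = float("-inf")
S = [[2.0, 1.0, 0.0],
     [0.0, 3.0, 4.0],
     [1.0, 1.0, 1.0]]
L = len(S)

# Causal mask: 0 where key index <= query index, -inf otherwise.
M = [[0.0 if j <= i else NEG_INF for j in range(L)] for i in range(L)]

def masked_softmax_row(scores, mask):
    masked = [s + m for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(x - top) for x in masked]     # math.exp(-inf) == 0.0
    total = sum(exps)
    return [x / total for x in exps]

row1 = masked_softmax_row(S[1], M[1])
print([round(w, 4) for w in row1])   # approximately [0.0474, 0.9526, 0.0]
```

Passing a mask row of all zeros instead shows the leak: key 2, with score 4, would then receive the largest weight.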

Key Takeaways #

Common Mistakes #

Practice #

easy

You have Q ∈ ℝ^(5×16), K ∈ ℝ^(7×16), V ∈ ℝ^(7×32). What are the shapes of the score matrix S, attention weights A, and output O (single head, no batch)? Also: along which axis do you apply softmax?

Hint: Compute S = QKᵀ and track dimensions; softmax should normalize over keys for each query.

Solution:

S = QKᵀ has shape (5×7). A = softmax(S) has shape (5×7), with softmax applied over the last dimension of size 7 (the keys) for each of the 5 queries. O = AV has shape (5×32).

medium

Consider a decoder self-attention layer with sequence length L=4. Write the causal mask matrix M (entries 0 or −∞) that prevents attending to future tokens. Which entries are allowed for query position 2 (0-indexed)?

Hint: Allowed positions are keys with index ≤ query index.

Solution:

For L=4,

M =
[ [0, −∞, −∞, −∞],
  [0, 0, −∞, −∞],
  [0, 0, 0, −∞],
  [0, 0, 0, 0] ]

For query position 2, allowed keys are {0,1,2}; key 3 is forbidden.

hard

In cross-attention, a decoder has L_tgt=6 and an encoder has L_src=10. You compute scores S of shape (B×h×6×10). Suppose you accidentally apply softmax over the length-6 axis instead of length-10. Conceptually, what distribution are you computing, and why is it wrong for retrieval?

Hint: Ask: for a fixed query, do weights sum across keys? Or across queries?

Solution:

Softmax over the length-6 axis normalizes across queries (target positions) for each fixed key, producing a distribution like “how much does this source position contribute across different target queries,” rather than “which source positions are relevant for this target query.” Retrieval requires, for each query position, a distribution over keys (length 10). With the wrong axis, each query no longer forms a proper mixture over encoder values, so the mechanism can’t represent alignment from each target token to source tokens.

Connections #

Unlocks and extensions:

Related prerequisites and reinforcing nodes:

Quality: A (4.3/5)
