Affine Transformations (Linear Layers)

Affine Transformations (Linear Layers) #

Linear Algebra · Difficulty: ★★★☆☆ · Depth: 0 · Unlocks: 4

An affine transformation applies a linear map (matrix multiply) followed by a bias shift; in neural models this corresponds to learned linear layers that project inputs into query/key/value spaces. Recognizing affine transforms helps understand how attention inputs are linearly combined and projected.

Core Concepts #

Key Symbols & Notation #

W (weight matrix), b (bias vector)

Essential Relationships #

Unlocks (3) #

Attention Mechanisms (lvl 5)

Embeddings (Dense Representations) (lvl 4)

Sequence Masking (causal and padding masks) (lvl 4)

Advanced Learning Details

Graph Position #

Depth Cost: 5

Fan-Out (ROI): 4

Bottleneck Score: 1

Chain Length: 0

Cognitive Load #

Atomic Elements: 5

Total Elements: 32

Percentile Level: L1

Atomic Level: L3

All Concepts (13) #

Teaching Strategy #

Self-serve tutorial - low prerequisites, straightforward concepts.

Every time a Transformer turns token vectors into queries, keys, and values, it’s doing the same fundamental operation: take an input vector, mix its components with a matrix, then shift the result with a bias. That simple “mix then shift” move—an affine transformation—is the workhorse behind linear layers.

TL;DR:

An affine transformation maps x ↦ Wx + b. The matrix W performs a linear map (rotation/scale/shear/projection and component mixing), and the bias b translates (shifts) the output. In neural networks, this is a learned linear layer used to project embeddings into new spaces (like Q/K/V in attention).

What Is an Affine Transformation (Linear Layer)? #

Why this concept exists #

In many systems you want a controllable way to transform a vector of features into a new vector of features. In machine learning, you repeatedly need to:

  1. Combine input features into new features (weighted sums)
  2. Recenter or shift the output (so “zero input” doesn’t force “zero output”)

A linear map does (1). A bias/translation does (2). Together they form an affine transformation.

Definition #

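An affine transformation from ℝⁿ to ℝᵐ is a function of the form:

f(x) = Wx + b

where:

W ∈ ℝᵐˣⁿ is the weight matrix (the linear part)

b ∈ ℝᵐ is the bias vector (the translation)

x ∈ ℝⁿ is the input, and f(x) ∈ ℝᵐ is the output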

In neural-network language, this is a linear layer (often called “fully connected”), even though mathematically it’s affine unless b = 0.

Intuition: “mix then shift” #

Think of x as a column of numbers (features). Multiplying by W creates weighted sums of those features—each output component is a mixture of all input components.

Then adding b shifts the result by a constant offset independent of x.

Linear vs affine (the key distinction) #

A linear map L(x) = Wx has a special property:

L(0) = W0 = 0

But an affine map A(x) = Wx + b generally does not:

A(0) = W0 + b = b, which is nonzero whenever b ≠ 0

So the bias is exactly what lets the model output something nonzero even when the input is zero.
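A minimal sketch of this distinction in plain Python. The helper names `matvec` and `affine` and the numbers in `W` and `b` are illustrative assumptions, not part of any library or trained model.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x (a list of numbers)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def affine(W, b, x):
    """Compute Wx + b: mix with W, then shift by b."""
    return [wx_i + b_i for wx_i, b_i in zip(matvec(W, x), b)]

# Illustrative values only.
W = [[1.0, 3.0], [-2.0, 0.0]]
b = [4.0, 1.0]
zero = [0.0, 0.0]

linear_at_zero = matvec(W, zero)     # the linear part sends 0 to 0
affine_at_zero = affine(W, b, zero)  # the affine map sends 0 to b

print(linear_at_zero, affine_at_zero)
```

Whatever values W holds, the first result stays at zero; only b can move it.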

Geometry: what affine transforms preserve #

Affine transformations preserve straight lines and parallelism. They do not necessarily preserve angles or lengths.

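A useful mental model: W reshapes the space around the origin (rotating, scaling, shearing, or projecting it), and then b slides the entire result by one fixed offset.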

Shapes (dimensions) matter #

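In ML you’ll constantly track dimensions. Here’s the standard setup:

Symbol | Shape | Role
x | ℝⁿ (n×1) | input vector
W | ℝᵐˣⁿ (m×n) | weight matrix
b | ℝᵐ (m×1) | bias vector
y | ℝᵐ (m×1) | output vector

A compact dimension check:

(m×n)(n×1) + (m×1) → (m×1)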

This one rule prevents many mistakes later when you build attention projections.
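The dimension rule can be turned into a small guard. This is a sketch under the assumption that matrices are stored as lists of equal-length rows; `shape` and `check_affine_shapes` are hypothetical helper names.

```python
def shape(M):
    """Return (rows, cols) of a matrix stored as a list of equal-length rows."""
    return (len(M), len(M[0]))

def check_affine_shapes(W, b, x):
    """Raise ValueError unless W is m×n, x has length n, and b has length m."""
    m, n = shape(W)
    if len(x) != n:
        raise ValueError(f"x has length {len(x)}, but W expects {n}")
    if len(b) != m:
        raise ValueError(f"b has length {len(b)}, but W produces {m}")
    return m  # length of the output y = Wx + b

W = [[2, 0, 1], [-1, 3, 2]]  # 2×3, so n = 3 and m = 2
out_dim = check_affine_shapes(W, b=[0, 5], x=[1, 2, -1])
print(out_dim)
```

Checking shapes before multiplying gives an error that names the mismatched dimension instead of a silently truncated result.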

Core Mechanic 1 — The Linear Map: Matrix Multiplication as Weighted Sums #

Why start with the matrix part? #

If you strip away the bias, the matrix multiplication Wx is the mixing engine of a linear layer. It’s how models learn to combine features: emphasize some, suppress others, and create new features from old ones.

Row view: each output is a dot product #

Let W have rows w₁ᵀ, w₂ᵀ, …, wₘᵀ (each wᵢ ∈ ℝⁿ). Then:

(Wx)ᵢ = wᵢᵀx for i = 1, …, m

So each output component is a dot product between the input and a learned weight vector.

Write this explicitly:

Wx = [w₁ᵀx; w₂ᵀx; …; wₘᵀx]

This is why people say a linear layer computes “weighted sums”: each yᵢ is a sum of input components xⱼ multiplied by weights.

Component form: the summation you’ll see in derivations #

If W has entries Wᵢⱼ, then:

yᵢ = Σⱼ Wᵢⱼxⱼ, with j running from 1 to n

This shows two important things:

  1. Each output coordinate can depend on all input coordinates.
  2. The weights Wᵢⱼ are exactly “how much does xⱼ contribute to yᵢ?”.

Column view: the output is a linear combination of columns #

Let W’s columns be c₁, …, cₙ (each cⱼ ∈ ℝᵐ). Then:

Wx = x₁c₁ + x₂c₂ + … + xₙcₙ

So the input scalars xⱼ decide how much of each column vector cⱼ is added.

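This is a powerful geometric view: the output Wx always lies in the span of W’s columns (the column space), and the entries of x supply the coefficients.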

Mixing features: why matrices are more than per-feature scaling #

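A diagonal matrix scales each coordinate independently:

diag(d₁, …, dₙ)x = [d₁x₁; …; dₙxₙ]

But a full matrix creates new features by mixing:

yᵢ = Wᵢ₁x₁ + Wᵢ₂x₂ + … + Wᵢₙxₙ, so every xⱼ with Wᵢⱼ ≠ 0 feeds into yᵢ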

In representation learning, this mixing is essential: the model can rotate into a coordinate system where some later operation (like attention scoring) becomes easier.
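The row and column readings of Wx, and the contrast between a diagonal and a full matrix, can be checked with a short sketch. The function names and the numbers are assumptions chosen for illustration.

```python
def row_view(W, x):
    """Each output entry is the dot product of one row of W with x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def column_view(W, x):
    """The output is the sum over j of x[j] times column j of W."""
    m = len(W)
    y = [0] * m
    for j, xj in enumerate(x):
        for i in range(m):
            y[i] += xj * W[i][j]
    return y

W = [[2, 0, 1], [-1, 3, 2]]   # full matrix: outputs mix several inputs
x = [1, 2, -1]
print(row_view(W, x), column_view(W, x))  # both give [1, 3]

D = [[2, 0], [0, 3]]          # diagonal matrix: per-coordinate scaling only
print(row_view(D, [5, 7]))    # [10, 21]
```

Both functions compute the same product; they differ only in which loop is treated as the outer one, which is exactly the difference between the two mental pictures.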

A small but crucial property: linearity #

For L(x) = Wx, and any vectors u, v ∈ ℝⁿ and scalars α, β:

L(αu + βv) = αL(u) + βL(v)

You can verify by algebra:

L(αu + βv)

= W(αu + βv)

= αWu + βWv

= αL(u) + βL(v)

This matters conceptually: the linear part preserves the “add and scale” structure of vectors.
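A numeric spot check of linearity, as a sketch with integer values so the comparison is exact. `matvec` and `combine` are illustrative helper names, and one example is of course not a proof.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def combine(alpha, u, beta, v):
    """Return the vector alpha*u + beta*v."""
    return [alpha * ui + beta * vi for ui, vi in zip(u, v)]

W = [[1, 3], [-2, 0]]
u, v = [2, -1], [0, 4]
alpha, beta = 2, 5

lhs = matvec(W, combine(alpha, u, beta, v))             # L(αu + βv)
rhs = combine(alpha, matvec(W, u), beta, matvec(W, v))  # αL(u) + βL(v)
print(lhs, rhs)  # [58, -8] [58, -8]
```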

When does W change dimensionality? #

Affine/linear layers are often used to change feature dimension:

Goal | n → m | Interpretation
Compression | large n → small m | projection / bottleneck
Expansion | small n → large m | lift into richer feature space
Same size | n → n | rotation/scale/shear/mixing

In Transformers, projections often keep the model dimension d the same (d → d) but also create multiple heads (conceptually splitting into h subspaces). Even when the final dimension is the same, W is still doing a learned change of basis.

Core Mechanic 2 — The Bias: Translation and Changing the “Default Output” #

Why do we add a bias at all? #

If you only have y = Wx, then the output is forced to be 0 when x = 0. That’s not always desirable.

In ML terms: without a bias, the model can only represent functions that pass through the origin. A bias lets the model set a baseline output.

Definition and immediate consequence #

An affine layer is:

y = Wx + b

Evaluate at x = 0:

y = W0 + b = b

So b is the output the layer produces when given zero input.

Geometry: translation #

The map x ↦ Wx transforms the space around the origin. Adding b then shifts every output by the same vector.

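If two inputs differ by Δx:

y₁ = Wx + b

y₂ = W(x + Δx) + b

Subtract:

y₂ − y₁ = WΔx

Notice b cancels. This reveals an important geometric fact: the difference between two outputs depends only on W and the difference between the inputs, never on b.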

So W controls how differences are transformed; b controls where the transformed cloud sits.
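A quick sketch of the cancellation: the same pair of inputs is pushed through the layer with two different biases, and the output difference is compared. All names and numbers here are assumptions for illustration.

```python
def affine(W, b, x):
    """Compute Wx + b for one input vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

W = [[1, 3], [-2, 0]]
x = [2, -1]
dx = [1, 1]
x_shifted = [xi + di for xi, di in zip(x, dx)]

def output_difference(b):
    """Return affine(x + dx) - affine(x) for a given bias b."""
    y1 = affine(W, b, x)
    y2 = affine(W, b, x_shifted)
    return [a - c for a, c in zip(y2, y1)]

print(output_difference([4, 1]))     # [4, -2]
print(output_difference([100, -7]))  # [4, -2], the bias has no effect
```

In both calls the result equals W applied to `dx`, which is the only quantity the difference depends on.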

Bias as an extra feature (homogeneous coordinates idea) #

A useful trick is to rewrite the affine map as a pure matrix multiply by augmenting the input with a 1.

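Create an extended vector and matrix:

x̄ = [ x ; 1 ] ∈ ℝⁿ⁺¹

W̄ = [ W b ] ∈ ℝᵐˣ⁽ⁿ⁺¹⁾

Then:

W̄x̄

= [ W b ] [ x ; 1 ]

= Wx + b·1

= Wx + b

Why this is conceptually helpful: the bias becomes just another column of weights acting on a constant feature equal to 1, so the whole layer is linear in its parameters.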

Bias and decision boundaries (quick ML connection) #

Even before deep learning, linear models use biases.

A linear classifier might compute:

s(x) = wᵀx + b

The set where s(x) = 0 is a hyperplane:

{ x ∈ ℝⁿ : wᵀx + b = 0 }

If b = 0, the hyperplane must pass through the origin. With b ≠ 0, it can shift, greatly increasing what you can represent.
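A small sketch of how the bias moves the boundary, assuming a two-feature classifier with hypothetical weights `w = [1, 2]`. A score of zero means the point lies exactly on the decision boundary.

```python
def score(w, b, x):
    """Linear classifier score: dot(w, x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w = [1, 2]
origin = [0, 0]
point = [2, 1]

print(score(w, 0, origin))   # 0: with b = 0 the boundary passes through the origin
print(score(w, -4, origin))  # -4: with b = -4 the origin is off the boundary
print(score(w, -4, point))   # 0: the shifted boundary now passes through (2, 1)
```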

What about bias in Transformers? #

Many Transformer implementations include bias terms in linear projections, though some variants remove them for efficiency or symmetry (and compensate elsewhere). Conceptually, knowing that b exists helps you interpret a projection as a learned mixing of features (W) plus a learned baseline (b).

Even if a specific architecture sets b = 0, the affine framework is still the general concept.

Application/Connection — Affine Layers in Attention (Q, K, V Projections) and Embeddings #

Why affine transformations show up in attention #

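Attention needs vectors in roles that are not identical:

Queries: “search vectors” that express what a token is looking for

Keys: “address vectors” that queries are matched against

Values: “content vectors” that carry what gets passed along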

Even if all tokens start as embeddings in the same space ℝᵈ (model dimension d), the model benefits from learning different projections for these different roles.

The standard projections #

Given a token representation x ∈ ℝᵈ, attention uses learned affine maps:

q = W_Qx + b_Q

k = W_Kx + b_K

v = W_Vx + b_V

where W_Q, W_K, W_V ∈ ℝᵈˣᵈ (often) and biases are in ℝᵈ.

With sequences, you apply this to every position. If X ∈ ℝˡˣᵈ is a matrix whose rows are token vectors, then:

Q = XW_Qᵀ + 1b_Qᵀ

K = XW_Kᵀ + 1b_Kᵀ

V = XW_Vᵀ + 1b_Vᵀ

Here 1 ∈ ℝˡ is a vector of ones. The important idea is: the same affine transform is applied independently to each token vector.
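To illustrate "the same affine transform at every position", the sketch below projects each row of a toy sequence separately and compares the result with the entry-by-entry matrix form X·Wᵀ plus a bias row repeated for every token. The tiny sizes (3 tokens, d = 2) and all values are assumptions for illustration.

```python
def affine(W, b, x):
    """Compute Wx + b for one token vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def project_rows(X, W, b):
    """Apply the same affine map to every row (token) of X."""
    return [affine(W, b, x) for x in X]

def matrix_form(X, W, b):
    """Compute X times W-transpose, plus b added to every row, entry by entry."""
    return [[sum(X[t][j] * W[i][j] for j in range(len(W[0]))) + b[i]
             for i in range(len(W))]
            for t in range(len(X))]

X = [[1, 0], [0, 1], [2, -1]]  # three token vectors as rows
W_Q = [[1, 3], [-2, 0]]
b_Q = [4, 1]

Q = project_rows(X, W_Q, b_Q)
print(Q)  # [[5, -1], [7, 1], [3, -3]]
```

Because each row is handled independently, reordering the tokens only reorders the rows of Q.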

Multi-head attention as multiple affine projections #

In multi-head attention with h heads, each head often uses a smaller per-head dimension d_head where d = h·d_head.

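One way to view this: a single projection from ℝᵈ to ℝᵈ is computed, and its output is split into h chunks of size d_head, one per head.

Another view (equivalent conceptually): each head has its own smaller projection from d dimensions down to d_head dimensions, and the h head outputs are computed side by side.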

Either way, the key point is that attention relies on learned affine maps to create multiple learned “views” of the same input.
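The two readings of multi-head projection can be compared on a toy case. This sketch assumes d = 4, h = 2, d_head = 2 and a head layout in which head i owns a contiguous block of output rows; real implementations may lay out heads differently.

```python
def affine(W, b, x):
    """Compute Wx + b for one token vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

d, h = 4, 2
d_head = d // h

W = [[1, 0, 2, 0],
     [0, 1, 0, 2],
     [1, 1, 0, 0],
     [0, 0, 1, 1]]
b = [1, 2, 3, 4]
x = [1, 2, 3, 4]

# View 1: one combined d -> d projection, then split into h chunks.
q = affine(W, b, x)
split_heads = [q[i * d_head:(i + 1) * d_head] for i in range(h)]

# View 2: one small d -> d_head projection per head (row blocks of W and b).
per_head = [affine(W[i * d_head:(i + 1) * d_head],
                   b[i * d_head:(i + 1) * d_head], x)
            for i in range(h)]

print(split_heads, per_head)  # [[8, 12], [6, 11]] in both cases
```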

Why affine (not just linear) matters for interpretation #

Suppose you compare two tokens x₁ and x₂. Their query difference is:

q₂ − q₁ = W_Q(x₂ − x₁)

So the bias does not change relative geometry, but it does change the absolute location. In dot-product attention, absolute location can matter because dot products are not translation-invariant. For shift vectors a and c:

(q + a)ᵀ(k + c) = qᵀk + qᵀc + aᵀk + aᵀc, which generally differs from qᵀk

This is one reason biases can subtly affect attention score distributions.

Connecting to embeddings #

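Embeddings give you dense vectors e(token) ∈ ℝᵈ. On their own they are just coordinates. Affine layers are how the model re-expresses those coordinates for each role it needs: queries, keys, and values for attention, and the expand-then-compress features of the feed-forward layers.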

In practice, a Transformer block is largely a sequence of affine maps plus nonlinearities and normalization. Recognizing “Wx + b” everywhere helps you read architectures without getting lost.

Connection to masking (preview) #

Masking affects which attention scores are allowed, but the scores themselves come from dot products of affine-projected vectors:

sᵢⱼ = qᵢᵀkⱼ = (W_Qxᵢ + b_Q)ᵀ(W_Kxⱼ + b_K)

So masking is applied after affine projections have created Q and K. Understanding affine projections helps you see that masking doesn’t change how Q/K/V are computed; it changes which pairings (i,j) are considered.
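A rough sketch of that ordering: scores are computed from already-projected Q and K, and only then is a mask applied. The causal rule used here (position i may not look at j > i) and the use of negative infinity for blocked entries are assumptions for illustration; the toy Q and K stand in for outputs of affine projections.

```python
NEG_INF = float("-inf")

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def scores(Q, K):
    """Score matrix with entry [i][j] = dot(Q[i], K[j])."""
    return [[dot(q, k) for k in K] for q in Q]

def causal_mask(S):
    """Block position i from attending to any later position j > i."""
    return [[s if j <= i else NEG_INF for j, s in enumerate(row)]
            for i, row in enumerate(S)]

Q = [[1, 0], [0, 1], [1, 1]]
K = [[1, 2], [3, 4], [5, 6]]

S = scores(Q, K)         # [[1, 3, 5], [2, 4, 6], [3, 7, 11]]
masked = causal_mask(S)  # entries above the diagonal become -inf
print(masked)
```

Q and K are never modified; the mask only changes which entries of the score matrix remain usable.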

Summary table: where affine transforms appear in a Transformer #

Component | Typical form | Purpose
Q projection | q = W_Qx + b_Q | prepare “search vectors”
K projection | k = W_Kx + b_K | prepare “address vectors”
V projection | v = W_Vx + b_V | prepare “content vectors”
Output projection | o = W_Oz + b_O | mix heads back together
Feed-forward layer 1 | h = W₁x + b₁ | expand dimension
Feed-forward layer 2 | y = W₂φ(h) + b₂ | compress back

Once you can fluently interpret each row of W as “a learned feature detector” and b as “a learned baseline,” the architecture becomes much more transparent.
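The feed-forward pattern (expand, apply a nonlinearity, compress) is two affine maps in sequence. In this sketch the nonlinearity φ is assumed to be ReLU, each layer has its own bias, and the dimensions (2 → 4 → 2) and weights are made up for illustration.

```python
def affine(W, b, x):
    """Compute Wx + b for one input vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(h):
    """Elementwise max(0, value); stands in for the nonlinearity."""
    return [max(0, hi) for hi in h]

W1 = [[1, 0], [0, 1], [1, 1], [-1, -1]]  # 4×2: expand from 2 to 4 features
b1 = [0, 0, -1, 0]
W2 = [[1, 1, 1, 1], [2, 0, 0, 1]]        # 2×4: compress from 4 back to 2
b2 = [0, 1]

x = [1, 2]
h = affine(W1, b1, x)        # [1, 2, 2, -3]
y = affine(W2, b2, relu(h))  # [5, 3]
print(h, y)
```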

Worked Examples (3) #

Example 1 — Compute an affine transform and interpret the result #

Let x ∈ ℝ² be x = [2; −1]. Let W ∈ ℝ²ˣ² and b ∈ ℝ² be:

W = [[1, 3], [−2, 0]]

b = [4; 1]

Compute y = Wx + b, and interpret each output coordinate as a weighted sum plus bias.

  1. Start with y = Wx + b.

  2. Compute Wx using row-by-row dot products.

    First row of W is w₁ᵀ = [1, 3].

    So (Wx)₁ = [1, 3] · [2; −1]

    = 1·2 + 3·(−1)

    = 2 − 3

    = −1.

  3. Second row of W is w₂ᵀ = [−2, 0].

    So (Wx)₂ = [−2, 0] · [2; −1]

    = (−2)·2 + 0·(−1)

    = −4 + 0

    = −4.

  4. So Wx = [−1; −4].

  5. Add the bias b:

    y = Wx + b

    = [−1; −4] + [4; 1]

    = [3; −3].

Insight: Each output is a learned weighted sum of inputs plus a learned offset. Here y₁ = 1·x₁ + 3·x₂ + 4 and y₂ = (−2)·x₁ + 0·x₂ + 1. The matrix mixes features; the bias shifts the baseline.

Example 2 — Show that bias cancels in differences, but affects absolute dot products #

Consider an affine projection used for queries: q = Wx + b. Take two inputs x₁ and x₂. (1) Derive q₂ − q₁. (2) Show how biases can still affect a dot-product score qᵀk when both sides have biases.

  1. Write the two projected queries:

    q₁ = Wx₁ + b

    q₂ = Wx₂ + b

  2. Subtract:

    q₂ − q₁

    = (Wx₂ + b) − (Wx₁ + b)

    = Wx₂ + b − Wx₁ − b

    = W(x₂ − x₁).

  3. So the bias b does not affect differences between projected vectors; it only shifts them together.

  4. Now suppose keys also have a bias: k = Ux' + c, where x' is the (possibly different) token supplying the key.

    A dot-product score between a query and a key is:

    score = qᵀk

    = (Wx + b)ᵀ (Ux' + c).

  5. Expand the dot product carefully:

    (Wx + b)ᵀ (Ux' + c)

    = (Wx)ᵀ(Ux') + (Wx)ᵀc + bᵀ(Ux') + bᵀc.

  6. Even though biases cancel in differences, they introduce extra terms in absolute dot products: (Wx)ᵀc, bᵀ(Ux'), and bᵀc.

Insight: Bias doesn’t change relative geometry (differences), but attention scoring depends on absolute dot products, so biases can shift score distributions via additional cross-terms. This is one reason architectural choices about bias can matter in practice.
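The four-term expansion can be checked numerically. The sketch below uses small integer matrices (all values are illustrative assumptions) and compares the direct score with the sum of the four terms.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

W, b = [[1, 3], [-2, 0]], [4, 1]   # query projection and bias
U, c = [[0, 1], [1, 1]], [1, -1]   # key projection and bias
x, x_prime = [2, -1], [1, 2]       # query token and key token

Wx, Ux = matvec(W, x), matvec(U, x_prime)
q = [a + bi for a, bi in zip(Wx, b)]
k = [a + ci for a, ci in zip(Ux, c)]

direct = dot(q, k)
terms = [dot(Wx, Ux), dot(Wx, c), dot(b, Ux), dot(b, c)]
print(direct, terms, sum(terms))  # 3 [-14, 3, 11, 3] 3
```

Without any bias the score would be −14; the three bias-dependent terms move it to 3.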

Example 3 — Rewrite an affine map as a single matrix multiplication (homogeneous trick) #

Let W ∈ ℝ³ˣ² and b ∈ ℝ³ define y = Wx + b. Construct an augmented matrix W̄ and augmented vector x̄ so that y = W̄x̄ with no explicit + b.

  1. Start with y = Wx + b, where x ∈ ℝ² and y ∈ ℝ³.

  2. Augment the input by appending 1:

    x̄ = [ x ; 1 ] ∈ ℝ³.

  3. Create the augmented matrix by appending b as an extra column:

    W̄ = [ W b ] ∈ ℝ³ˣ³.

  4. Multiply:

    W̄x̄ = [ W b ] [ x ; 1 ]

    = Wx + b·1

    = Wx + b

    = y.

Insight: Bias can be treated as weights on a constant feature. This is handy for reasoning and for deriving gradients: affine maps are linear in their parameters.
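A minimal sketch of the augmentation in plain Python, with a 3×2 weight matrix and made-up numbers. The names `W_bar` and `x_bar` mirror the augmented matrix and vector and are otherwise arbitrary.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

W = [[1, 0], [0, 1], [1, 1]]  # 3×2
b = [1, 2, 3]
x = [5, 7]

# Explicit affine form: Wx + b.
y_affine = [wx + bi for wx, bi in zip(matvec(W, x), b)]

# Homogeneous form: append b as an extra column and 1 as an extra input.
W_bar = [row + [bi] for row, bi in zip(W, b)]  # 3×3
x_bar = x + [1]                                # length 3
y_homogeneous = matvec(W_bar, x_bar)

print(y_affine, y_homogeneous)  # [6, 9, 15] [6, 9, 15]
```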

Key Takeaways #

Common Mistakes #

Practice #

easy

Let x = [1; 2; −1] ∈ ℝ³, W = [[2, 0, 1], [−1, 3, 2]] ∈ ℝ²ˣ³, and b = [0; 5] ∈ ℝ². Compute y = Wx + b.

Hint: Compute Wx by row dot products, then add b.

Show solution

Wx:

First row: [2,0,1]·[1;2;−1] = 2·1 + 0·2 + 1·(−1) = 2 − 1 = 1

Second row: [−1,3,2]·[1;2;−1] = (−1)·1 + 3·2 + 2·(−1) = −1 + 6 − 2 = 3

So Wx = [1; 3].

Add b: y = [1;3] + [0;5] = [1;8].

medium

Suppose f(x) = Wx + b with W ∈ ℝᵐˣⁿ. Prove that for any u, v ∈ ℝⁿ and scalar α, the following holds: f(αu + (1−α)v) = αf(u) + (1−α)f(v).

Hint: Expand both sides using distributivity of matrix multiplication; watch how the bias terms combine.

Show solution

Left side:

f(αu + (1−α)v) = W(αu + (1−α)v) + b

= αWu + (1−α)Wv + b.

Right side:

αf(u) + (1−α)f(v)

= α(Wu + b) + (1−α)(Wv + b)

= αWu + αb + (1−α)Wv + (1−α)b

= αWu + (1−α)Wv + (α + 1−α)b

= αWu + (1−α)Wv + b.

Both sides match, so the identity holds for every scalar α. (This is a defining “affine” property: it preserves affine combinations, and in particular convex combinations when 0 ≤ α ≤ 1.)
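A numeric spot check of the identity, as a sketch with integer values so the comparison is exact. It also shows that plain additivity fails for an affine map, because the bias would be counted twice. The names and numbers are illustrative assumptions.

```python
def f(x, W=((1, 3), (-2, 0)), b=(4, 1)):
    """Affine map f(x) = Wx + b with fixed illustrative W and b."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def combine(alpha, u, beta, v):
    """Return the vector alpha*u + beta*v."""
    return [alpha * ui + beta * vi for ui, vi in zip(u, v)]

u, v = [2, -1], [0, 4]
alpha = 3  # any scalar works, since the weights alpha and 1 - alpha sum to 1

lhs = f(combine(alpha, u, 1 - alpha, v))
rhs = combine(alpha, f(u), 1 - alpha, f(v))
print(lhs, rhs)  # [-23, -11] [-23, -11]

# Weights that do not sum to 1 break the identity: f(u + v) != f(u) + f(v).
print(f(combine(1, u, 1, v)), combine(1, f(u), 1, f(v)))  # [15, -3] [19, -2]
```

The mismatch in the last line is exactly b = [4, 1], the extra copy of the bias.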

medium

You have a Transformer with model dimension d = 512 and number of heads h = 8. If per-head dimension is d_head = 64, what are the typical shapes of W_Q, W_K, W_V for the combined projection (single matrix per type), and what is the shape of the per-token bias b_Q?

Hint: Combined projections usually map ℝᵈ → ℝᵈ, then reshape into (h, d_head).

Show solution

Since d = h·d_head = 8·64 = 512, a common design is:

W_Q ∈ ℝᵈˣᵈ = ℝ⁵¹²ˣ⁵¹² (and similarly W_K, W_V).

The bias b_Q is added to each token’s projected query vector, so b_Q ∈ ℝᵈ = ℝ⁵¹².

After computing q = W_Qx + b_Q, the result in ℝ⁵¹² is reshaped/split into 8 heads of size 64.

Connections #

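Next nodes you can unlock and why they rely on affine maps:

Attention Mechanisms: queries, keys, and values are all affine projections of the same token vectors.

Embeddings (Dense Representations): embeddings supply the input vectors that affine layers then re-project.

Sequence Masking (causal and padding masks): masks act on scores computed from affine-projected Q and K.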

Quality: A (4.6/5)
