Projections

←Back to Tech Tree

inventorycoverage

Projections #

Linear AlgebraDifficulty: ★★★☆☆Depth: 6Unlocks: 5

Projecting vectors onto subspaces. Least squares.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

Projection matrix P = A (A^T A)^{-1} A^T (when A has full column rank and S = Col(A)).Normal equations: A^T A x_hat = A^T b (the linear system whose solution x_hat gives the least-squares fit).

Essential Relationships #

Prerequisites (2) #

Orthogonality5 atomsDot Product5 atoms

Unlocks (1) #

Linear Regressionlvl 3

Advanced Learning Details

Graph Position #

50

Depth Cost

5

Fan-Out (ROI)

1

Bottleneck Score

6

Chain Length

Cognitive Load #

6

Atomic Elements

39

Total Elements

L2

Percentile Level

L4

Atomic Level

All Concepts (15) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

When data doesn’t fit your model perfectly, you don’t “solve” Ax = b—you choose the best approximation. Projections are the geometric rule that turns “best approximation” into a precise, computable object.

TL;DR:

Projecting a vector v onto a subspace S means finding the unique u ∈ S that minimizes ‖vu‖. For S = Col(A) with full column rank, the orthogonal projection is p = Pb, where P = A(AᵀA)⁻¹Aᵀ. Least squares chooses that makes A the projection of b onto Col(A), leading to the normal equations AᵀA = Aᵀb.

What Is a Projection? #

Why projections matter (motivation) #

In linear algebra, a subspace S is a “legal set of directions.” Often, the vector you have—call it v—doesn’t lie inside S. That mismatch happens constantly:

In all these cases, you want a principled way to replace v with a nearby vector u that does lie in S. The key idea is: choose the closest one.

Definition (orthogonal projection) #

Let S ⊂ ℝⁿ be a subspace, and let v ∈ ℝⁿ. The orthogonal projection of v onto S, written proj_S(v), is the unique vector u ∈ S that minimizes the distance to v:

minimize over u ∈ S: ‖vu‖.

This is a geometric “drop a perpendicular” idea generalized from 2D/3D to ℝⁿ.

The orthogonality condition (what makes it “orthogonal”) #

The minimizer isn’t just any closest point. It satisfies a very specific condition:

Let p = proj_S(v). Define the residual (error) r = vp.

Then:

Meaning r is orthogonal to every vector in S:

This is the characterization that makes projections computable.

Uniqueness (why there is exactly one best point) #

Because S is a subspace (hence closed and convex) and the objective ‖vu‖² is strictly convex in u, there is exactly one minimizer.

Geometrically: the set of points in S is “flat,” and the distance function has a single lowest point when you look from v toward that flat set.

Decomposing a vector into parallel + perpendicular parts #

Orthogonal projection gives a clean decomposition:

So projection is not merely “closest point.” It is a structured split into a component inside the subspace and a component orthogonal to it.

A quick sanity check: 1D subspace (a line) #

If S is the span of a nonzero vector a, then the projection is the familiar formula:

If a is unit length (‖a‖ = 1), this simplifies to:

The general subspace case is this idea repeated, but with multiple basis directions at once.

Core Mechanic 1: Orthogonal Projections via Geometry and Conditions #

Why we need a computable rule #

The definition “pick the closest u ∈ S” is conceptually perfect, but it doesn’t yet tell you how to compute u. The trick is to translate “closest” into an equation.

Instead of minimizing ‖vu‖ directly, minimize the squared norm (same minimizer, easier algebra):

The key idea: the residual is orthogonal to the subspace #

Let p ∈ S be the minimizer. Consider any direction s ∈ S (a feasible direction to move within the subspace). If you nudge p inside the subspace by ts, the distance to v cannot decrease at t = 0 (otherwise p wasn’t minimal).

Compute the derivative:

f(p + ts) = ‖v − (p + ts)‖²

= ‖(vp) − ts‖²

= ((vp) − ts)ᵀ((vp) − ts)

= ‖vp‖² − 2t sᵀ(vp) + t²‖s‖².

At a minimum, derivative at t = 0 is 0:

d/dt f(p + ts) |_{t=0} = −2 sᵀ(vp) = 0.

So:

which is exactly:

This is the most important practical fact: the residual is orthogonal to the model subspace.

Using a basis: turning “orthogonal to S” into linear equations #

Suppose S is spanned by k linearly independent vectors a₁, …, aₖ. Put them into a matrix A ∈ ℝⁿˣᵏ as columns:

Any vector in S looks like Ax for some x ∈ ℝᵏ.

We want p = Ax that makes r = v − Ax orthogonal to S.

Being orthogonal to S is equivalent to being orthogonal to every column of A:

Stack these k equations:

⇒ AᵀAx = Aᵀv.

These are the normal equations in the language of least squares (we’ll return to that). For projection, they are simply the orthogonality conditions written in matrix form.

Orthonormal basis special case (fast mental model) #

If the columns of Q ∈ ℝⁿˣᵏ are orthonormal (QᵀQ = I), then projection is especially simple.

Let S = Col(Q). We want p ∈ S of the form p = Qc.

Use the orthogonality condition:

Qᵀ(v − Qc) = 0

⇒ Qᵀv − QᵀQc = 0

⇒ Qᵀv − Ic = 0

c = Qᵀv.

So:

Interpretation:

  1. Compute coordinates c = Qᵀv (dot products against basis vectors).

  2. Rebuild the vector in the subspace: Qc.

This is the cleanest picture of what projection does.

Pythagorean identity (what you gain from orthogonality) #

With v = p + r and pr:

v‖² = ‖p‖² + ‖r‖².

This is often used to reason about error. Projection chooses p to make ‖r‖ as small as possible, so it also makes ‖p‖ as large as possible among all vectors in S that could approximate v.

Core Mechanic 2: Projection as a Linear Map (Projection Matrices) #

Why introduce a projection matrix? #

If you will project many different vectors onto the same subspace S, you want a reusable operator. Projections are not just geometric constructions; they are linear transformations.

For a fixed subspace S ⊂ ℝⁿ, the orthogonal projection is a function P: ℝⁿ → ℝⁿ such that:

When written in standard coordinates, this function is multiplication by an n×n matrix, also called P.

Column-space projection: P = A(AᵀA)⁻¹Aᵀ #

Assume A ∈ ℝⁿˣᵏ has full column rank (its columns are independent). Then AᵀA is invertible.

Given b ∈ ℝⁿ, the projection of b onto Col(A) is p = A where solves the normal equations:

AᵀA = Aᵀb

= (AᵀA)⁻¹Aᵀb.

Plug back into p:

p = A

= A((AᵀA)⁻¹Aᵀb)

= (A(AᵀA)⁻¹Aᵀ)b.

So the projection matrix is:

What P does (and why it’s special) #

For an orthogonal projection matrix P onto S:

  1. Idempotent: P² = P

Apply the projection twice: once you’re in the subspace, projecting again does nothing.

  1. Symmetric: Pᵀ = P

This encodes “orthogonal” in matrix form.

  1. Range and null space:

So P keeps the S component and kills the orthogonal component.

Verifying idempotence (show the work) #

Let P = A(AᵀA)⁻¹Aᵀ. Then:

P² = [A(AᵀA)⁻¹Aᵀ][A(AᵀA)⁻¹Aᵀ]

= A(AᵀA)⁻¹(AᵀA)(AᵀA)⁻¹Aᵀ

= A(I)(AᵀA)⁻¹Aᵀ

= A(AᵀA)⁻¹Aᵀ

= P.

The step (AᵀA)⁻¹(AᵀA) = I works because A has full column rank.

Verifying symmetry (show the work) #

Pᵀ = [A(AᵀA)⁻¹Aᵀ]ᵀ

= (Aᵀ)ᵀ[(AᵀA)⁻¹]ᵀAᵀ

= A[(AᵀA)⁻¹]Aᵀ

= P,

because AᵀA is symmetric, so (AᵀA)⁻¹ is also symmetric.

Orthonormal columns shortcut: P = QQᵀ #

If Q has orthonormal columns (QᵀQ = I), then:

P = Q(QᵀQ)⁻¹Qᵀ = QIQᵀ = QQᵀ.

This is why orthonormal bases are so valuable: they make projection formulas simpler and numerically more stable.

Eigenvalues intuition (optional but useful) #

From P² = P, any eigenvalue λ satisfies λ² = λ, so λ ∈ {0, 1}.

So P is literally a “keep these directions, erase those directions” operator.

Application/Connection: Least Squares and the Normal Equations #

Why least squares is projection in disguise #

Consider a system Ax = b. If b ∈ Col(A), there is an exact solution. But if b is not in Col(A), there is no x that makes Ax equal to b.

Least squares replaces “solve exactly” with “get as close as possible”:

This is the same geometry as projection:

So the least-squares fitted vector A is:

Deriving the normal equations (show the work) #

Minimize the squared error:

f(x) = ‖Axb‖²

= (Axb)ᵀ(Axb).

Expand:

f(x) = xᵀAᵀAx − 2xᵀAᵀb + bb.

Take gradient and set to zero:

∇f(x) = 2AᵀAx − 2Aᵀb = 0

⇒ AᵀA = Aᵀb.

These are the normal equations.

Geometric meaning: the residual r = b − A is orthogonal to Col(A):

Aᵀ(b − A) = 0.

That is exactly the orthogonality condition for projection.

When does the formula (AᵀA)⁻¹Aᵀ work? #

If A has full column rank, AᵀA is invertible and:

If A does not have full column rank, there are infinitely many least-squares minimizers; then one typically uses the pseudoinverse A⁺ (outside this node’s core scope, but good to know).

Residual properties you can rely on #

Let p = A and r = bp.

This gives a stable mental model: least squares chooses the decomposition of b into “explained by the model subspace” plus “leftover orthogonal noise.”

Connection to linear regression #

In linear regression, A is the design matrix, x is the parameter vector, and b (often written y) is the target output. The fitted predictions A are the projection of y onto Col(A).

That single sentence unifies:

Worked Examples (3) #

Project **v** onto a line span(**a**) #

Let a = (2, 1) and v = (1, 3). Let S = span(a) ⊂ ℝ². Compute proj_S(v) and the residual r.

  1. Use the line projection formula:

    proj_S(v) = ((av) / (aa)) a.

  2. Compute dot products:

    av = (2, 1)·(1, 3) = 2·1 + 1·3 = 5.

    aa = (2, 1)·(2, 1) = 2² + 1² = 5.

  3. Compute the scalar coefficient:

    (av) / (aa) = 5/5 = 1.

  4. Compute the projection:

    proj_S(v) = 1·a = (2, 1).

  5. Compute the residual:

    r = v − proj_S(v) = (1, 3) − (2, 1) = (−1, 2).

  6. Check orthogonality to the subspace (it suffices to check ⟂ to a):

    ar = (2, 1)·(−1, 2) = 2(−1) + 1(2) = 0.

    So r ⟂ span(a), as required.

Insight: The closest point on a line is found by matching the component of v along the line direction and discarding the perpendicular component. The residual must be orthogonal to the line—this condition is what generalizes to higher-dimensional subspaces.

Projection onto Col(A) using P = A(AᵀA)⁻¹Aᵀ #

Let A = [[1],[1],[0]] (a 3×1 matrix) and b = (2, 0, 1). Compute the projection of b onto Col(A) and verify P² = P.

  1. Interpretation:

    Col(A) is the line in ℝ³ spanned by a = (1, 1, 0). We will compute P explicitly.

  2. Compute AᵀA:

    AᵀA = [1 1 0]·[1; 1; 0] = 1² + 1² + 0² = 2.

    So (AᵀA)⁻¹ = 1/2.

  3. Compute P:

    P = A(AᵀA)⁻¹Aᵀ

    = 1; 1; 0[1 1 0]

    = (1/2)

    [[1·1, 1·1, 1·0],

    [1·1, 1·1, 1·0],

    [0·1, 0·1, 0·0]]

    = (1/2)

    [[1, 1, 0],

    [1, 1, 0],

    [0, 0, 0]].

  4. Project b:

    p = Pb

    = (1/2)

    [[1, 1, 0],

    [1, 1, 0],

    [0, 0, 0]]

    [2; 0; 1]

    = (1/2)

    [1·2 + 1·0 + 0·1;

    1·2 + 1·0 + 0·1;

    0]

    = (1/2)[2; 2; 0]

    = (1, 1, 0).

  5. Residual:

    r = bp = (2, 0, 1) − (1, 1, 0) = (1, −1, 1).

    Check orthogonality to Col(A):

    ar = (1, 1, 0)·(1, −1, 1) = 1 − 1 + 0 = 0.

  6. Verify idempotence (compute P²):

    First note P = (1/2)M with M = [[1,1,0],[1,1,0],[0,0,0]].

    Compute M²:

    M² = [[1,1,0],[1,1,0],[0,0,0]]·[[1,1,0],[1,1,0],[0,0,0]]

    = [[2,2,0],[2,2,0],[0,0,0]]

    = 2M.

    So P² = (1/2)² M² = (1/4)(2M) = (1/2)M = P.

Insight: The matrix P acts as a reusable operator: any b is decomposed into p ∈ Col(A) and r ⟂ Col(A). The algebraic identities P² = P and Pᵀ = P encode the geometric facts “projecting twice changes nothing” and “the residual is orthogonal.”

Least squares fit as a projection (normal equations) #

Fit b = (1, 2, 2) with a model Ax where A has two columns:

A = [[1,0],[1,1],[1,2]].

Compute the least-squares solution and the fitted vector p = A.

  1. Set up the normal equations:

    AᵀA = Aᵀb.

  2. Compute AᵀA.

    Aᵀ = [[1,1,1],[0,1,2]].

    So:

    AᵀA = [[1,1,1],[0,1,2]]·[[1,0],[1,1],[1,2]]

    = [[1+1+1, 0+1+2],

    [0+1+2, 0²+1²+2²]]

    = [[3, 3],

    [3, 5]].

  3. Compute Aᵀb:

    Aᵀb = [[1,1,1],[0,1,2]]·[1;2;2]

    = [1+2+2;

    0·1 + 1·2 + 2·2]

    = [5;

    6].

  4. Solve [[3,3],[3,5]] = [5,6].

    Let = (x₁, x₂).

    Equations:

    3x₁ + 3x₂ = 5

    3x₁ + 5x₂ = 6

    Subtract first from second:

    (3x₁ + 5x₂) − (3x₁ + 3x₂) = 6 − 5

    2x₂ = 1 ⇒ x₂ = 1/2.

    Plug back:

    3x₁ + 3(1/2) = 5

    3x₁ + 3/2 = 5

    3x₁ = 5 − 3/2 = 10/2 − 3/2 = 7/2

    x₁ = 7/6.

  5. Compute fitted vector p = A:

    A = [[1,0],[1,1],[1,2]]·[7/6; 1/2]

    = [7/6 + 0;

    7/6 + 1/2;

    7/6 + 2·(1/2)]

    = [7/6;

    7/6 + 3/6;

    7/6 + 6/6]

    = [7/6;

    10/6;

    13/6]

    = (7/6, 5/3, 13/6).

  6. Compute residual r = bp:

    r = (1,2,2) − (7/6, 5/3, 13/6)

    = (6/6−7/6, 6/3−5/3, 12/6−13/6)

    = (−1/6, 1/3, −1/6).

    Check orthogonality to Col(A) by verifying Aᵀr = 0:

    Aᵀr = [[1,1,1],[0,1,2]]·[−1/6; 1/3; −1/6]

    = [−1/6 + 1/3 − 1/6;

    0·(−1/6) + 1·(1/3) + 2·(−1/6)]

    = [0;

    1/3 − 2/6]

    = [0;

    0].

Insight: Least squares chooses so that the residual is orthogonal to every column of A. That orthogonality is exactly the geometric condition for projection onto Col(A).

Key Takeaways #

Common Mistakes #

Practice #

easy

Let a = (1, −2, 2) and v = (2, 1, 0). Compute proj_span(a)(v) and the residual r. Verify ar = 0.

Hint: Use proj_span(a)(v) = ((av) / (aa)) a.

Show solution

av = (1,−2,2)·(2,1,0) = 2 − 2 + 0 = 0.

aa = 1² + (−2)² + 2² = 1 + 4 + 4 = 9.

Coefficient = 0/9 = 0.

Projection = 0·a = (0,0,0).

Residual r = v0 = (2,1,0).

Check: ar = av = 0, so orthogonality holds.

medium

Let A = [[1,1],[1,0],[0,1]] and b = (1,2,3). (i) Compute solving AᵀA = Aᵀb. (ii) Compute p = A. (iii) Compute r = bp and verify Aᵀr = 0.

Hint: Compute AᵀA (2×2) and Aᵀb (2×1), solve the 2×2 system, then form p and r.

Show solution

Aᵀ = [[1,1,0],[1,0,1]].

AᵀA = [[1,1,0],[1,0,1]]·[[1,1],[1,0],[0,1]]

= [[1²+1²+0², 1·1+1·0+0·1],

[1·1+0·1+1·0, 1²+0²+1²]]

= [[2, 1],

[1, 2]].

Aᵀb = [[1,1,0],[1,0,1]]·[1;2;3] = [1+2+0; 1+0+3] = [3;4].

Solve [[2,1],[1,2]] [x₁;x₂] = [3;4].

From 2x₁ + x₂ = 3 ⇒ x₂ = 3 − 2x₁.

Plug into x₁ + 2x₂ = 4:

x₁ + 2(3 − 2x₁) = 4 ⇒ x₁ + 6 − 4x₁ = 4 ⇒ −3x₁ = −2 ⇒ x₁ = 2/3.

Then x₂ = 3 − 2(2/3) = 3 − 4/3 = 5/3.

So = (2/3, 5/3).

Compute p = A:

A = [[1,1],[1,0],[0,1]]·[2/3; 5/3]

= [2/3+5/3; 2/3; 5/3]

= [7/3; 2/3; 5/3].

Residual r = bp = [1;2;3] − [7/3;2/3;5/3]

= [3/3−7/3; 6/3−2/3; 9/3−5/3]

= [−4/3; 4/3; 4/3].

Verify Aᵀr:

Aᵀr = [[1,1,0],[1,0,1]]·[−4/3; 4/3; 4/3]

= [−4/3 + 4/3 + 0; −4/3 + 0 + 4/3] = [0;0].

hard

Suppose P is an orthogonal projection matrix in ℝⁿ (so P² = P and Pᵀ = P). Show that for any v, the residual r = v − Pv is orthogonal to every vector in Range(P).

Hint: Take an arbitrary y in Range(P). Write y = Px for some x. Compute yr and use symmetry/idempotence.

Show solution

Let y ∈ Range(P). Then ∃ x such that y = Px.

Let r = v − Pv.

Compute:

yr = (Px)ᵀ(v − Pv)

= xᵀPᵀ(v − Pv) (move P using transpose)

= xᵀP(v − Pv) (since Pᵀ = P)

= xᵀ(Pv − P²v)

= xᵀ(Pv − Pv) (since P² = P)

= x0

= 0.

So yr = 0 for all y ∈ Range(P), meaning r ⟂ Range(P).

Connections #

Next up: projections are the geometric backbone of least squares regression.

Related reinforcement nodes you may have seen:

Quality: A (4.6/5)

← back to treebrowse all →