Linear Regression

←Back to Tech Tree

inventorycoverage

Linear Regression #

Machine LearningDifficulty: ★★★☆☆Depth: 9Unlocks: 4

Predicting continuous values. Least squares, normal equations.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

X (design matrix: examples x features, include column of ones for intercept)beta (parameter/coefficients vector)y (target/response vector of observed continuous values)

Essential Relationships #

Prerequisites (2) #

Machine Learning Introduction5 atomsProjections6 atoms

Unlocks (1) #

Time Series Foundationslvl 3

Referenced by (1) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (1) #

[real estateBusiness

Hedonic pricing models (price = f(sqft, beds, location, age)) are the canonical RE valuation tool - automated valuation models are literally multivariate regression](/business/real-estate/)

Advanced Learning Details

Graph Position #

150

Depth Cost

4

Fan-Out (ROI)

1

Bottleneck Score

9

Chain Length

Cognitive Load #

6

Atomic Elements

39

Total Elements

L2

Percentile Level

L4

Atomic Level

All Concepts (14) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Linear regression is the “hello world” of predicting numbers: you pick a linear rule, compare its predictions to real outcomes, and adjust the rule to make the squared errors as small as possible. Under the hood, it’s an elegant geometry problem about projecting y onto the column space of X.

TL;DR:

Linear regression predicts a continuous target by ŷ = Xβ. We fit β by minimizing the sum of squared residuals ‖yXβ‖². The solution satisfies the normal equations XXβ = Xy (when XX is invertible) and geometrically corresponds to projecting y onto the subspace spanned by the columns of X.

What Is Linear Regression? #

The goal #

In supervised learning for continuous outputs, you want a model that maps an input feature vector to a real number:

Linear regression is the simplest widely useful choice:

The linear predictive model #

For one example with features x, linear regression predicts

ŷ = β₀ + β₁x₁ + β₂x₂ + … + β_d x_d

To keep notation clean, we fold the intercept into the feature vector by adding a constant feature:

Then the prediction becomes a dot product:

ŷ = xᵀβ

From one example to a dataset #

Suppose you have n training examples. Stack them into a design matrix X:

Also stack targets into a vector:

Predictions for all examples at once:

ŷ = Xβ

This compact form is the main reason linear regression is such a good “bridge” topic: it connects machine learning, linear algebra, geometry, and optimization.

Why the squared error? #

You need a rule for choosing β. Linear regression chooses β to make predictions close to y.

Define residuals:

r = yXβ

A natural idea is to minimize total error. But we need a scalar objective. Squared error is the classic choice:

SSE(β) = ∑ᵢ (yᵢ − ŷᵢ)² = ‖yXβ‖²

Why square?

What you should keep in mind #

Linear regression is not “the true relationship is linear.” It’s:

You can include nonlinear transformations as features (e.g., x², log x), and the model is still linear in β.

Core Mechanic 1: Least-Squares Fitting (Minimizing ‖y − Xβ‖²) #

The optimization problem #

Least squares fitting means choosing β to minimize squared residual length:

minimize over β: J(β) = ‖yXβ‖²

This is a geometry problem wearing an optimization costume.

So the fitted predictions Xβ̂ are the orthogonal projection of y onto Col(X).

Expanding the objective (so we can take derivatives) #

Write J(β) as a quadratic form:

J(β) = (yXβ)ᵀ(yXβ)

Expand carefully:

J(β)

= (yᵀ − βᵀXᵀ)(yXβ)

= yyyXβ − βᵀXy + βᵀXXβ

Note that yXβ is a scalar, and equals its transpose:

yXβ = ( yXβ )ᵀ = βᵀXy

So the middle two terms combine:

J(β) = yy − 2βᵀXy + βᵀXXβ

This makes it clear J(β) is a convex quadratic function in β.

The gradient and the normal equations #

To minimize J(β), set its gradient to zero.

Using standard matrix calculus results:

∂/∂β (βᵀAβ) = (A + Aᵀ)β, and when A is symmetric: = 2Aβ

Here A = XX, which is symmetric.

Compute the gradient:

∇β J(β)

= ∇β( yy − 2βᵀXy + βᵀXXβ )

= 0 − 2Xy + 2XXβ

Set ∇β J(β) = 0:

2XXβ − 2Xy = 0

Divide by 2:

XXβ = Xy

These are the normal equations.

Solving for β̂ #

If XX is invertible (full column rank), then

β̂ = (XX)⁻¹ Xy

This is the classic closed-form least squares solution.

If XX is not invertible (e.g., perfectly collinear features), there are infinitely many solutions that achieve the same minimum SSE. A common choice is the minimum-norm solution via the Moore–Penrose pseudoinverse:

β̂ = Xy

At the optimum, residual r = yXβ̂ is orthogonal to the column space of X.

In algebra:

Xᵀ(yXβ̂) = 0

which rearranges to the normal equations.

So you can read the normal equations as:

That is exactly the condition for an orthogonal projection.

What least squares is really doing #

Least squares chooses β to balance tradeoffs across all points.

This is why linear regression is both:

Core Mechanic 2: Intercept, Geometry, and When the Normal Equations Behave #

The intercept as a constant feature #

Including an intercept means adding a column of ones:

X = [ 1 | feature columns ]

This matters because without an intercept, the model is forced to go through the origin in feature space (in 1D, the line must pass through (0,0)). In real data, that constraint is often unjustified and increases error.

Shapes and dimensions (so you don’t get lost) #

It helps to keep a mental “type system” for linear regression:

If you ever feel stuck, check dimensions first.

Geometry: column space and projections #

The set of all predictions the model can make is

{ Xβ : β ∈ ℝᵖ } = Col(X)

That is a subspace of ℝⁿ (or an affine subspace if you treat the intercept differently; with the column-of-ones trick, it’s still a linear subspace in ℝⁿ).

The fitted predictions are

ŷ = Xβ̂ = Proj_{Col(X)}(y)

The residual vector

r = yŷ

is orthogonal to Col(X).

When does (XX)⁻¹ exist? #

(XX) is invertible iff the columns of X are linearly independent (full column rank). Practically, it can fail for several reasons:

Even if it exists, it can be numerically unstable if XX is ill-conditioned.

Practical numerical note: don’t literally invert #

Although the formula β̂ = (XX)⁻¹Xy is conceptually important, computing the inverse explicitly is usually a bad idea numerically.

Common stable approaches:

ApproachIdeaProsCons
QR decompositionX = Q****R, solve Rβ = QyStable, standardMore linear algebra machinery
SVD / pseudoinverseX = UΣVBest for rank-deficient casesSlower
Gradient-based optimizationminimize J(β) iterativelyScales, works with huge dataNeeds learning rate / iterations

This lesson focuses on the conceptual math; just remember that implementations often use QR/SVD under the hood.

Centering and scaling (feature preprocessing) #

Least squares itself doesn’t require scaling, but conditioning and interpretation improve when you:

Centering also interacts with the intercept:

A note on “linear” #

Linear regression is linear in parameters β, not necessarily linear in raw input.

Example: If you use features [1, x, x²], predictions are

ŷ = β₀ + β₁x + β₂x²

This is a quadratic curve in x, but still a linear regression model because it is a linear combination of features with coefficients β.

Application/Connection: Prediction, Evaluation, and Why It’s a Foundation for ML #

Using the model to predict #

After fitting β̂, predictions for a new example xₙₑw are

ŷₙₑw = xₙₑwᵀ β̂

For a batch of new examples Xₙₑw:

ŷₙₑw = Xₙₑw β̂

Interpreting coefficients (carefully) #

If feature xⱼ increases by 1 (holding other features fixed), the prediction changes by βⱼ.

That “holding fixed” clause is crucial: in correlated data, changing one feature while keeping others constant may describe a scenario that doesn’t naturally occur.

Common evaluation metrics #

Since the training objective is SSE, related metrics are common:

R² as “variance explained” (intuition) #

Define:

Then

R² = 1 − SSE/SST

Interpretation:

Be cautious: R² always (weakly) increases when you add features, even useless ones. This is one reason regularization (like ridge regression) matters later.

Overfitting and generalization #

Linear regression can still overfit when:

Least squares will chase patterns in the sample even if they are not stable.

The usual fix is to evaluate on held-out data (validation/test) and/or add regularization.

Why linear regression is foundational #

Linear regression is a gateway to many later ideas:

If you understand linear regression deeply—especially the projection view—you’re set up to understand a lot of ML with much less memorization.

Worked Examples (3) #

Fit a line in 1D with an intercept (compute β̂ via normal equations) #

Data: (x, y) = (0, 1), (1, 2), (2, 2). Use an intercept by setting features x = [1, x]ᵀ. Compute β̂ = [β₀, β₁]ᵀ.

  1. Build the design matrix X (n = 3, p = 2):

    X = [ [1, 0],

    [1, 1],

    [1, 2] ]

  2. Build the target vector:

    y = [1, 2, 2]ᵀ

  3. Compute XX:

    XX = [ [1,1,1],

    [0,1,2] ] · [ [1,0],

    [1,1],

    [1,2] ]

    = [ [3, 3],

    [3, 5] ]

  4. Compute Xy:

    Xy = [ [1,1,1],

    [0,1,2] ] · [1,2,2]ᵀ

    = [ 1+2+2,

    0·1 + 1·2 + 2·2 ]ᵀ

    = [5, 6]ᵀ

  5. Solve the normal equations (XX)β = Xy:

    [ [3, 3],

    [3, 5] ] · [β₀, β₁]ᵀ = [5, 6]ᵀ

  6. Solve the linear system:

    From 3β₀ + 3β₁ = 5 ⇒ β₀ + β₁ = 5/3

    From 3β₀ + 5β₁ = 6

    Subtract the first equation (scaled by 3):

    (3β₀ + 5β₁) − (3β₀ + 3β₁) = 6 − 5

    2β₁ = 1 ⇒ β₁ = 1/2

    Then β₀ = 5/3 − 1/2 = 10/6 − 3/6 = 7/6

  7. Write the fitted model:

    ŷ = β₀ + β₁x = 7/6 + (1/2)x

  8. Compute fitted values and residuals:

    For x=0: ŷ=7/6, residual = 1 − 7/6 = −1/6

    For x=1: ŷ=7/6+1/2=7/6+3/6=10/6=5/3, residual = 2 − 5/3 = 1/3

    For x=2: ŷ=7/6+1=13/6, residual = 2 − 13/6 = −1/6

Insight: Even in this tiny dataset, the best line is not the one that hits the most points exactly; it’s the one whose residual vector has minimal length ‖r‖ and is orthogonal to the column space of X.

Verify the orthogonality condition **X**ᵀ(**y** − **X**β̂) = **0** for the fitted line #

Use the previous example’s β̂ with X and y. Compute residual r = yXβ̂ and check Xr = 0.

  1. Recall:

    X = [ [1,0],

    [1,1],

    [1,2] ], y = [1,2,2]ᵀ, β̂ = [7/6, 1/2]ᵀ

  2. Compute predictions ŷ = Xβ̂:

    Row 1: [1,0]·β̂ = 7/6

    Row 2: [1,1]·β̂ = 7/6 + 1/2 = 5/3

    Row 3: [1,2]·β̂ = 7/6 + 1 = 13/6

    So ŷ = [7/6, 5/3, 13/6]ᵀ

  3. Compute residuals r = yŷ:

    r = [1 − 7/6,

    2 − 5/3,

    2 − 13/6]ᵀ

    = [−1/6,

    1/3,

    −1/6]ᵀ

  4. Compute Xr:

    Xr = [ [1,1,1],

    [0,1,2] ] · [−1/6, 1/3, −1/6]ᵀ

  5. First component (dot with column of ones):

    (−1/6) + (1/3) + (−1/6)

    = −1/6 + 2/6 − 1/6

    = 0

  6. Second component (dot with x column):

    0·(−1/6) + 1·(1/3) + 2·(−1/6)

    = 1/3 − 2/6

    = 2/6 − 2/6

    = 0

  7. Therefore Xᵀ(yXβ̂) = Xr = 0, confirming the projection/normal-equation condition.

Insight: The normal equations are not arbitrary algebra: they encode the geometric fact that the residual is perpendicular to every feature column, meaning you can’t move within the model’s subspace to get any closer to y.

Multiple features: compute β̂ for a tiny 2-feature design matrix #

Let X include an intercept and two features:

X = [ [1, 1, 0],

[1, 0, 1],

[1, 1, 1] ],

y = [1, 2, 2]ᵀ.

Compute β̂ by solving XXβ = Xy.

  1. Write columns of X:

    Intercept column c₀ = [1,1,1]ᵀ

    Feature 1 column c₁ = [1,0,1]ᵀ

    Feature 2 column c₂ = [0,1,1]ᵀ

  2. Compute XX using dot products of columns:

    XX = [ [c₀ᵀc₀, c₀ᵀc₁, c₀ᵀc₂],

    [c₁ᵀc₀, c₁ᵀc₁, c₁ᵀc₂],

    [c₂ᵀc₀, c₂ᵀc₁, c₂ᵀc₂] ]

  3. Compute each dot product:

    c₀ᵀc₀ = 1+1+1 = 3

    c₀ᵀc₁ = 1+0+1 = 2

    c₀ᵀc₂ = 0+1+1 = 2

    c₁ᵀc₁ = 1+0+1 = 2

    c₁ᵀc₂ = (1·0)+(0·1)+(1·1) = 1

    c₂ᵀc₂ = 0+1+1 = 2

  4. So:

    XX = [ [3, 2, 2],

    [2, 2, 1],

    [2, 1, 2] ]

  5. Compute Xy by dotting columns with y:

    Xy = [ c₀ᵀy, c₁ᵀy, c₂ᵀy ]ᵀ

  6. Compute:

    c₀ᵀy = 1+2+2 = 5

    c₁ᵀy = 1·1 + 0·2 + 1·2 = 3

    c₂ᵀy = 0·1 + 1·2 + 1·2 = 4

    So Xy = [5, 3, 4]ᵀ

  7. Solve the system:

    [ [3, 2, 2],

    [2, 2, 1],

    [2, 1, 2] ] · [β₀, β₁, β₂]ᵀ = [5, 3, 4]ᵀ

  8. One solution path (elimination):

    From equation (2): 2β₀ + 2β₁ + β₂ = 3

    From equation (3): 2β₀ + β₁ + 2β₂ = 4

    Subtract (3) − (2): (2β₀ cancels)

    (β₁ + 2β₂) − (2β₁ + β₂) = 4 − 3

    (−β₁ + β₂) = 1 ⇒ β₂ = β₁ + 1

  9. Use equation (1): 3β₀ + 2β₁ + 2β₂ = 5

    Substitute β₂:

    3β₀ + 2β₁ + 2(β₁ + 1) = 5

    3β₀ + 4β₁ + 2 = 5

    3β₀ + 4β₁ = 3

  10. Use equation (2): 2β₀ + 2β₁ + β₂ = 3

    Substitute β₂ = β₁ + 1:

    2β₀ + 2β₁ + (β₁ + 1) = 3

    2β₀ + 3β₁ + 1 = 3

    2β₀ + 3β₁ = 2

  11. Solve the 2×2 system:

    3β₀ + 4β₁ = 3

    2β₀ + 3β₁ = 2

    Multiply the second by 3: 6β₀ + 9β₁ = 6

    Multiply the first by 2: 6β₀ + 8β₁ = 6

    Subtract: (6β₀+9β₁) − (6β₀+8β₁) = 6 − 6

    β₁ = 0

    Then 2β₀ + 3·0 = 2 ⇒ β₀ = 1

    Then β₂ = β₁ + 1 = 1

  12. So β̂ = [1, 0, 1]ᵀ and the model is:

    ŷ = 1 + 0·x₁ + 1·x₂ = 1 + x₂

Insight: Even with multiple features, the workflow is the same: build X, compute XX and Xy, then solve. Thinking in terms of column dot products makes XX feel less mysterious.

Key Takeaways #

Common Mistakes #

Practice #

easy

Given data points (x, y) = (1, 1), (2, 2), (3, 2). Fit a 1D linear regression with intercept using normal equations. Report β̂ and the predicted values ŷ for x = 1,2,3.

Hint: Use X = [[1,1],[1,2],[1,3]] and β̂ = (XX)⁻¹Xy (or solve the 2×2 normal equations directly).

Show solution

Let X = [ [1,1], [1,2], [1,3] ] and y = [1,2,2]ᵀ.

Compute XX = [ [3, 6], [6, 14] ].

Compute Xy = [ 1+2+2, 1·1+2·2+3·2 ]ᵀ = [5, 11]ᵀ.

Solve [ [3,6],[6,14] ] [β₀,β₁]ᵀ = [5,11]ᵀ.

From 3β₀+6β₁=5 and 6β₀+14β₁=11.

Multiply the first by 2: 6β₀+12β₁=10. Subtract from second: 2β₁=1 ⇒ β₁=1/2.

Then 3β₀+6(1/2)=5 ⇒ 3β₀+3=5 ⇒ β₀=2/3.

So β̂ = [2/3, 1/2]ᵀ.

Predictions: x=1: ŷ=2/3+1/2=7/6; x=2: ŷ=2/3+1=5/3; x=3: ŷ=2/3+3/2=13/6.

medium

Show that at the least-squares optimum β̂, the residual r = yXβ̂ is orthogonal to every column of X. (Derive the condition, don’t just state it.)

Hint: Start from J(β) = ‖yXβ‖². Expand and take ∇β, then set it equal to 0.

Show solution

Let J(β) = (yXβ)ᵀ(yXβ).

Expand:

J(β) = yy − 2βᵀXy + βᵀXXβ.

Take gradient:

∇β J(β) = −2Xy + 2XXβ.

At optimum β̂: ∇β J(β̂) = 0XXβ̂ = Xy.

Rearrange:

XyXXβ̂ = 0

Xᵀ(yXβ̂) = 0

Xr = 0.

This means each column cⱼ of X satisfies cⱼᵀr = 0, i.e., r is orthogonal to every feature column.

hard

Suppose you have two features where the second is exactly double the first for all examples (x₂ = 2x₁), and you include an intercept. What does this imply about XX and the uniqueness of β̂? What would you do in practice?

Hint: Think about linear dependence among columns of X and what that means for invertibility. Consider pseudoinverse or removing a redundant feature.

Show solution

If x₂ = 2x₁ for all examples, then the column for feature 2 equals 2 times the column for feature 1. Therefore the columns of X are linearly dependent, so X does not have full column rank. Then XX is singular (not invertible).

Least squares still has minimizers, but β̂ is not unique: many coefficient vectors produce the same predictions Xβ (because changing β₁ and β₂ along the dependent direction leaves Xβ unchanged).

In practice you can (1) drop one of the redundant features, restoring full rank; (2) use a pseudoinverse solution β̂ = Xy (which selects a minimum-norm β̂ among all minimizers); or (3) add regularization (e.g., ridge regression) to make the problem well-posed.

Connections #

Quality: A (4.4/5)

← back to treebrowse all →