Norms

←Back to Tech Tree

inventorycoverage

Norms #

Linear AlgebraDifficulty: ★★☆☆☆Depth: 3Unlocks: 1

Vector length/magnitude. L1, L2 (Euclidean), Linf norms.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

||v||_p (norm of vector v; subscript p denotes p-norm)

Essential Relationships #

Prerequisites (1) #

Dot Product5 atoms

Unlocks (1) #

Clusteringlvl 4

Advanced Learning Details

Graph Position #

23

Depth Cost

1

Fan-Out (ROI)

1

Bottleneck Score

3

Chain Length

Cognitive Load #

6

Atomic Elements

26

Total Elements

L0

Percentile Level

L4

Atomic Level

All Concepts (11) #

Teaching Strategy #

Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.

If you can measure “how far” a vector is from zero, you can compare directions, choose the nearest point, control model complexity, and reason about geometry. Norms are the standard way to do that—and different norms create different notions of distance and “closeness.”

TL;DR:

A norm ‖v‖ assigns a nonnegative length to a vector, with three rules: (1) it’s zero only for the zero vector, (2) scaling scales length by |α|, and (3) it obeys the triangle inequality. The most common are ‖v‖₁ (sum of absolute values), ‖v‖₂ (Euclidean), and ‖v‖∞ (max absolute component). Different norms change geometry and behavior in algorithms like clustering.

What Is a Norm? #

Why we need “length” beyond pictures #

In 2D or 3D, “length” feels obvious: draw an arrow from the origin to a point and measure how long it is. But in computer science and machine learning you constantly work in higher-dimensional spaces:

In those settings, you still need a rigorous way to answer:

A norm is the mathematical object that turns these questions into consistent computations.

Definition (with intuition) #

A norm is a function that maps a vector to a nonnegative real number:

‖·‖ : ℝⁿ → ℝ

It must satisfy three properties for all vectors u, v and all scalars α:

  1. Positive-definiteness

Intuition: length can’t be negative; the only vector with zero length is the zero vector.

  1. Absolute homogeneity

‖αv‖ = |α| ‖v

Intuition: scaling an arrow by 3 makes it 3× longer; scaling by −3 flips direction but length is still 3×.

  1. Triangle inequality

u + v‖ ≤ ‖u‖ + ‖v

Intuition: taking two steps (first u, then v) can’t be shorter than the straight-line shortcut by more than the total step lengths.

Norms vs dot products (and why you need both) #

You said you already know the dot product. Great—because in ℝⁿ the Euclidean norm is closely tied to it:

v‖₂ = √(v · v)

But norms are more general than dot products: you can have norms that don’t come from dot products (like ‖·‖₁ and ‖·‖∞). That flexibility is useful: different norms encode different ideas of what it means to be “close,” “large,” or “small.”

From norms to distances #

A norm automatically defines a distance (a metric) between two vectors:

d(x, y) = ‖xy

So if you choose a norm, you also choose a geometry for your space.

Common p-norms (preview) #

The most used family is the p-norms:

v‖ₚ = (∑ᵢ |vᵢ|ᵖ)¹ᐟᵖ, for p ≥ 1

Special cases you’ll use constantly:

Each is a valid norm, satisfies the three properties, and leads to a different notion of “ball” and “nearest.”

Core Mechanic 1: Computing L1, L2, and L∞ Norms #

Why these three show up everywhere #

When you implement algorithms, you want norms that are:

‖·‖₁, ‖·‖₂, and ‖·‖∞ are the standard trio because they emphasize different aspects of a vector:

Formulas (and what they “measure”) #

Let v = (v₁, v₂, …, vₙ).

L1 norm

v‖₁ = ∑ᵢ |vᵢ|

Interpretation: add up the absolute contributions of every coordinate.

L2 (Euclidean) norm

v‖₂ = √(∑ᵢ vᵢ²) = √(v · v)

Interpretation: the usual straight-line length.

L∞ norm

v‖∞ = maxᵢ |vᵢ|

Interpretation: how large the largest-magnitude coordinate is.

A quick comparison table #

NormFormulaWhat it emphasizesTypical use
v‖₁∑ᵢvᵢ
v‖₂√(∑ᵢ vᵢ²)geometric length, energygeometry, least squares, k-means default
v‖∞maxᵢvᵢ

Unit balls: same rule, different geometry #

A great way to feel norms is to look at their unit balls in 2D: the set of vectors whose norm ≤ 1.

This matters because “closest point” problems (like clustering) depend on the shape of these balls.

Scaling behavior (absolute homogeneity in action) #

If α is a scalar and v is a vector, all norms must satisfy:

‖αv‖ = |α| ‖v

For p-norms, you can see it directly:

‖αv‖ₚ

= (∑ᵢ |α vᵢ|ᵖ)¹ᐟᵖ

= (∑ᵢ (|α| |vᵢ|)ᵖ)¹ᐟᵖ

= (∑ᵢ |α|ᵖ |vᵢ|ᵖ)¹ᐟᵖ

= (|α|ᵖ ∑ᵢ |vᵢ|ᵖ)¹ᐟᵖ

= |α| (∑ᵢ |vᵢ|ᵖ)¹ᐟᵖ

= |α| ‖v‖ₚ

So p-norms automatically obey one of the key norm axioms.

Core Mechanic 2: The Three Axioms (and How to Reason With Them) #

Why axioms matter (not just definitions) #

When you rely on a norm inside an algorithm, you’re often using its properties implicitly:

The three norm axioms are the “license” that lets you do these steps safely.

1) Positive-definiteness: “length is never negative” #

For p-norms (p ≥ 1), every term |vᵢ|ᵖ is ≥ 0, so the sum is ≥ 0, so the p-th root is ≥ 0.

Also, ‖v‖ₚ = 0 implies:

(∑ᵢ |vᵢ|ᵖ)¹ᐟᵖ = 0

⇒ ∑ᵢ |vᵢ|ᵖ = 0

A sum of nonnegative numbers is 0 only if each term is 0:

∀i, |vᵢ|ᵖ = 0 ⇒ |vᵢ| = 0 ⇒ vᵢ = 0

So v = 0.

2) Absolute homogeneity: “scaling scales length” #

You already saw the derivation in the previous section. This property is what makes norms behave like a geometric length. It also prevents weird measures like “length(2v) = length(v) + 7.”

3) Triangle inequality: the most powerful rule #

Triangle inequality is often the hardest to prove, but the easiest to use.

It says:

u + v‖ ≤ ‖u‖ + ‖v

A very common corollary is a bound on differences (sometimes called a reverse triangle inequality variant):

|‖u‖ − ‖v‖| ≤ ‖uv

This tells you: if two vectors are close, their lengths are close.

Here’s the derivation using triangle inequality twice.

Start with:

u = (uv) + v

Apply triangle inequality:

u‖ = ‖(uv) + v‖ ≤ ‖uv‖ + ‖v

Rearrange:

u‖ − ‖v‖ ≤ ‖uv‖ (1)

Swap u and v:

v‖ − ‖u‖ ≤ ‖vu‖ = ‖uv‖ (2)

Combine (1) and (2):

|‖u‖ − ‖v‖| ≤ ‖uv

This inequality is a frequent tool when analyzing iterative algorithms.

Norm equivalence intuition (why different norms still relate) #

In finite-dimensional spaces like ℝⁿ, all norms are “equivalent” in the sense that they bound each other up to constants. Practically: if one norm is small, the others can’t be arbitrarily huge.

For the three norms we care about, these inequalities are especially useful:

  1. v‖∞ ≤ ‖v‖₂ ≤ ‖v‖₁

Reasoning (intuition):

  1. And with dimension n, you can also bound the other direction:

v‖₁ ≤ √n ‖v‖₂

v‖₂ ≤ √n ‖v‖∞

These tell you a key high-dimensional fact: the gap between norms can grow with √n. So in large n, your choice of norm can meaningfully change distances and nearest neighbors.

A note about p < 1 #

You might see “‖v‖ₚ” written for p < 1 in some ML contexts (e.g., sparsity). But for p < 1, triangle inequality fails, so it’s not a true norm. People still use it as a penalty, but you lose some guarantees.

Application/Connection: Norms as Distances in Clustering (and Why Choice Matters) #

Why clustering cares about norms #

Clustering groups points by proximity. But “proximity” is defined by a distance, and a distance is often built from a norm:

Change the norm → change the geometry → change which points are nearest → change the clustering.

K-means and why it “likes” L2 #

Classic k-means is typically presented with squared Euclidean distance:

minimize ∑ (over points x) ‖xμ(cluster(x))‖₂²

Why L2²? Because it pairs perfectly with means.

If you take a single cluster and want the best center c to minimize:

J(c) = ∑ᵢ ‖xᵢ − c‖₂²

the minimizer is the componentwise mean. That makes the update step simple and fast.

(If you instead used L1 distance, the best “center” is a median per coordinate, leading to k-medians. So the norm choice changes the algorithm’s natural center.)

L1 distance: robustness and “city block” geometry #

With L1, distances add coordinate-wise. This can be more robust to outliers in some settings and can better match data where moving along axes is natural (think grid/city streets).

Geometrically, L1 balls are diamonds in 2D. That tends to create cluster boundaries aligned differently than L2.

L∞ distance: worst-case deviation #

L∞ treats the distance between x and y as the largest coordinate difference:

xy‖∞ = maxᵢ |xᵢ − yᵢ|

This is useful when you care about maximum error in any feature (e.g., tolerances). In clustering, it makes points “close” if they’re close in every coordinate (no big deviation allowed).

Practical guidance: choosing a norm #

If your notion of similarity is…ConsiderWhy
straight-line geometric closenessL2rotation-invariant, common default
total absolute deviation across featuresL1robust-ish, encourages axis-aligned structure
worst-feature deviation must be smallL∞enforces uniform closeness across coordinates

Norms, scaling, and feature normalization #

Norm-based distances are sensitive to units. If one coordinate is measured in dollars and another in millimeters, the large-scale coordinate can dominate any norm.

A common fix is to normalize features (e.g., z-score standardization). This is not a norm concept by itself, but norms make the need obvious: the distance ‖xy‖ depends on coordinate scales.

Connecting back to dot product #

Because ‖v‖₂ = √(v · v), anything that uses dot products often implicitly uses L2 norms. Examples:

So norms complete the story: dot products measure alignment; norms measure magnitude; together they describe angle and distance.

Worked Examples (3) #

Compute ‖**v**‖₁, ‖**v**‖₂, and ‖**v**‖∞ for a concrete vector #

Let v = (3, −4, 12). Compute the L1, L2, and L∞ norms.

  1. Compute L1:

    v‖₁ = |3| + |−4| + |12|

    = 3 + 4 + 12

    = 19

  2. Compute L2:

    v‖₂ = √(3² + (−4)² + 12²)

    = √(9 + 16 + 144)

    = √169

    = 13

  3. Compute L∞:

    v‖∞ = max(|3|, |−4|, |12|)

    = max(3, 4, 12)

    = 12

Insight: Each norm answers a different question: L1 totals all coordinate magnitudes (19), L2 gives geometric length (13), and L∞ reports the largest coordinate magnitude (12).

Use triangle inequality to bound a hard-to-compute norm #

Let u and v be vectors. Suppose ‖u‖₂ = 5 and ‖v‖₂ = 2. You do not know the angle between them. Bound ‖u + v‖₂ and also bound the possible values of ‖uv‖₂.

  1. Upper bound ‖u + v‖₂ using triangle inequality:

    u + v‖₂ ≤ ‖u‖₂ + ‖v‖₂

    = 5 + 2

    = 7

  2. Lower bound ‖u + v‖₂ using the reverse-triangle form:

    |‖u‖₂ − ‖v‖₂| ≤ ‖u + v‖₂

    So:

    u + v‖₂ ≥ |5 − 2| = 3

  3. So the combined bound is:

    3 ≤ ‖u + v‖₂ ≤ 7

  4. Now bound ‖uv‖₂ similarly.

    Triangle inequality gives:

    uv‖₂ ≤ ‖u‖₂ + ‖v‖₂ = 7

  5. Reverse-triangle gives:

    uv‖₂ ≥ |‖u‖₂ − ‖v‖₂| = 3

  6. So:

    3 ≤ ‖uv‖₂ ≤ 7

Insight: Even without knowing directions, the norm axioms let you bound outcomes. In algorithms, these bounds can guarantee stability (values can’t blow up unexpectedly).

How norm choice changes which point is closer #

Let a = (2, 0) and b = (1, 1). Compare their distances to the origin under L1, L2, and L∞.

  1. Compute norms for a:

    a‖₁ = |2| + |0| = 2

    a‖₂ = √(2² + 0²) = √4 = 2

    a‖∞ = max(|2|, |0|) = 2

  2. Compute norms for b:

    b‖₁ = |1| + |1| = 2

    b‖₂ = √(1² + 1²) = √2 ≈ 1.414

    b‖∞ = max(|1|, |1|) = 1

  3. Compare:

    • •Under L1: tie (both 2)
    • •Under L2: b is closer (≈ 1.414 < 2)
    • •Under L∞: b is closer (1 < 2)

Insight: Two points can be equally close under one norm and not under another. That’s why changing the norm can change nearest neighbors and therefore clustering results.

Key Takeaways #

Common Mistakes #

Practice #

easy

Let v = (−1, 2, −2, 4). Compute ‖v‖₁, ‖v‖₂, and ‖v‖∞.

Hint: Use absolute values for L1 and L∞. For L2, square first, then sum, then take √.

Show solution

v‖₁ = |−1|+|2|+|−2|+|4| = 1+2+2+4 = 9.

v‖₂ = √((−1)²+2²+(−2)²+4²) = √(1+4+4+16) = √25 = 5.

v‖∞ = max(1,2,2,4) = 4.

medium

Suppose ‖u‖₁ = 10 and ‖v‖₁ = 6. Give the tightest bounds you can (using only norm axioms) for ‖u + v‖₁.

Hint: Use triangle inequality for an upper bound and the reverse-triangle form for a lower bound.

Show solution

Upper bound: ‖u+v‖₁ ≤ ‖u‖₁ + ‖v‖₁ = 16.

Lower bound: |‖u‖₁ − ‖v‖₁| ≤ ‖u+v‖₁ ⇒ ‖u+v‖₁ ≥ |10−6| = 4.

So 4 ≤ ‖u+v‖₁ ≤ 16.

medium

Find a nonzero vector v ∈ ℝ² such that ‖v‖₁ = 2 but ‖v‖∞ = 1. Then compute ‖v‖₂.

Hint: You want the max coordinate magnitude to be 1, and the sum of absolute values to be 2.

Show solution

One example is v = (1, 1). Then ‖v‖₁ = |1|+|1|=2 and ‖v‖∞=max(1,1)=1.

Compute L2: ‖v‖₂ = √(1²+1²) = √2.

Connections #

Prerequisite reinforcement: Dot Product

Unlocks and next steps: Clustering

Related ideas you’ll likely meet soon:

Quality: B (4.0/5)

← back to treebrowse all →