Clustering

←Back to Tech Tree

inventorycoverage

Clustering #

Machine LearningDifficulty: ★★★★☆Depth: 9Unlocks: 0

K-means, hierarchical clustering. Unsupervised grouping.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

k (number of clusters)mu_j (centroid or prototype of cluster j)

Essential Relationships #

Prerequisites (2) #

Machine Learning Introduction5 atomsNorms6 atoms

Referenced by (3) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (3) #

[customer segmentationBusiness

Customer segmentation is the canonical business application of clustering algorithms - K-means, hierarchical clustering, and density-based methods partition customers into groups by behavioral and demographic features](/business/customer-segmentation/)[target audienceBusiness

Target audience identification is fundamentally a segmentation problem - partitioning a population into groups by shared attributes. Clustering (k-means, hierarchical) is the direct mathematical technique for discovering these natural groupings in customer data.](/business/target-audience/)[targeted marketingBusiness

K-means and hierarchical clustering are the core mathematical techniques for customer segmentation - grouping customers into behaviorally or demographically similar clusters that can then be targeted with tailored marketing](/business/targeted-marketing/)

Advanced Learning Details

Graph Position #

123

Depth Cost

0

Fan-Out (ROI)

0

Bottleneck Score

9

Chain Length

Cognitive Load #

6

Atomic Elements

58

Total Elements

L4

Percentile Level

L4

Atomic Level

All Concepts (28) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

When you don’t have labels, clustering is the workhorse tool for discovering structure: it tries to answer “which points belong together?” using only a notion of similarity (distance).

TL;DR:

Clustering groups data points so that points in the same group are more similar than points in different groups. K-means does this by minimizing within-cluster squared distances to centroids μⱼ (fast, simple, but assumes roughly spherical clusters and needs k). Hierarchical clustering builds a dendrogram by repeatedly merging clusters using a linkage rule (more flexible shapes, but slower; you choose a cut level instead of k). Distance choice, scaling, initialization, and validation (e.g., silhouette) determine whether clusters are meaningful.

What Is Clustering? #

Clustering is an unsupervised learning task: you’re given data points x₁,…,xₙ (no labels), and you want to group them into “clusters” so that points in the same group are “similar,” and points in different groups are “dissimilar.”

The key idea is that clustering is not a single problem—it’s a family of problems defined by choices you make:

  1. Representation: what is a data point? (a vector x ∈ ℝᵈ? a document? an image embedding?)

  2. Similarity / distance: how do we measure “close”? (‖xy‖₂, cosine distance, etc.)

  3. Cluster model: what does a cluster “look like”? (a centroid-based ball, a connected component, a hierarchy)

  4. Objective / procedure: what criterion do we optimize or what rule do we follow?

A practical definition you can use:

Notice what’s missing: there is no universal “correct” clustering without extra assumptions. Two different distance metrics can yield different but equally defensible clusterings.

Why clustering is useful #

Clustering shows up in many places because it provides structure without labels:

The geometry intuition #

If your data are vectors in ℝᵈ, distance-based clustering is fundamentally geometric.

This is why we study multiple clustering paradigms. In this lesson, we’ll focus on two core methods:

We’ll also treat clustering as a pipeline: choose distance + scale features, run algorithm, then validate and interpret.

Notation we’ll use #

The major theme: clustering is about optimizing or constructing a grouping consistent with your similarity notion—and knowing when that grouping is trustworthy.

Core Mechanic 1: K-means as an Optimization Problem #

K-means is the classic centroid-based clustering algorithm. It is popular because it is simple, fast, and often surprisingly effective.

Why K-means works (motivation first) #

Suppose you believe each cluster can be summarized by a “typical” point—its centroid—and that points should belong to the cluster whose centroid is closest.

This belief translates into an objective: choose k centroids and assignments so that points are close to their assigned centroid.

The K-means objective #

Given assignments cᵢ and centroids μⱼ, the within-cluster sum of squares (WCSS) is:

J(c, μ) = ∑ᵢ ‖xᵢ − μ_{cᵢ}‖₂²

K-means solves:

minimize over assignments cᵢ and centroids μⱼ:

min J(c, μ) = ∑ᵢ ‖xᵢ − μ_{cᵢ}‖₂²

This is sometimes described as “minimizing within-cluster variance.”

Why squared Euclidean distance? #

But it also implies a geometric assumption: clusters are roughly spherical (balls) in Euclidean space.

The two-step alternating minimization #

K-means uses an iterative procedure (often called Lloyd’s algorithm):

  1. Assignment step (E-step-like):

Fix centroids μⱼ. Assign each point to the nearest centroid:

cᵢ ← argminⱼ ‖xᵢ − μⱼ‖₂²

  1. Update step (M-step-like):

Fix assignments cᵢ. Update each centroid to the mean of points in that cluster:

μⱼ ← (1 / |Cⱼ|) ∑_{i: cᵢ = j} x

These steps repeat until assignments stop changing or the objective decreases only slightly.

Show-your-work: why the centroid is the mean #

Focus on one cluster j with points {xᵢ : cᵢ = j}. We want μ that minimizes:

f(μ) = ∑_{i: cᵢ=j} ‖xᵢ − μ‖₂²

Expand using dot products:

xᵢ − μ‖₂² = (xᵢ − μ)·(xᵢ − μ)

= xᵢ·xᵢ − 2 xᵢ·μ + μ·μ

So

f(μ) = ∑ (xᵢ·xᵢ) − 2 ∑(xᵢ·μ) + ∑(μ·μ)

Let m = |Cⱼ|. Note ∑(μ·μ) = m(μ·μ).

Take the gradient with respect to μ:

∇ f(μ) = -2 ∑ xᵢ + 2 m μ

Set ∇ f(μ) = 0:

-2 ∑ xᵢ + 2 m μ = 0

⇒ m μ = ∑ x

μ = (1/m) ∑ x

So the centroid update is not a heuristic—it is the exact minimizer of the squared-distance objective given fixed assignments.

Convergence (what it guarantees and what it doesn’t) #

Each iteration:

So J decreases monotonically and the algorithm converges in finitely many steps to a local minimum.

But:

Initialization and k-means++ #

Random initialization (choose k random points as initial μⱼ) can be unstable.

k-means++ improves this by spreading initial centroids apart:

  1. Pick one centroid uniformly from the data.

  2. For each point x, compute D(x) = distance to nearest chosen centroid.

  3. Choose next centroid with probability proportional to D(x)².

  4. Repeat until k centroids are chosen.

This tends to reduce bad starts and usually improves results with minimal overhead.

Choosing k (model selection) #

K-means requires k upfront, which is often the hardest practical choice.

Common heuristics:

Silhouette score (intuition) #

For a point i:

Silhouette:

sᵢ = (bᵢ − aᵢ) / max(aᵢ, bᵢ)

Average sᵢ over i gives a score for k.

Scaling features (often the make-or-break detail) #

K-means uses Euclidean distance, so feature scaling matters.

If one feature has a large numeric range, it dominates distances.

Typical fix: standardize each feature:

x' = (x − mean) / std

Without scaling, you may cluster by “units” rather than by actual structure.

When K-means is a good fit #

When K-means struggles #

K-means is a sharp tool—but it’s sharp in a particular shape. Next we’ll study hierarchical clustering, which often provides more flexibility and interpretability via a dendrogram.

Core Mechanic 2: Hierarchical Clustering and Linkage Rules #

Hierarchical clustering builds a nested family of clusters rather than committing to a single partition with a fixed k.

Why hierarchical clustering (motivation) #

Sometimes you don’t believe the data has one “correct” k. Instead, you might believe there are multiple meaningful scales:

Hierarchical clustering captures this by producing a dendrogram (a tree) that shows how clusters merge (or split) as you change the similarity threshold.

There are two main variants:

Agglomerative clustering is more common in practice, so we’ll focus on it.

The agglomerative algorithm (conceptual steps) #

  1. Start with n clusters: { {1}, {2}, …, {n} }

  2. Compute distances between clusters using a linkage rule

  3. Merge the two closest clusters

  4. Update distances and repeat until one cluster remains

The output is a dendrogram. To get a final clustering, you cut the dendrogram at a chosen height (distance threshold) or choose a number of clusters.

Distances between clusters: linkage #

Hierarchical clustering needs a distance between sets of points (clusters), not just between individual points.

Let A and B be two clusters.

Common linkage rules:

LinkageCluster distance d(A,B)Behavior / bias
Singlemin_{x∈A, y∈B} ‖xyCan create “chains”; finds elongated shapes
Completemax_{x∈A, y∈B} ‖xyPrefers compact clusters; sensitive to outliers
Average(1/(A
Wardincrease in within-cluster SSE after mergeSimilar spirit to k-means; prefers spherical clusters

Ward’s linkage (connection to variance minimization) #

Ward’s method chooses the merge that causes the smallest increase in total within-cluster sum of squares.

If you define the within-cluster SSE for a cluster C with centroid μ_C:

SSE(C) = ∑_{x∈C} ‖xμ_C‖₂²

Ward chooses to merge A and B that minimize:

Δ(A,B) = SSE(A∪B) − SSE(A) − SSE(B)

This makes hierarchical clustering feel more “k-means-like,” but it still yields a full hierarchy.

Reading a dendrogram (what it actually tells you) #

In a dendrogram:

A “natural” clustering often appears as:

Cutting the tree at a height between those regimes yields a partition.

Complexity and scaling #

Hierarchical clustering is typically more expensive than k-means:

So it’s often used for:

Distance metrics still matter #

Hierarchical clustering can use many distances:

Different distances can change the dendrogram dramatically. Scaling features is still crucial whenever numeric ranges differ.

Strengths and weaknesses vs K-means #

Hierarchical clustering:

But:

K-means:

But:

You’ll often try both: use hierarchical clustering to understand structure (choose k), then run k-means for a scalable final partition.

Application/Connection: A Practical Clustering Workflow (and How to Trust It) #

Clustering algorithms always produce some grouping—even on random noise. The real skill is building a workflow that makes clusters meaningful and defensible.

1) Clarify the goal #

Different goals imply different choices:

Ask: what would count as a “good” cluster in your domain?

2) Choose a representation #

Raw features might not reflect similarity.

Examples:

Clustering works best when your representation makes “similarity” meaningful geometrically.

3) Scale / normalize #

If using Euclidean distance, scale each feature to comparable units.

Typical options:

MethodFormulaWhen to use
Standardizationx' = (x − mean)/stdMost common; good default
Min-max scalingx' = (x − min)/(max − min)When bounds matter
L2 normalizationx' = x/‖x‖₂For cosine-like comparisons on direction

4) Pick an algorithm and distance #

A useful mental map:

If your clusters are…Try…
compact blobsK-means or Ward hierarchical
possibly elongated / connectedSingle/average linkage hierarchical
you want multi-scale structurehierarchical clustering
very large nK-means (with k-means++)

(There are other families like DBSCAN, GMMs, spectral clustering, but this node focuses on K-means + hierarchical.)

5) Choose k or a cut #

For K-means #

For hierarchical #

6) Validate: internal, stability, and external checks #

Validation is not just a single number.

Internal metrics (no labels) #

Internal metrics can be misleading if your distance metric doesn’t match your semantic goal.

Stability checks #

Run clustering under perturbations:

If cluster assignments drastically change, the structure may be weak.

External checks (domain-based) #

Even without labels, you can often check:

7) Interpret clusters #

For each cluster j, compute:

Interpretation is often the main product of clustering.

8) Common failure modes (so you can detect them) #

A practical mitigation in high dimensions is to reduce dimension first (e.g., PCA) for clustering—carefully, because projections can also distort structure.

Connection between K-means and hierarchical (a useful mental bridge) #

So if you think “variance within clusters” is the right notion, both methods can align:

The core professional habit: treat clustering as an iterative scientific process, not a one-click algorithm.

Worked Examples (3) #

K-means on a tiny 2D dataset (one full iteration, with real numbers) #

Data points in ℝ²:

x₁=(1,1), x₂=(1,2), x₃=(4,4), x₄=(5,4)

Choose k=2. Initialize centroids:

μ₁=(1,1), μ₂=(5,4). Use squared Euclidean distance.

  1. Assignment step: compute ‖xᵢ−μⱼ‖₂².

    For x₁=(1,1):

    ‖(1,1)−(1,1)‖² = 0

    ‖(1,1)−(5,4)‖² = (−4)²+(−3)² = 16+9 = 25

    ⇒ c₁=1.

    For x₂=(1,2):

    ‖(1,2)−(1,1)‖² = 0²+1² = 1

    ‖(1,2)−(5,4)‖² = (−4)²+(−2)² = 16+4 = 20

    ⇒ c₂=1.

    For x₃=(4,4):

    ‖(4,4)−(1,1)‖² = 3²+3² = 9+9 = 18

    ‖(4,4)−(5,4)‖² = (−1)²+0² = 1

    ⇒ c₃=2.

    For x₄=(5,4):

    ‖(5,4)−(1,1)‖² = 4²+3² = 16+9 = 25

    ‖(5,4)−(5,4)‖² = 0

    ⇒ c₄=2.

  2. Update step: recompute centroids as the mean of assigned points.

    Cluster 1 has x₁, x₂:

    μ₁ ← ( (1,1) + (1,2) ) / 2 = ( (2,3) ) / 2 = (1, 1.5)

    Cluster 2 has x₃, x₄:

    μ₂ ← ( (4,4) + (5,4) ) / 2 = ( (9,8) ) / 2 = (4.5, 4)

  3. Objective after update (optional check):

    J = ∑ᵢ ‖xᵢ − μ_{cᵢ}‖²

    For cluster 1 with μ₁=(1,1.5):

    ‖(1,1)−(1,1.5)‖² = 0²+(−0.5)² = 0.25

    ‖(1,2)−(1,1.5)‖² = 0²+(0.5)² = 0.25

    For cluster 2 with μ₂=(4.5,4):

    ‖(4,4)−(4.5,4)‖² = (−0.5)²+0² = 0.25

    ‖(5,4)−(4.5,4)‖² = (0.5)²+0² = 0.25

    Total J = 1.0

Insight: K-means alternates between “nearest centroid” assignments and “mean of assigned points” updates. The arithmetic mean appears because it exactly minimizes the sum of squared Euclidean distances within each cluster.

Why feature scaling changes the clustering (distance domination example) #

Two features: height (cm) and income (USD). Consider two people:

A: (170 cm, 50,000)

B: (180 cm, 50,500)

C: (171 cm, 120,000)

Use Euclidean distance on raw features vs standardized features.

  1. Raw distances:

    Between A and B:

    Δheight=10, Δincome=500

    ‖A−B‖₂ = √(10² + 500²) = √(100 + 250,000) = √250,100 ≈ 500.10

    Between A and C:

    Δheight=1, Δincome=70,000

    ‖A−C‖₂ = √(1² + 70,000²) ≈ 70,000.00

  2. Observation: income dwarfs height.

    Even a 10 cm difference barely matters compared to $500.

    So clustering will be driven almost entirely by income scale, not necessarily by what you intend.

  3. Standardize each feature (conceptually):

    Let height'=(height−mean)/std_height and income'=(income−mean)/std_income.

    After standardization, typical variations in height and income become comparable (unitless).

  4. Resulting effect:

    Distances become sensitive to both features.

    Now A is close to C in height, but far in income; whether A clusters with B or C depends on relative standardized differences, not raw units.

Insight: Clustering is only as meaningful as your distance metric—and Euclidean distance is extremely sensitive to feature scale. Standardization is often not optional; it defines what “similar” means.

Agglomerative hierarchical clustering by hand (single linkage on 1D points) #

Points on a line (ℝ¹): {0, 2, 3, 10}. Use single linkage with Euclidean distance. Build the merge sequence and interpret a dendrogram cut.

  1. Start with singleton clusters:

    {0}, {2}, {3}, {10}

    Pairwise point distances:

    |0−2|=2, |0−3|=3, |0−10|=10

    |2−3|=1, |2−10|=8

    |3−10|=7

  2. Single linkage distance between clusters is the minimum pairwise distance.

    Closest pair is {2} and {3} with distance 1.

    Merge:

    C₁ = {2,3} at height 1.

    Now clusters: {0}, {2,3}, {10}

  3. Compute distances:

    d({0},{2,3}) = min(|0−2|,|0−3|)=min(2,3)=2

    d({10},{2,3}) = min(|10−2|,|10−3|)=min(8,7)=7

    d({0},{10}) = 10

    Closest is {0} and {2,3} at distance 2.

    Merge:

    C₂ = {0,2,3} at height 2.

    Now clusters: {0,2,3}, {10}

  4. Final merge:

    d({0,2,3},{10}) = min(|10−0|,|10−2|,|10−3|)=min(10,8,7)=7

    Merge at height 7.

    Dendrogram heights: 1, 2, 7.

  5. Interpretation by cutting:

    If you cut at height between 2 and 7 (e.g., 4), you get 2 clusters:

    {0,2,3} and {10}.

    If you cut below 2 (e.g., 1.5), you get 3 clusters:

    {0}, {2,3}, {10}.

Insight: Hierarchical clustering gives you a family of clusterings. The linkage rule controls the geometry: single linkage tends to merge via nearest neighbors, which can produce chained clusters.

Key Takeaways #

Common Mistakes #

Practice #

easy

You have points x₁=(0,0), x₂=(0,1), x₃=(5,5), x₄=(6,5). Run one full K-means iteration with k=2 starting from μ₁=(0,0), μ₂=(6,5). Give assignments and updated centroids.

Hint: Compute squared distances to each centroid, assign each point to the closer one, then average points in each cluster to update μⱼ.

Show solution

Assignments by squared distance:

x₁ to μ₁ (0 vs large)

x₂ to μ₁ (1 vs large)

x₃ to μ₂ (distance to (6,5) is 1; to (0,0) is 50)

x₄ to μ₂ (0 vs 61)

Updated centroids:

μ₁ = ((0,0)+(0,1))/2 = (0, 0.5)

μ₂ = ((5,5)+(6,5))/2 = (5.5, 5)

medium

For hierarchical clustering, explain (in 2–4 sentences) how single linkage and complete linkage can produce different cluster shapes. Then give one scenario where single linkage is a poor choice.

Hint: Think about min vs max distances between points across clusters, and how a single ‘bridge’ point can connect groups.

Show solution

Single linkage uses d(A,B)=min distance between points, so clusters can merge through nearest-neighbor chains and form long, thin shapes. Complete linkage uses d(A,B)=max distance, which discourages merges that would create large diameters and thus prefers compact clusters. Single linkage is a poor choice when noise points create ‘bridges’ between two real groups, causing them to chain together into one cluster.

hard

Derive the K-means centroid update: show that μ that minimizes f(μ) = ∑_{i=1}^m ‖xᵢ − μ‖₂² is the mean (1/m)∑ᵢ xᵢ.

Hint: Expand the squared norm, take ∇ with respect to μ, set it to 0, and solve for μ.

Show solution

f(μ) = ∑ᵢ (xᵢ−μ)·(xᵢ−μ) = ∑ᵢ (xᵢ·xᵢ) − 2∑ᵢ(xᵢ·μ) + ∑ᵢ(μ·μ).

Let m be the number of points. Then ∑ᵢ(μ·μ) = m(μ·μ).

Take gradient:

∇f(μ) = -2∑ᵢ xᵢ + 2mμ.

Set ∇f(μ) = 0:

-2∑ᵢ xᵢ + 2mμ = 0 ⇒ mμ = ∑ᵢ xᵢ ⇒ μ = (1/m)∑ᵢ xᵢ.

Connections #

Machine Learning Introduction

Norms

Principal Component Analysis (PCA)

Gaussian Mixture Models

DBSCAN

Silhouette Score and Cluster Validation

Quality: A (4.4/5)

← back to treebrowse all →