Dimensionality Reduction

←Back to Tech Tree

inventorycoverage

Dimensionality Reduction #

Machine LearningDifficulty: ★★★★☆Depth: 11Unlocks: 0

Reducing feature dimensions. PCA, t-SNE, autoencoders.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

f: R^D -> R^d (d << D)

Essential Relationships #

Prerequisites (2) #

Principal Component Analysis6 atomsNeural Networks6 atoms

Advanced Learning Details

Graph Position #

220

Depth Cost

0

Fan-Out (ROI)

0

Bottleneck Score

11

Chain Length

Cognitive Load #

5

Atomic Elements

52

Total Elements

L4

Percentile Level

L3

Atomic Level

All Concepts (21) #

Teaching Strategy #

Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.

Modern datasets often live in spaces with hundreds, thousands, or millions of features—but the structure we care about is frequently much simpler. Dimensionality reduction is the art of finding that simpler structure without throwing away what matters.

TL;DR:

Dimensionality reduction learns a mapping f: ℝᴰ → ℝᵈ (with d ≪ D) that embeds high-dimensional points into a lower-dimensional space while preserving a chosen property (variance, distances, neighborhoods, or reconstruction). Linear methods (PCA) preserve global variance; nonlinear methods like t-SNE preserve local neighborhoods for visualization; autoencoders learn task-driven nonlinear embeddings by compressing and reconstructing data.

What Is Dimensionality Reduction? #

Why it exists (motivation first) #

High-dimensional feature spaces create three recurring problems:

  1. Computation and storage: Many ML algorithms scale poorly with D. Pairwise distances, covariance matrices, nearest-neighbor search, and kernel methods can become expensive.

  2. Generalization and sample efficiency: With limited data, many degrees of freedom invite overfitting. In broad strokes, if your model can wiggle in D directions, you need enough data to constrain those directions.

  3. Geometry gets weird in high dimensions (“curse of dimensionality”): Distances concentrate, neighborhoods become sparse, and local density estimation becomes hard. Many intuitive operations (like “find the nearest neighbors”) degrade unless the data truly lies on a lower-dimensional structure.

Dimensionality reduction aims to address these issues by learning an embedding that is lower-dimensional but still useful.

The core definition #

You are given data points x ∈ ℝᴰ. Dimensionality reduction chooses a smaller dimension d (typically 2, 3, 10, 50, 256, …) and learns a mapping

f: ℝᴰ → ℝᵈ with d ≪ D

so that the embedding z = f(x) retains something you care about.

The key phrase is “retain something you care about.” Dimensionality reduction is not one algorithm; it is a design pattern:

Three atomic concepts (the “tech tree” lens) #

  1. Low-dimensional embedding
  1. Preservation criterion
  1. Mapping mechanism

A helpful taxonomy #

Dimensionality reduction methods often differ along two axes:

AxisOption AOption B
MappingParametric (explicit f; can embed new points)Nonparametric (embeds training set; out-of-sample is nontrivial)
Geometry preservedGlobal (variance, long-range distances)Local (neighbors, manifold structure)

And a third practical axis:

GoalTypical method
Visualization (2D/3D plots)t-SNE, UMAP, PCA (as baseline)
Compression for downstream modelsPCA, autoencoders
Denoising / representation learningDenoising autoencoders, contractive autoencoders
Preprocessing for clustering / kNNPCA, UMAP (careful), sometimes random projections

What dimensionality reduction is not #

The central skill: match the preservation criterion to your end use.

Core Mechanic 1: Preservation Criteria (What are you trying to keep?) #

Why criteria matter #

If you only remember one thing: dimensionality reduction is defined by what it preserves. Two embeddings with the same d can behave radically differently depending on the criterion.

A useful way to think:

Different algorithms pick different L.

Criterion A: Preserve variance (PCA-style, global, linear) #

Even though you already know PCA, it’s worth anchoring it as the prototypical global criterion.

Let centered data be = xμ.

Pick an orthonormal basis W = [w₁ … wᵈ] ∈ ℝᴰ×ᵈ with WW = I.

Projection:

PCA chooses W to maximize captured variance:

maximize ∑ᵢ ‖Wᵢ‖²

Equivalently, minimize reconstruction error:

minimize ∑ᵢ ‖ᵢ − W Wᵢ‖²

Why this matters:

Criterion B: Preserve pairwise distances (MDS-style) #

Suppose you care about distances between points. Let δᵢⱼ be distances in the original space (often Euclidean):

δᵢⱼ = ‖xᵢ − xⱼ‖

You want:

zᵢ − zⱼ‖ ≈ δᵢⱼ

A classic objective (stress) is:

Stress = ∑ᵢ<ⱼ (‖zᵢ − zⱼ‖ − δᵢⱼ)²

This is global: it tries to match all pairwise distances. In high dimensions, however, global distance preservation in very low d can be impossible, so tradeoffs are inevitable.

Criterion C: Preserve local neighborhoods (t-SNE/UMAP-style) #

Visualization methods often care more about “who is near whom” than exact distances.

t-SNE does this via probability distributions.

  1. Convert high-D distances into conditional neighbor probabilities:

pⱼ|ᵢ ∝ exp(−‖xᵢ − xⱼ‖² / (2σᵢ²))

σᵢ is chosen so that each point has a roughly fixed effective number of neighbors (the perplexity parameter controls this).

  1. Define low-D similarities with a heavy-tailed distribution (Student-t):

qᵢⱼ ∝ (1 + ‖zᵢ − zⱼ‖²)⁻¹

  1. Minimize KL divergence so that neighbors in high-D stay neighbors in low-D:

KL(P ‖ Q) = ∑ᵢ ∑ⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ)

What this criterion buys you:

What it does not guarantee:

UMAP has a similar “local connectivity” spirit but uses a graph/fuzzy simplicial set objective; practically, it often preserves more global structure than t-SNE, but still primarily optimizes neighborhood relations.

Criterion D: Preserve reconstructability (autoencoder-style) #

If your downstream tasks require information beyond neighborhoods—say, you need a compressed representation that can regenerate the input—then reconstruction is a natural criterion.

An autoencoder learns:

with embedding z = f_θ(x) and reconstruction = g_φ(z).

Typical loss:

L(θ, φ) = ∑ᵢ ‖xᵢ − g_φ(f_θ(xᵢ))‖²

or for images, sometimes cross-entropy / perceptual losses.

This criterion is task-agnostic but not “semantics-agnostic”: the learned z depends strongly on network architecture, bottleneck size d, regularization, and data.

Choosing a criterion: a decision table #

If you need…Preserve…Typical choiceNotes
Fast linear compression, denoising, baselineVariance / reconstruction (linear)PCAStrong baseline; interpretable components
2D/3D visualization of clustersLocal neighborhoodst-SNE, UMAPDo not over-interpret global geometry
Approximate distance geometryPairwise distancesMDS / IsomapHard in very low d; can distort heavily
Nonlinear features for downstream modelsReconstruction + regularizationAutoencoder / VAEParametric; can embed new data

A practical workflow is often: PCA → (optional) t-SNE/UMAP for visualization; or PCA → autoencoder for nonlinear compression.

Core Mechanic 2: Mapping Mechanisms (How do you produce the embedding?) #

Why mapping mechanisms matter #

Two methods can preserve “local neighborhoods,” but one may:

This matters whenever you want to deploy an embedding in production, feed it into a classifier, or update it as new data arrives.

Mechanism A: Linear projections #

The simplest parametric mapping:

f(x) = Wᵀ(xμ)

Properties:

When it shines:

Mechanism B: Kernel methods (brief positioning) #

Kernel PCA (and related methods) implicitly map data to a high-dimensional feature space via a kernel k(x, x′) and then do PCA there.

It enables nonlinear embeddings while keeping an eigenvector-based formulation.

Tradeoffs:

This node focuses on PCA/t-SNE/autoencoders, but kernel ideas explain a common theme: nonlinear geometry without explicit neural networks.

Mechanism C: Nonparametric neighborhood embedding (classic t-SNE) #

Classic t-SNE learns {zᵢ} directly for the dataset. There is no explicit f.

Consequences:

So, think of t-SNE as: optimize coordinates, not a function.

Mechanism D: Neural network encoders (autoencoders) #

An autoencoder defines an explicit parametric mapping f_θ.

A typical structure:

Because f_θ is explicit:

But because it is flexible:

Regularizing the mapping (making embeddings usable) #

A pure reconstruction objective can yield embeddings that are hard to use for clustering or interpolation. Common fixes:

  1. Denoising autoencoder

Loss:

L = ∑ᵢ ‖xᵢ − g_φ(f_θ(ᵢ))‖²

Effect: forces z to capture robust structure.

  1. Contractive penalty (encourage local smoothness)

Penalize Jacobian magnitude:

L = reconstruction + λ ‖∂f_θ(x)/∂x‖_F²

Interpretation: small input changes shouldn’t cause wild latent changes.

  1. Variational autoencoder (VAE) (probabilistic latent space)

Instead of a point z, encoder outputs a distribution q_θ(z|x) and regularizes it toward a prior p(z) (often 𝒩(0, I)).

Objective (ELBO):

maximize ∑ᵢ 𝔼_{q_θ(z|xᵢ)}[log p_φ(xᵢ|z)] − KL(q_θ(z|xᵢ) ‖ p(z))

Effect: more continuous, “organized” latent spaces—useful for generation and interpolation.

A practical comparison table #

Methodf: ℝᴰ → ℝᵈ explicit?PreservesStrengthWeakness
PCAYes (linear)Global variance / linear reconstructionFast, stable, interpretableMisses nonlinear structure
t-SNENo (classic)Local neighborhoodsExcellent 2D/3D cluster visualizationGlobal geometry unreliable; hard out-of-sample
AutoencoderYes (nonlinear)Reconstruction (plus whatever regularization encourages)Parametric, flexible, scalableNeeds tuning; can learn unhelpful latents

The high-level skill: choose the simplest mapping mechanism that can satisfy your preservation criterion.

Applications and Connections (How it’s used in real ML pipelines) #

1) Visualization as a diagnostic tool #

A 2D embedding is often used to answer questions like:

Best practice:

Why the PCA pre-step helps:

Interpretation warnings (especially t-SNE):

2) Compression for downstream learning #

Using z as input to a classifier/regressor can:

PCA is common when:

Autoencoders are common when:

3) Nearest-neighbor search and retrieval #

Many retrieval systems use embeddings:

Note: This often becomes metric learning, where the preservation criterion is label- or relevance-driven (contrastive/triplet losses). Dimensionality reduction is the geometric core: you still learn f: ℝᴰ → ℝᵈ.

4) Denoising and anomaly detection #

If you learn a low-dimensional model of “normal” data, anomalies often reconstruct poorly.

Autoencoder anomaly score:

score(x) = ‖x − g_φ(f_θ(x))‖

Caveat: Powerful autoencoders can sometimes reconstruct anomalies too well. Regularization and capacity control matter.

5) Manifold hypothesis and representation learning #

A guiding intuition: real-world data often lies near a manifold of much smaller intrinsic dimension than D.

But “manifold” is a hypothesis, not a guarantee. Dimensionality reduction is how you test whether a low-dimensional structure exists and is useful.

6) How to validate an embedding #

Validation depends on your purpose:

A simple sanity checklist:

  1. Standardize or whiten features when appropriate.

  2. Fit on train set only; apply transform to validation/test.

  3. Don’t select d using test-set performance.

  4. For t-SNE: avoid drawing conclusions from global geometry.

Dimensionality reduction is not a one-time trick; it is a model choice that should be evaluated like any other.

Worked Examples (3) #

Example 1: From PCA variance to reconstruction error (why PCA is a compression method) #

You have centered data points {ᵢ} in ℝᴰ and choose an orthonormal projection matrix W ∈ ℝᴰ×ᵈ (so WW = I). Show that maximizing captured variance ∑ᵢ ‖Wᵢ‖² is equivalent to minimizing reconstruction error ∑ᵢ ‖ᵢ − W****Wᵢ‖².

  1. Define the projection and reconstruction:

    zᵢ = W

    x̂̃ᵢ = W zᵢ = W****W

  2. Write reconstruction error for one point:

    ᵢ − x̂̃ᵢ‖² = ‖ᵢ − W****Wᵢ‖²

  3. Use orthogonal decomposition: since W****Wᵀ is the projector onto the span of columns of W,

    ᵢ = (W****Wᵢ) + ( ( IW****Wᵀ ) ᵢ )

    The two components are orthogonal.

  4. Apply Pythagoras:

    ᵢ‖² = ‖W****Wᵢ‖² + ‖( IW****Wᵀ )ᵢ‖²

  5. Notice the reconstruction residual is exactly the orthogonal component:

    ᵢ − W****Wᵢ‖² = ‖( IW****Wᵀ )ᵢ‖²

  6. So rearrange:

    ᵢ − W****Wᵢ‖² = ‖ᵢ‖² − ‖W****Wᵢ‖²

  7. Sum over i:

    ∑ᵢ ‖ᵢ − W****Wᵢ‖² = ∑ᵢ ‖ᵢ‖² − ∑ᵢ ‖W****Wᵢ‖²

  8. Use WW = I to simplify the retained term:

    W****Wᵢ‖² = ‖Wᵢ‖²

    (because projecting then measuring norm equals norm of coordinates in an orthonormal basis).

  9. Final equivalence:

    ∑ᵢ ‖ᵢ − W****Wᵢ‖² = constant − ∑ᵢ ‖Wᵢ‖²

    So minimizing reconstruction error is equivalent to maximizing captured variance.

Insight: PCA’s “maximize variance” and “minimize squared reconstruction error” are the same problem because orthogonal projection splits each point into a kept part plus an orthogonal residual, and squared norms add cleanly.

Example 2: What t-SNE is optimizing (local neighborhood preservation via KL divergence) #

You have points xᵢ in ℝᴰ and want a 2D embedding zᵢ in ℝ² for visualization. Explain how t-SNE turns distances into probabilities and why the KL(P ‖ Q) objective emphasizes preserving local neighbors.

  1. Start from the design goal: keep local neighborhoods.

    Instead of trying to match all distances, define “neighborliness” as a probability distribution.

  2. Define conditional probabilities in high-D:

    pⱼ|ᵢ = exp(−‖xᵢ − xⱼ‖² / (2σᵢ²)) / ∑_{k≠i} exp(−‖xᵢ − x_k‖² / (2σᵢ²))

    Interpretation: points closer to i get higher probability mass.

  3. Choose σᵢ to match a target perplexity.

    Perplexity is 2^{H(P_i)} where H is Shannon entropy; practically, it sets an effective neighbor count so each point adapts to local density.

  4. Symmetrize to get joint probabilities:

    pᵢⱼ = (pⱼ|ᵢ + pᵢ|ⱼ) / (2n)

    Now P = {pᵢⱼ} describes neighbor affinities in high-D.

  5. Define low-D affinities with a heavy tail:

    qᵢⱼ = (1 + ‖zᵢ − zⱼ‖²)⁻¹ / ∑_{a≠b} (1 + ‖z_a − z_b‖²)⁻¹

    The Student-t tail prevents distant points from all collapsing together (it combats the “crowding problem”).

  6. Optimize with KL divergence:

    KL(P ‖ Q) = ∑ᵢ ∑ⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ)

    Only pairs with large pᵢⱼ (true neighbors) contribute strongly to the loss.

  7. Reason about asymmetry:

    Because it is KL(P ‖ Q), if pᵢⱼ is large but qᵢⱼ is small, the penalty is severe.

    If pᵢⱼ is tiny (non-neighbors) but qᵢⱼ is somewhat large, the penalty is mild.

    So the optimization prioritizes: “don’t lose real neighbors,” rather than “perfectly separate all non-neighbors.”

  8. Conclusion:

    t-SNE is engineered to make local structure visually salient (clusters/neighborhoods) even if global distances become distorted.

Insight: t-SNE’s choice of KL(P ‖ Q) (not KL(Q ‖ P)) is deliberate: it heavily punishes missing true neighbors, which is exactly what you want for cluster visualization—but it also explains why global geometry is not reliable.

Example 3: Autoencoder as dimensionality reduction (bottleneck, reconstruction, and why regularization matters) #

You want f_θ: ℝᴰ → ℝᵈ with d ≪ D to compress vectors x while keeping enough information for downstream tasks. You train an autoencoder with squared reconstruction loss. Explain what the bottleneck forces the model to do, and why a denoising variant often yields better embeddings.

  1. Define the model:

    Encoder: z = f_θ(x) ∈ ℝᵈ

    Decoder: = g_φ(z) ∈ ℝᴰ

  2. Train with reconstruction loss:

    L(θ, φ) = ∑ᵢ ‖xᵢ − g_φ(f_θ(xᵢ))‖²

  3. Interpret the bottleneck:

    Because d ≪ D, z cannot store arbitrary details of x.

    The network must learn a compressed code capturing the most predictable/shared factors across the dataset.

  4. Identify a failure mode:

    If the network has too much capacity (especially the decoder), it may learn a representation that reconstructs well but is not smooth or meaningful for similarity (e.g., it may encode idiosyncratic details in a brittle way).

  5. Apply denoising:

    Sample a corruption process C (Gaussian noise, masking, dropout) and form = C(x).

    Train:

    L = ∑ᵢ ‖xᵢ − g_φ(f_θ(ᵢ))‖²

  6. Explain why denoising helps:

    To reconstruct clean x from , the encoder must capture stable structure and ignore noise.

    This tends to produce embeddings z that align better with semantic factors and are more robust for downstream tasks.

Insight: Autoencoders reduce dimension by forcing information through a bottleneck. Regularization (like denoising) changes what gets preserved: instead of memorizing details, the embedding must capture robust, repeatable structure.

Key Takeaways #

Common Mistakes #

Practice #

easy

You need a 2D plot to inspect whether your labeled dataset has separable classes. Which method would you try first: PCA or t-SNE? Explain what each will preserve and what could mislead you in the visualization.

Hint: Think: global variance vs local neighborhoods; and what you want to learn from a plot.

Show solution

Start with PCA as a baseline because it is fast, deterministic, and preserves global variance (large-scale directions). If classes separate in the top PCs, that’s strong evidence of linear separability in the original space.

Then try t-SNE if PCA does not show separation but you suspect nonlinear/local structure. t-SNE preserves local neighborhoods, often making clusters visible even when PCA looks mixed.

Misleading aspects: PCA can hide nonlinear separability; t-SNE can invent visually separated clusters and its inter-cluster distances (and relative cluster sizes) are not reliable measures of true global separation.

medium

Suppose you trained a classic (nonparametric) t-SNE embedding on n points and now receive a new point x_new. Why is it nontrivial to compute z_new? Name two practical strategies to handle this situation.

Hint: Ask: does t-SNE learn f(x) or only {zᵢ}? What does “out-of-sample” mean here?

Show solution

Classic t-SNE optimizes the coordinates {zᵢ} directly; it does not learn an explicit mapping f: ℝᴰ → ℝᵈ. Therefore, there is no built-in way to embed x_new without changing the optimization problem.

Two practical strategies:

  1. Rerun (or continue) t-SNE including x_new and all prior points (expensive; may change existing positions).

  2. Learn an approximate parametric map after the fact, e.g., train a regression model or neural network to predict z from x using the existing pairs (xᵢ, zᵢ), or use parametric t-SNE variants that learn an explicit encoder.

hard

You want a nonlinear embedding z for downstream classification. You train an autoencoder with very low reconstruction error, but a classifier trained on z performs poorly compared to training on the original features. Give two reasons this can happen and one modification to your training that could improve z.

Hint: Reconstruction ≠ discriminative usefulness. Think about what information the autoencoder is incentivized to keep and how it might ignore label-relevant structure.

Show solution

Two reasons:

  1. The autoencoder optimizes reconstruction, not class separation. It may dedicate capacity to reconstructing high-variance nuisance factors (background, lighting, frequent but label-irrelevant details) and compress away subtle label-relevant cues.

  2. With high-capacity encoder/decoder, the model can reconstruct well using a latent code that is not smooth or not aligned with semantic similarity; good reconstruction can coexist with a latent space where classes are entangled.

One modification:

Use a regularized or task-informed objective, e.g. a denoising autoencoder (train to reconstruct clean x from corrupted ) to encourage robust features, or add a supervised/contrastive term so that embeddings of same-class points are closer while different-class points are farther. Either changes the preservation criterion from “just reconstruct” toward “reconstruct robustly” or “preserve label structure.”

Connections #

Prerequisites you can lean on:

Natural next nodes this unlocks conceptually:

Quality: A (4.5/5)

← back to treebrowse all →