Vector Embeddings


Vector Embeddings #

Machine Learning · Difficulty: ★★★★☆ · Depth: 1 · Unlocks: 2

Continuous vector representations that encode discrete items (words, tokens, or features) into a dense numeric space where geometric relationships reflect semantic or functional similarity. Embeddings are the typical inputs to attention layers and determine how items interact via similarity and projection.


Core Concepts #

Key Symbols & Notation #

e_x - embedding vector for item x (real-valued vector)

Essential Relationships #

Prerequisites (1) #

Cosine Similarity (6 atoms)

Unlocks (1) #

Attention Mechanisms (lvl 5)

Advanced Learning Details

Graph Position #

Depth Cost: 11

Fan-Out (ROI): 2

Bottleneck Score: 1

Chain Length: 1

Cognitive Load #

Atomic Elements: 5

Total Elements: 34

Percentile Level: L2

Atomic Level: L3

All Concepts (13) #

Teaching Strategy #

Self-serve tutorial - low prerequisites, straightforward concepts.

Modern ML models can’t directly “think” in words, IDs, or categories. They can only compute with numbers, especially vectors. Vector embeddings are the bridge: they turn discrete items into continuous vectors v ∈ ℝᵈ so that geometry (dot products, angles, distances) becomes a usable language for meaning and function.

TL;DR:

A vector embedding is a learned mapping x ↦ eₓ ∈ ℝᵈ from a discrete item (token/word/category) to a dense vector. Similar items end up with similar directions (high cosine similarity) and often similar positions (small distance). Embeddings are usually implemented as a trainable lookup table (an “embedding matrix”) and are the standard inputs to attention, where dot products between embeddings produce relevance scores.

What Are Vector Embeddings? #

Why we need them (motivation) #

Discrete items—like words, tokens, product IDs, user IDs, categorical features, or graph nodes—don’t naturally live in a space where “closeness” or “similarity” is meaningful.

Embeddings solve this by giving each item a dense vector eₓ ∈ ℝᵈ. Once items are vectors, we can compare them with dot products, cosine similarity, and distances. That makes “interaction” between items (in attention, retrieval, classification, etc.) a simple geometric computation.

Definition #

An embedding is a function (often learned) that maps a discrete item x to a continuous vector:

Emb : V → ℝᵈ, where V is the set of items (the vocabulary)

We write:

eₓ = Emb(x)

In many neural networks, Emb(·) is implemented as a lookup table (a matrix) with one vector per item.

Intuition: “meaning as location” #

The core intuition is not that embeddings store dictionary definitions. Instead:

For language, distributional learning yields the classic idea: “You know a word by the company it keeps.” If two words appear in similar contexts, a training objective will encourage their embeddings to become similar.

Dense vs sparse representations #

A one-hot vector x ∈ {0,1}^|V| is sparse: exactly one 1 and the rest 0. An embedding eₓ ∈ ℝᵈ is dense: typically all d components are nonzero.

A useful comparison:

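| Representation | Dimensionality | Similarity structure | Parameters | Pros | Cons |
|---|---|---|---|---|---|
| Integer ID | 1 | none | 0 | compact | no geometry |
| One-hot | \|V\| | orthogonal | 0 | | |
| Embedding | d (e.g., 128–4096) | learned geometry | \|V\|·d | | |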

Embeddings are “compact but expressive.” They trade fixed semantics (one-hot) for learnable semantics (geometry).
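The contrast in the table can be sketched in a few lines. This is an illustration only: the three-item vocabulary and the dense vectors are hypothetical, hand-picked values, not learned ones. It shows that distinct one-hot vectors always have dot product 0, while dense vectors can express graded similarity.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


vocab = ["cat", "dog", "car"]  # hypothetical vocabulary

# One-hot: every pair of distinct items is orthogonal, so no similarity signal.
oh = [one_hot(i, len(vocab)) for i in range(len(vocab))]
one_hot_cat_dog = dot(oh[0], oh[1])
one_hot_cat_car = dot(oh[0], oh[2])

# Dense: illustrative 2-d vectors (assumed values, not trained).
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}
dense_cat_dog = dot(dense["cat"], dense["dog"])
dense_cat_car = dot(dense["cat"], dense["car"])

print(one_hot_cat_dog, one_hot_cat_car)  # both zero
print(dense_cat_dog > dense_cat_car)     # cat is closer to dog than to car
```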

The key symbol #

We’ll use eₓ to denote the embedding vector for item x:

eₓ ∈ ℝᵈ

And we’ll often compare embeddings using cosine similarity (prerequisite):

cos(a, b) = (a · b) / (‖a‖ ‖b‖)

Cosine similarity focuses on direction, which is often what matters in learned representation spaces.
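As a small sketch of the formula above (standard library only; the example vectors are arbitrary), the function below computes cosine similarity and shows that rescaling a vector leaves the result unchanged:

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a · b) / (‖a‖ ‖b‖); undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


a = [1.0, 2.0]
b = [2.0, 4.0]   # same direction as a, twice the length
c = [-2.0, 1.0]  # perpendicular to a

same_direction = cosine_similarity(a, b)  # close to 1 regardless of length
perpendicular = cosine_similarity(a, c)   # 0 because a · c = 0
```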

What embeddings are not #

  1. Not uniquely defined: Many embedding spaces are equivalent up to rotation/reflection, because objectives depend on dot products and distances.
  2. Not guaranteed “human semantic”: They capture what helps the training loss, which may include biases or spurious correlations.
  3. Not always global coordinates: Some models care about relative comparisons (angles/dots) more than absolute axes.

Embeddings are a learned coordinate system designed to make downstream computation easy.

Core Mechanic 1: The Embedding Matrix (Lookup Table) and How It Learns #

Why a lookup table works #

When x is a discrete ID, the simplest parameterization is: assign a trainable vector to each ID. Collect these vectors into a matrix E.

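Let:

V be the vocabulary (the set of item IDs), with size |V|

d be the embedding dimension

E ∈ ℝ^{|V|×d} be the embedding matrix

Row i of E is the embedding for token i:

eᵢ = E[i, :]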

This is literally a learned table.

From one-hot to embedding: a clean algebraic view #

If x is a one-hot vector representing item i, then the embedding lookup is equivalent to matrix multiplication. With x ∈ {0,1}^{|V|} as a row vector and E ∈ ℝ^{|V|×d}, compute:

e = xE

Because x has a single 1 at index i, xE selects the i-th row.

This equivalence is useful for understanding gradients: the model updates the row(s) corresponding to the IDs it sees.
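The equivalence can be checked directly. The sketch below uses plain Python lists and a small 3×2 matrix; the helper name `vec_mat` is made up for this illustration.

```python
def vec_mat(x, E):
    """Multiply a row vector x (length n) by a matrix E (n rows, d columns)."""
    d = len(E[0])
    return [sum(x[i] * E[i][j] for i in range(len(E))) for j in range(d)]


# A tiny embedding matrix with |V| = 3 rows and d = 2 columns.
E = [
    [1.0, 0.0],
    [2.0, 1.0],
    [0.0, 3.0],
]

token_id = 1
x = [1.0 if i == token_id else 0.0 for i in range(len(E))]  # one-hot row vector

via_matmul = vec_mat(x, E)  # e = xE
via_lookup = E[token_id]    # direct row lookup

print(via_matmul == via_lookup)
```

In practice the lookup form is what gets executed, since the multiplication wastes work on the zero entries of x.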

How embeddings get trained (gradient intuition) #

Embeddings are learned because they participate in a loss. The loss might come from:

Regardless of the objective, the embedding vectors are parameters. During backprop, the gradient ∂L/∂eₓ updates eₓ.

A simple mental model:

A concrete training objective: softmax classifier from embeddings #

Consider a very common pattern: predict a label y from an embedding h (which could be eₓ or a contextual vector). Use a linear layer and softmax:

Loss for true class c:

L = −log p_c

Even if you don’t memorize softmax gradients, the key is:

So embeddings become whatever vectors make the rest of the network succeed.
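To make the gradient flow concrete, here is a minimal sketch of the softmax classifier and one gradient step on h. It assumes logits of the form z = hW + b with W of shape d × C (a row-vector convention chosen here to match e = xE); the weights, input, and learning rate are illustrative values.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad_h(h, W, b, c):
    """Cross-entropy loss L = -log p_c and its gradient with respect to h.

    h: length d, W: d x C, b: length C, c: index of the true class.
    """
    num_classes = len(b)
    z = [sum(h[i] * W[i][j] for i in range(len(h))) + b[j] for j in range(num_classes)]
    p = softmax(z)
    loss = -math.log(p[c])
    # dL/dz = p - onehot(c); dL/dh = (dL/dz) Wᵀ
    dz = [p[j] - (1.0 if j == c else 0.0) for j in range(num_classes)]
    dh = [sum(W[i][j] * dz[j] for j in range(num_classes)) for i in range(len(h))]
    return loss, dh


h = [0.5, -0.2]                           # an embedding (assumed values)
W = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]   # d = 2, C = 3 (assumed values)
b = [0.0, 0.0, 0.0]
c = 0

loss_before, dh = loss_and_grad_h(h, W, b, c)
lr = 0.1
h_new = [hi - lr * g for hi, g in zip(h, dh)]  # move h against the gradient
loss_after, _ = loss_and_grad_h(h_new, W, b, c)
print(loss_before, loss_after)
```

If h is a raw embedding eₓ, this same gradient is what updates row x of the embedding table.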

Embedding dimension d: capacity vs generalization #

Why not make d enormous?

A useful rule of thumb: pick d based on data scale, vocabulary size, and task complexity. In transformers, d often matches the model width so embeddings can be added to positional encodings and fed into attention.

Weight tying: input embeddings and output embeddings #

In language models, there are often two related matrices:

Sometimes these are tied (shared): output weight matrix W is set to Eᵀ.
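A minimal sketch of what tying means computationally, assuming the row-vector convention z = hW with W = Eᵀ: each output logit is the dot product between the hidden vector h and one row of the input embedding table. The matrix and vector values are illustrative.

```python
# Input embedding table: |V| = 3 rows, d = 2 columns (illustrative values).
E = [
    [1.0, 0.0],
    [2.0, 1.0],
    [0.0, 3.0],
]


def tied_logits(h, E):
    """Logits z = h Eᵀ: one score per vocabulary item, reusing the rows of E."""
    return [sum(hi * ei for hi, ei in zip(h, row)) for row in E]


h = [1.0, 0.5]  # a hidden vector of width d
logits = tied_logits(h, E)
print(logits)
```

No separate output matrix is stored, so the output layer adds no parameters beyond E itself.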

Why tie?

Initialization and scale #

Embeddings are usually initialized randomly with small variance. Scale matters because dot products and norms affect attention scores and softmax logits.

In attention, if dot products get too large, softmax can saturate (become too peaky). That’s one reason scaled dot-product attention divides scores by √d_k, where d_k is the key dimension.

Even before attention, stable embedding scales help optimization.
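The saturation effect is easy to see numerically. This sketch uses hypothetical 64-dimensional vectors with constant entries, chosen only so the arithmetic is easy to follow, and compares softmax weights with and without the 1/√d scaling.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


d = 64
q = [0.5] * d                                # hypothetical query vector
keys = [[0.5] * d, [0.25] * d, [0.0] * d]    # hypothetical key vectors

raw = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]  # 16, 8, 0
scaled = [s / math.sqrt(d) for s in raw]                    # 2, 1, 0

w_raw = softmax(raw)        # almost all weight on the first key
w_scaled = softmax(scaled)  # weight is spread more evenly

print([round(w, 4) for w in w_raw])
print([round(w, 4) for w in w_scaled])
```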

Special tokens and feature embeddings #

Embeddings aren’t just for words:

In all cases, the embedding is a learned vector that becomes a “handle” for the model to condition on.

Summary of this mechanic #

This gives us the basic object: eₓ. Next we’ll focus on the geometry of embeddings—what dot products and angles mean, and why that geometry becomes the substrate for attention.

Core Mechanic 2: Embedding Geometry—Similarity, Dot Products, and Distance #

Why geometry matters #

Once items are vectors, the model can compute interactions with fast linear algebra. The most common interaction is a dot product:

a · b = Σᵢ aᵢ bᵢ

Dot products are cheap, differentiable, and deeply connected to angles and lengths. This is why embeddings pair naturally with attention.

But: dot products depend on both direction and magnitude. Cosine similarity removes magnitude to focus on direction:

cos(a, b) = (a · b) / (‖a‖ ‖b‖)

So you should keep two pictures in mind:

  1. Directional similarity (angle) → cosine similarity
  2. Positional proximity (distance) → Euclidean distance

Angle vs distance #

For a, b ∈ ℝᵈ:

Expand squared distance:

‖a − b‖²

= (a − b) · (a − b)

= a·a − 2a·b + b·b

= ‖a‖² − 2(a·b) + ‖b‖²

This shows a key link:

If we L2-normalize embeddings (force ‖eₓ‖ = 1), then:

‖a − b‖² = 2 − 2(a·b)

and since a·b = cos(a, b) for unit vectors:

‖a − b‖² = 2 − 2 cos(a, b)

So for normalized embeddings:

This is why many retrieval systems store normalized embeddings.
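A quick numerical check of the identity ‖a − b‖² = 2 − 2 cos(a, b) for unit vectors, and of its practical consequence: after normalization, sorting by largest dot product and sorting by smallest distance give the same order. The vectors are arbitrary illustrative values.

```python
import math


def normalize(v):
    """Scale v to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = normalize([1.0, 2.0, 0.5])
items = [normalize(v) for v in ([1.0, 1.9, 0.4], [-1.0, 0.2, 3.0], [0.5, 0.5, 0.5])]

# For unit vectors, squared distance should equal 2 - 2 * (dot product).
identity_gaps = [abs(sq_dist(query, it) - (2 - 2 * dot(query, it))) for it in items]

# Rank items by cosine (descending) and by squared distance (ascending).
by_cosine = sorted(range(len(items)), key=lambda i: -dot(query, items[i]))
by_distance = sorted(range(len(items)), key=lambda i: sq_dist(query, items[i]))

print(by_cosine, by_distance)
```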

What makes embeddings “semantic” (or functional) #

Embeddings encode whatever similarity the training process rewards.

The geometry becomes a compressed record of these pressures.

Similarity is task-relative #

Two important consequences:

  1. If you train embeddings for sentiment classification, “good” and “great” may cluster, but also “awful” and “terrible.” That’s still semantic, but shaped by sentiment.
  2. If you train embeddings for code completion, “for” and “while” may cluster functionally.

So “semantic similarity” is better read as useful similarity under the objective.

The dot product as “compatibility” #

In many neural modules, the dot product between embeddings stands for compatibility.

This is a learned notion of compatibility because q, k, u, v come from learned embeddings or learned projections of embeddings.

Anisotropy and frequency effects #

Real embedding spaces often have quirks:

These effects can harm retrieval or similarity search because cosine similarities become less discriminative.

Mitigations include:

Regularization and constraints #

Sometimes we add constraints to shape geometry:

Each choice changes how dot products translate into probabilities.

Embeddings as points, but also as basis coordinates #

Another way to view embedding components:

You’ll occasionally find interpretable directions (e.g., sentiment), but that’s not guaranteed.

Summary of this mechanic #

Application/Connection: Embeddings as the Inputs to Attention (and Beyond) #

Why attention needs embeddings #

Attention mechanisms operate on vectors. If your input is discrete tokens, you must first map them to vectors—embeddings.

In a transformer, a typical pipeline is:

  1. Tokenize text → token IDs x₁, …, xₙ
  2. Lookup embeddings → e₁, …, eₙ where eᵢ = e_{xᵢ}
  3. Add positional information → hᵢ⁽⁰⁾ = eᵢ + pᵢ, where pᵢ is the vector for position i
  4. Apply attention layers to compute contextual vectors

So embeddings are the “raw material” that attention will mix.
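The first three pipeline steps can be sketched as follows. Everything here is a stand-in: the whitespace tokenizer, the three-word vocabulary, and the values in the token table `E` and position table `P` are hypothetical, and real models use learned tables or fixed positional encodings.

```python
vocab = {"the": 0, "cat": 1, "sat": 2}            # hypothetical vocabulary
E = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]        # token table, |V| x d (assumed values)
P = [[0.0, 0.01], [0.01, 0.0], [0.02, 0.02]]      # position table, max_len x d (assumed values)


def embed(text):
    """Tokenize by whitespace, look up token vectors, and add position vectors."""
    ids = [vocab[word] for word in text.split()]          # step 1: token IDs
    return [
        [e + p for e, p in zip(E[tok], P[pos])]            # steps 2 and 3: e_i + p_i
        for pos, tok in enumerate(ids)
    ]


h0 = embed("the cat sat")   # one d-dimensional vector per token
repeated = embed("cat cat") # same token, different positions
print(len(h0), repeated[0] != repeated[1])
```

The position term is what lets later layers distinguish the two occurrences of the same token.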

From embeddings to queries/keys/values #

Self-attention doesn’t usually compare raw embeddings directly. It projects them with learned matrices W_Q, W_K, W_V:

qᵢ = hᵢ W_Q, kᵢ = hᵢ W_K, vᵢ = hᵢ W_V

Then attention scores use dot products:

score(i, j) = (qᵢ · kⱼ) / √d_k

and weights are softmax over j.
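A minimal single-head sketch of these steps, assuming the row-vector convention q = hW_Q, k = hW_K, v = hW_V. The projection matrices here are identity matrices purely for readability (real ones are learned), and there is no masking or batching.

```python
import math


def row_times(h, W):
    """Row vector h (length n) times matrix W (n x m)."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(H, W_Q, W_K, W_V):
    """Return (outputs, weights) for scaled dot-product self-attention."""
    Q = [row_times(h, W_Q) for h in H]
    K = [row_times(h, W_K) for h in H]
    V = [row_times(h, W_V) for h in H]
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # score(i, j) = (q_i · k_j) / sqrt(d_k), then softmax over j
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # output is the weighted average of the value vectors
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return outputs, weights


H = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # three input vectors in ℝ²
identity = [[1.0, 0.0], [0.0, 1.0]]        # placeholder projections (assumption)
outputs, weights = self_attention(H, identity, identity, identity)
print([round(w, 3) for w in weights[1]])
```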

Even though q, k, v are projected, their source is the embedding space. The structure of embeddings strongly influences what the model can learn efficiently:

Embeddings beyond tokens #

Transformers also embed:

The concept is the same: a discrete unit becomes a vector so attention can compare and combine units.

Retrieval and nearest neighbors #

Embeddings are also used for retrieval:

This is the backbone of semantic search and RAG systems.

If embeddings are normalized, ranking by cosine similarity is equivalent to ranking by dot product.
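A brute-force nearest-neighbor sketch of embedding retrieval. The document names and vectors are made up for illustration; production systems typically precompute normalized vectors and use an approximate nearest-neighbor index instead of scanning every item.

```python
import math


def normalize(v):
    """Scale v to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def top_k(query, corpus, k):
    """Return the names of the k corpus items with highest cosine similarity."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), name)
        for name, vec in corpus.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical document embeddings.
corpus = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.2],
    "doc_c": [0.7, 0.3, 0.1],
}

results = top_k([1.0, 0.2, 0.0], corpus, k=2)
print(results)
```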

Recommendation systems #

User and item embeddings are classic:

A simple predictor is a dot product:

ŷ = u · v

If user u likes item v, the training objective increases u·v, pulling the two vectors into alignment.
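One way to see this alignment effect is a single gradient step. The sketch below assumes a squared-error loss (target − u·v)² with target 1 for a liked item; the loss choice, learning rate, and starting vectors are assumptions for illustration, not the only way recommenders are trained.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgd_step(u, v, target=1.0, lr=0.1):
    """One gradient step on (target - u·v)² with respect to both u and v."""
    err = target - dot(u, v)
    # d/du of (target - u·v)² is -2 * err * v, so descending adds 2 * lr * err * v.
    u_new = [ui + lr * 2 * err * vi for ui, vi in zip(u, v)]
    v_new = [vi + lr * 2 * err * ui for ui, vi in zip(u, v)]
    return u_new, v_new


u = [0.1, 0.3]    # user embedding (assumed values)
v = [0.2, -0.1]   # item embedding (assumed values)

score_before = dot(u, v)
u_new, v_new = sgd_step(u, v)
score_after = dot(u_new, v_new)
print(score_before, score_after)
```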

Practical considerations #

Memory: embedding tables can dominate parameter count when |V| is large.

Common solutions:

OOV and rare items:

Fine-tuning vs freezing:

Connection forward: why this unlocks Attention Mechanisms #

Attention is fundamentally about similarity-weighted mixing. Similarity is computed by dot products between vectors.

Embeddings provide:

Once you have eₓ, attention can form q, k, v and compute relevance.

This is why “Vector Embeddings” is a prerequisite: without vectorization, there is no meaningful similarity computation to drive attention.

Next node: Attention Mechanisms.

Worked Examples (3) #

Example 1: Embedding lookup as matrix multiplication (one-hot × embedding matrix) #

Suppose a tiny vocabulary V = {A, B, C} with |V| = 3 and embedding dimension d = 2. Let the embedding matrix be

E = [

[1, 0],

[2, 1],

[0, 3]

] (shape 3×2)

Rows correspond to A, B, C in that order. We want the embedding for token B.

  1. Represent token B as a one-hot vector x ∈ ℝ³:

    x = [0, 1, 0]

  2. Compute the embedding as e = xE:

    e = [0, 1, 0] · [

    [1, 0],

    [2, 1],

    [0, 3]

    ]

  3. Multiply:

    e = 0·[1, 0] + 1·[2, 1] + 0·[0, 3]

    = [2, 1]

  4. So e_B = [2, 1].

Insight: An embedding lookup is algebraically “one-hot selection.” In backprop, only the selected row(s) receive gradients, which is why embedding tables train efficiently even for huge vocabularies.

Example 2: Relating cosine similarity and Euclidean distance for normalized embeddings #

Let a, b ∈ ℝ² be two unit vectors (‖a‖ = ‖b‖ = 1). Suppose a·b = 0.8. Compute cos(a, b) and ‖a − b‖², and interpret.

  1. Because both vectors are unit length, cosine similarity equals the dot product:

    cos(a, b) = (a·b) / (‖a‖‖b‖) = 0.8 / (1·1) = 0.8

  2. Compute squared distance using the expansion:

    ‖a − b‖²

    = ‖a‖² − 2(a·b) + ‖b‖²

    = 1 − 2(0.8) + 1

  3. Finish:

    ‖a − b‖² = 2 − 1.6 = 0.4

  4. Optionally compute distance:

    ‖a − b‖ = √0.4 ≈ 0.632

Insight: For unit-normalized embeddings, high cosine similarity implies small Euclidean distance (and vice versa). This is why many retrieval systems normalize embeddings: it makes geometry consistent and simplifies ranking.

Example 3: A tiny attention-like similarity score from embeddings #

Suppose you have three token embeddings in ℝ²:

e₁ = [1, 0]

e₂ = [1, 1]

e₃ = [0, 1]

Treat token 2 as a “query” and compute raw dot-product scores sⱼ = e₂ · eⱼ for j ∈ {1,2,3}.

  1. Compute s₁ = e₂ · e₁:

    s₁ = [1, 1] · [1, 0] = 1·1 + 1·0 = 1

  2. Compute s₂ = e₂ · e₂:

    s₂ = [1, 1] · [1, 1] = 1 + 1 = 2

  3. Compute s₃ = e₂ · e₃:

    s₃ = [1, 1] · [0, 1] = 0 + 1 = 1

  4. Interpretation: token 2 is most similar to itself (score 2) and equally similar to tokens 1 and 3 (score 1).

Insight: Attention scoring is built on dot products of vectors. Even before adding projections (W_Q, W_K), the embedding geometry already determines which tokens are “compatible.”

Key Takeaways #

Common Mistakes #

Practice #

easy

You have normalized embeddings (‖a‖ = ‖b‖ = 1). If cos(a, b) = 0.3, compute ‖a − b‖² and ‖a − b‖.

Hint: Use ‖a − b‖² = 2 − 2 cos(a, b) for unit vectors.

Solution:

‖a − b‖² = 2 − 2(0.3) = 2 − 0.6 = 1.4.

‖a − b‖ = √1.4 ≈ 1.183.

medium

Let |V| = 50,000 and d = 768. Approximately how many parameters are in the embedding table? If stored as float32 (4 bytes), about how much memory does it take?

Hint: Parameters = |V|·d. Memory = parameters × 4 bytes. Convert to MB or GB.

Solution:

Parameters = 50,000 × 768 = 38,400,000.

Memory ≈ 38.4 million × 4 bytes = 153.6 million bytes.

In MB: 153.6e6 / (1024²) ≈ 146.5 MB (about 150 MB).
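The same arithmetic as a small helper, in case you want to try other vocabulary sizes or precisions. The function name is made up for this sketch; it counts only the table itself and uses 1024² bytes per megabyte, as in the solution above.

```python
def embedding_table_size(vocab_size, dim, bytes_per_param=4):
    """Return (parameter count, bytes, megabytes at 1024² bytes each)."""
    params = vocab_size * dim
    n_bytes = params * bytes_per_param
    return params, n_bytes, n_bytes / 1024**2


params, n_bytes, megabytes = embedding_table_size(50_000, 768)
print(params, n_bytes, round(megabytes, 1))
```

Passing `bytes_per_param=2` gives the half-precision footprint for the same table.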

medium

Suppose E ∈ ℝ^{4×3} is an embedding matrix for tokens {0,1,2,3}. If a training batch contains tokens [1, 1, 3], which rows of E receive gradient updates during backprop through an embedding lookup? Explain briefly.

Hint: Only looked-up rows are involved in the forward computation; repeated tokens accumulate gradients on the same row.

Solution:

Rows 1 and 3 receive gradient updates. Row 1 appears twice in the batch, so its gradient contributions accumulate (sum) for that row. Rows 0 and 2 receive no update from this batch because they were not looked up.
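A sketch of the backward pass for this batch. The upstream gradients (one vector per looked-up position) are hypothetical values standing in for whatever the rest of the network would send back; the point is only which rows of the gradient end up nonzero.

```python
def embedding_backward(num_rows, dim, token_ids, upstream):
    """Scatter-add upstream gradients into a gradient table shaped like E."""
    grad_E = [[0.0] * dim for _ in range(num_rows)]
    for tok, g in zip(token_ids, upstream):
        for j in range(dim):
            grad_E[tok][j] += g[j]  # repeated tokens accumulate on the same row
    return grad_E


batch = [1, 1, 3]
upstream = [            # hypothetical gradients, one per batch position
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 2.0],
]

grad_E = embedding_backward(4, 3, batch, upstream)
touched = [i for i, row in enumerate(grad_E) if any(x != 0.0 for x in row)]
print(touched, grad_E[1])
```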

Connections #

Next: Attention Mechanisms

Related nodes you may want nearby in the tech tree:

Quality: A (4.5/5)
