Token Embeddings


Machine Learning · Difficulty: ★★★☆☆ · Depth: 0 · Unlocks: 2

Representation of discrete tokens (words, subwords, or other symbols) as continuous vectors used as input to neural models; includes learned embeddings and embedding lookup/initialization. Understanding embeddings covers dimensionality, lookup tables, and basic properties like semantic similarity in vector space.

Core Concepts #

Key Symbols & Notation #

E (embedding matrix), d (embedding dimensionality)

Essential Relationships #

Unlocks (2) #

Sequence Masking (causal and padding masks), lvl 4; Transformers, lvl 5

Teaching Strategy #

Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.

Neural networks don’t naturally understand discrete symbols like “cat”, “##ing”, or token ID 50256. Token embeddings are the bridge: they turn a token into a continuous vector v ∈ ℝᵈ that a model can compute with.

TL;DR:

A token embedding is a learned vector representation for each token in a vocabulary. All embeddings live in an embedding matrix E ∈ ℝ^(V×d). Given a token ID i, the model retrieves the i-th row E[i] (an embedding lookup) to produce eᵢ ∈ ℝᵈ, which becomes the input to downstream layers (e.g., attention/Transformer blocks). Embeddings are parameters: they’re initialized, updated by backprop, and their dimensionality d controls capacity and compute.

What Is a Token Embedding? #

Why we need embeddings (motivation before formulas) #

Neural networks are built to process numbers: vectors, matrices, and tensors. But language (like many other discrete domains) starts as symbols, such as the word “cat”, the subword piece “##ing”, or a token ID like 50256.

A token by itself has no natural numeric geometry. Token ID 7 isn’t “closer” to token ID 8 than to token ID 9000, yet a network fed raw IDs as numbers would treat them that way.

So we introduce a representation that:

  1. Is numeric (so the model can compute),

  2. Has a geometry (so “similar” tokens can end up close),

  3. Is learnable (so it adapts to the training data).

That representation is the token embedding.

Definition #

Let the vocabulary size be V (number of distinct tokens) and let the embedding dimensionality be d.

We store an embedding matrix:

E ∈ ℝ^(V×d)

Each token i (an integer ID in {0, …, V−1}) is assigned the i-th row of E:

eᵢ = E[i] ∈ ℝᵈ

This is a lookup table: you don’t compute E[i] by multiplying all of E by something dense; you directly retrieve a row.

Intuition: embeddings as “coordinates” #

You can think of embeddings as placing each token at a point in a d-dimensional space. During training, the model moves these points around to make its predictions better.

This doesn’t guarantee a perfect semantic map, but it gives the model a flexible continuous “surface” on which to build meaning.

What embeddings are not #

Where embeddings appear in a model #

In a typical language model pipeline:

  1. Text → tokenizer → token IDs (integers)

  2. Token IDs → embedding lookup in E → vectors e₁, e₂, …

  3. Vectors → Transformer (or other neural network) → predictions

So token embeddings are often the first learned layer of a modern NLP model.

Core Mechanic 1: Embedding Lookup via the Matrix E #

Why lookup matters #

A token ID is discrete. The model must map it to a continuous vector e ∈ ℝᵈ efficiently.

If V is large (e.g., 50k, 200k, 1M), you cannot afford to treat the input as a dense V-dimensional vector at every step. The embedding matrix lets you store one d-dimensional row per token and retrieve only the rows a given input needs.

One-hot view (conceptual bridge) #

Conceptually, you can represent token i as a one-hot vector xᵢ ∈ {0,1}^V with a 1 at position i.

Then embedding lookup can be written as a matrix product:

eᵢ = xᵢᵀ E

Let’s check shapes: xᵢᵀ is 1×V and E is V×d, so xᵢᵀ E is 1×d, a single row vector in ℝᵈ.

This multiplication selects exactly one row of E, because all entries of xᵢ are 0 except at i.

Why implementations don’t multiply #

Even though xᵢᵀ E is a nice equation, real implementations do an index operation, eᵢ = E[i], because multiplying by a V-length one-hot vector would waste memory and compute.
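The equivalence between the one-hot product and direct indexing can be checked numerically. The following is a minimal NumPy sketch; the sizes, the random matrix, and the variable names are illustrative assumptions, not part of the lesson.

```python
import numpy as np

# Toy sizes (assumed for illustration): V = 5 tokens, d = 3 dimensions.
V, d = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))    # embedding matrix, one row per token

i = 2                          # token ID

# Conceptual view: one-hot row vector times E (touches all V rows).
x = np.zeros(V)
x[i] = 1.0
e_matmul = x @ E               # shape (d,)

# Practical view: index the row directly (touches one row).
e_lookup = E[i]                # shape (d,)

print(np.allclose(e_matmul, e_lookup))  # True
```

The product reads every row of E to produce one of them, while indexing reads only the row it returns.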

Batching and sequences #

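In practice you have a batch of sequences, stored as a matrix of token IDs:

X ∈ {0, …, V−1}^(B×T)

where B is the batch size and T is the sequence length. Embedding lookup produces:

H ∈ ℝ^(B×T×d)

So each position t in each sequence b gets a vector:

H[b, t] = E[X[b, t]] ∈ ℝᵈ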
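A batched lookup of this kind can be sketched with NumPy integer-array indexing. The sizes and token IDs below are made up for illustration; deep learning frameworks provide their own embedding layers that perform the same gather.

```python
import numpy as np

V, d = 6, 4          # toy vocabulary size and embedding dimension
B, T = 2, 3          # batch size and sequence length
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))

# Token IDs for a batch of sequences, shape (B, T).
X = np.array([[1, 4, 4],
              [0, 5, 2]])

# Indexing E with an integer array gathers one row of E per ID.
H = E[X]             # shape (B, T, d)

print(H.shape)                         # (2, 3, 4)
print(np.array_equal(H[1, 2], E[2]))   # True, since X[1, 2] == 2
```

Repeated IDs (token 4 in the first sequence) simply read the same row twice.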

A careful note about “row vectors” vs “column vectors” #

Different sources use different conventions. You might see embeddings as rows E[i] or columns. The key invariant is that each token ID maps to exactly one vector in ℝᵈ.

In this lesson we treat E as having V rows, where row i is eᵢ ∈ ℝᵈ.

Parameter count and memory #

Embeddings can dominate parameter count.

Example: V = 50,000 and d = 768.

Parameters = 50,000 · 768 = 38,400,000

That’s 38.4M parameters just for token embeddings.
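The same arithmetic can be wrapped in two small helpers. The function names and the 4-bytes-per-parameter default (32-bit floats) are assumptions chosen for this sketch.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """Number of entries in a V x d embedding matrix."""
    return vocab_size * dim


def embedding_megabytes(vocab_size: int, dim: int, bytes_per_param: int = 4) -> float:
    """Approximate storage in MB (10**6 bytes), ignoring any overhead."""
    return embedding_params(vocab_size, dim) * bytes_per_param / 1e6


params = embedding_params(50_000, 768)
print(params)                            # 38400000
print(embedding_megabytes(50_000, 768))  # 153.6
```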

Basic similarity geometry #

Once tokens are vectors, you can measure similarity.

Two common measures:

  1. Dot product: a·b = Σₖ aₖ bₖ

  2. Cosine similarity: cos(a, b) = (a·b) / (‖a‖‖b‖)

Cosine similarity compares direction, not magnitude. Many embedding analyses use cosine similarity to focus on relational structure.
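Both measures take only a few lines of NumPy. In this sketch the two vectors are invented so that they point in the same direction but differ in length, which shows how the dot product and cosine similarity disagree about magnitude.

```python
import numpy as np


def dot(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the length

print(dot(a, b))                       # 10.0
print(np.isclose(cosine(a, b), 1.0))   # True
```

Scaling b changes the dot product but leaves the cosine at 1.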

Why similarity emerges at all #

Embeddings are optimized to help the model predict correct outputs. If the model benefits from treating two tokens similarly (because they appear in similar contexts), gradient descent tends to push their embeddings in similar directions.

This is not magic; it’s just shared pressure from the loss function across many training examples.

Core Mechanic 2: Embeddings Are Learned Parameters (Initialization and Updates) #

Why embeddings must be learnable #

You could assign each token a random vector and never change it. The model would then have to learn everything in later layers, with no ability to shape the input representation.

Learnable embeddings let the model:

Formally, E is part of the model parameters θ.

Initialization #

Common initialization strategies:

A typical goal is to keep activations at reasonable scale early in training.

If σ is too large:

If σ is too small:

How the embedding row gets updated (the key idea) #

Suppose in one training step you see token i at some position. The forward pass retrieves eᵢ = E[i]. The loss L depends on that vector through downstream computations.

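Backprop computes the gradient with respect to the retrieved vector:

∂L/∂eᵢ ∈ ℝᵈ

But importantly, only the rows of E that were looked up in this step receive a gradient; every other row gets zero gradient.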

This matches the lookup behavior: you only “touch” the embeddings you used.
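The row-sparse update can be sketched by scatter-adding per-position gradients into a gradient for E. The upstream gradient values here are hypothetical, and `np.add.at` is used so that repeated token IDs accumulate rather than overwrite each other.

```python
import numpy as np

V, d = 5, 2
E = np.zeros((V, d))               # start from zeros so changes are easy to see
ids = np.array([3, 1, 1])          # token IDs seen this step (ID 1 appears twice)
grad_e = np.array([[1.0, 0.0],     # hypothetical dL/de, one row per
                   [0.0, 2.0],     # looked-up position
                   [0.0, 3.0]])

# Scatter-add the per-position gradients into the rows of E they came from.
grad_E = np.zeros_like(E)
np.add.at(grad_E, ids, grad_e)

lr = 0.1
E -= lr * grad_E                   # plain gradient descent step

print(grad_E[1])                          # [0. 5.]
print(np.nonzero(grad_E.any(axis=1))[0])  # [1 3]
```

Rows 0, 2, and 4 were never looked up, so they are unchanged after the step.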

A small derivation: gradient for a simple downstream linear layer #

Consider a toy model:

  1. Lookup: eᵢ = E[i]

  2. Linear: z = Weᵢ + b

  3. Loss L depends on z

Let’s compute ∂L/∂eᵢ.

We use chain rule:

∂L/∂eᵢ = (∂z/∂eᵢ)ᵀ (∂L/∂z)

But z = Weᵢ + b, so:

∂z/∂eᵢ = W

Thus:

∂L/∂eᵢ = Wᵀ (∂L/∂z)

And since eᵢ is the i-th row of E, the gradient for E[i] is exactly ∂L/∂eᵢ.

Finally, a gradient descent update (learning rate η):

E[i] ← E[i] − η (∂L/∂eᵢ)

Every time token i appears, its vector is nudged to reduce loss.
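One way to sanity-check the result ∂L/∂eᵢ = Wᵀ (∂L/∂z) is to compare it with a finite-difference estimate. The sketch below assumes a squared-error loss against a random target, which is a choice made here purely for illustration; any differentiable loss of z would do.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 4                       # embedding size and output size (toy values)
W = rng.normal(size=(m, d))
b = rng.normal(size=m)
e = rng.normal(size=d)            # stands in for the looked-up row E[i]
target = rng.normal(size=m)       # assumed regression target


def loss(e_vec: np.ndarray) -> float:
    z = W @ e_vec + b
    return 0.5 * float(np.sum((z - target) ** 2))


# Analytic gradient via the chain rule.
z = W @ e + b
dL_dz = z - target                # gradient of the assumed loss w.r.t. z
dL_de = W.T @ dL_dz

# Numerical gradient by central differences.
eps = 1e-6
num = np.zeros(d)
for k in range(d):
    step = np.zeros(d)
    step[k] = eps
    num[k] = (loss(e + step) - loss(e - step)) / (2 * eps)

print(np.allclose(dL_de, num, atol=1e-5))  # True

# One gradient descent step on this embedding row.
eta = 0.1
e_new = e - eta * dL_de
```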

Frequency effects #

Tokens that appear more often get updated more often.

This can be good (more data) but also can cause imbalance:

Some tokenization strategies (subwords) help reduce the number of truly rare tokens by composing words out of more frequent pieces.
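To see how uneven the update counts can be, one can count how often each ID occurs in a batch. The batch below is hypothetical; `np.bincount` returns the number of gradient contributions each row of E would receive.

```python
import numpy as np

V = 6
# A hypothetical flattened batch of token IDs.
ids = np.array([0, 0, 0, 0, 1, 1, 2, 5])

# How many times each row of E would receive a gradient contribution.
counts = np.bincount(ids, minlength=V)

print(counts)                        # [4 2 1 0 0 1]
print(np.flatnonzero(counts == 0))   # [3 4]
```

Tokens 3 and 4 receive no update at all from this batch.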

Embedding tying (brief but important) #

In many language models, the input embedding matrix E is tied (shared) with the output projection matrix used to predict token logits.

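If output logits are computed from the final hidden state h ∈ ℝᵈ as:

logits = E h ∈ ℝ^V

then E serves two roles: its rows are the input vectors retrieved by lookup, and the same rows are the output weights that score h against every token in the vocabulary.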

This reduces parameters and often improves performance, but it couples constraints: the same geometry must serve both input and output.
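A minimal sketch of tying, assuming logits are formed as the product of E with a hidden state h ∈ ℝᵈ (one common formulation). The hidden state here is a stand-in copied from an embedding row; in a real model it would come from the network's final layer.

```python
import numpy as np

V, d = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))     # a single matrix used for both roles

# Input role: look up a token's vector.
token_id = 2
e = E[token_id]                 # shape (d,)

# Output role: score a hidden state against every row of the same matrix.
h = e.copy()                    # stand-in hidden state (assumption for the demo)
logits = E @ h                  # shape (V,), logits[j] = E[j] · h

print(logits.shape)                          # (5,)
print(np.isclose(logits[token_id], e @ e))   # True
```

No second V×d matrix is allocated, which is where the parameter saving comes from.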

Pretrained embeddings vs learned from scratch #

You can initialize E using pretrained vectors (word2vec/GloVe) or from a pretrained transformer.

Comparing options:

| Approach | Pros | Cons | When used |
| --- | --- | --- | --- |
| Train E from scratch | Simple; fully task-adapted | Needs lots of data; slow to learn semantics | New domains, sufficient data |
| Initialize from pretrained | Faster convergence; better semantics early | May mismatch tokenizer/vocab; can bake in biases | Many NLP tasks |
| Freeze pretrained E | Stable; fewer trainable params | Limits adaptation; can hurt performance | Low-data or constrained training |

Even when using pretrained embeddings, fine-tuning (updating E) is common.

Embedding Dimensionality d: Capacity, Generalization, and Cost #

Why d matters #

The embedding dimensionality d is the length of each token vector eᵢ ∈ ℝᵈ.

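Choosing d is a trade-off between capacity (how much each vector can encode), generalization (whether the data can support that capacity), and cost (memory and compute).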

Capacity intuition #

With larger d, each token has more degrees of freedom.

But “bigger d” isn’t automatically better. If you don’t have enough data or model structure to use it, the extra dimensions can become noisy.

Memory and compute scaling #

Parameters in E scale linearly in d:

params(E) = V · d

The embedding activations for a batch scale as:

B · T · d

So increasing d increases both parameter memory and the size of tensors passed through attention/MLP blocks.
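The linear scaling of activation size in d is easy to tabulate. The helper name and the batch shape (8 sequences of 512 tokens) are assumptions for this sketch.

```python
def activation_elements(batch: int, seq_len: int, dim: int) -> int:
    """Number of scalars in the (B, T, d) tensor produced by embedding lookup."""
    return batch * seq_len * dim


print(activation_elements(8, 512, 768))    # 3145728
print(activation_elements(8, 512, 1536))   # 6291456
```

Doubling d doubles the tensor that every later block has to read and write.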

Matching d to model hidden size #

In transformers, token embeddings are usually produced in the same dimension as the model’s hidden size (often called d_model).

That way, you can add other vectors (like positional encodings) and pass embeddings directly into attention blocks without extra projections.

A geometric view: dot products and norms #

Many downstream operations depend on dot products.

For random vectors with independent components, dot products tend to grow with d unless normalized. This is one reason initialization and normalization layers matter.

If a, b have independent zero-mean components with typical scale σ, then the expected magnitudes are:

‖a‖ ≈ σ√d and |a·b| ≈ σ²√d

So as d grows, norms grow unless σ shrinks. Good initialization tries to keep these scales stable.
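This scaling can be observed empirically. The sketch below assumes independent zero-mean Gaussian components with standard deviation σ and compares the average norm and the spread of dot products against σ√d and σ²√d; the sample count and the dimensions tried are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
n_samples = 1000
ratios = {}

for d in (16, 256, 1024):
    a = rng.normal(scale=sigma, size=(n_samples, d))
    b = rng.normal(scale=sigma, size=(n_samples, d))

    mean_norm = np.linalg.norm(a, axis=1).mean()   # typical vector length
    dot_std = np.sum(a * b, axis=1).std()          # typical dot-product size

    # Divide by the predicted scales; both ratios should be close to 1.
    ratios[d] = (mean_norm / (sigma * np.sqrt(d)),
                 dot_std / (sigma ** 2 * np.sqrt(d)))
    print(d, ratios[d])
```

If σ were instead chosen proportional to 1/√d, the norms would stay roughly constant as d grows.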

Practical heuristics #

Common rules of thumb (not laws):

The key is: d is a design knob controlling representational bandwidth at the input.

Application/Connection: Embeddings as the Input Layer to Transformers (and What Comes Next) #

Where embeddings sit in the Transformer pipeline #

A simplified Transformer input step:

  1. Token IDs X ∈ {0,…,V−1}^(B×T)

  2. Token embeddings: H_tok[b,t] = E[X[b,t]]

  3. Add positional information: H₀ = H_tok + P

  4. Pass through stacked attention + MLP blocks

The crucial point: attention doesn’t operate on token IDs; it operates on vectors.

Positional information and why it’s separate #

Token embeddings alone do not encode order. The sequences “dog bites man” and “man bites dog” produce the same multiset of embeddings.

Transformers typically add a positional encoding or embedding P ∈ ℝ^(T×d), either fixed or learned, so each position t has its own vector pₜ.

Then:

H₀[b, t] = E[X[b, t]] + pₜ

This simple addition works because both are in ℝᵈ.
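The addition of token and position vectors can be sketched with NumPy broadcasting. Here P is a random table standing in for position embeddings (an assumption; it could equally be a fixed encoding), and the token IDs are invented so that the two sequences contain the same tokens in a different order.

```python
import numpy as np

V, d = 10, 4
B, T = 2, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))      # token embedding matrix
P = rng.normal(size=(T, d))      # one vector per position (stand-in values)

X = np.array([[7, 1, 3],
              [3, 1, 7]])        # same tokens, different order

H_tok = E[X]                     # (B, T, d)
H0 = H_tok + P                   # P broadcasts over the batch axis

print(H0.shape)                                # (2, 3, 4)
print(np.allclose(H_tok[0, 0], H_tok[1, 2]))   # True: token 7 looks the same anywhere
print(np.allclose(H0[0, 0], H0[1, 2]))         # False: position 0 vs position 2
```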

Masking connection (why you’ll need it soon) #

Once you have embeddings and pass them into attention, the model computes attention scores between positions.

But you often must prevent attention to padding positions (padding masks) and to future positions (causal masks).

Masking operates on attention score matrices, not on E directly—but embeddings are what make attention possible in the first place.
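As a preview, here is a rough sketch of how both kinds of mask might be built from token IDs and applied to a score tensor. The padding ID of 0, the all-zero scores, and the variable names are hypothetical; real models differ in how they represent and apply masks.

```python
import numpy as np

PAD_ID = 0   # hypothetical padding token ID
X = np.array([[5, 3, 9, PAD_ID],
              [4, 8, PAD_ID, PAD_ID]])   # token IDs, shape (B, T)
B, T = X.shape

# Causal mask: query position t may attend to key positions <= t.
causal = np.tril(np.ones((T, T), dtype=bool))         # (T, T)

# Padding mask: no query may attend to a padded key position.
not_pad = X != PAD_ID                                  # (B, T)

# Combine: broadcast to (B, T_query, T_key).
allowed = causal[None, :, :] & not_pad[:, None, :]

scores = np.zeros((B, T, T))                           # stand-in attention scores
masked_scores = np.where(allowed, scores, -np.inf)     # blocked entries get -inf

print(allowed[0].astype(int))
```

Note that the masks are derived from the IDs in X and applied to scores; the embedding matrix E is never modified.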

Embeddings beyond language #

The same idea applies broadly:

Whenever you have a discrete symbol set, an embedding matrix is a standard first tool.

What to watch for when you implement #

A minimal checklist:

These decisions will directly affect training stability and correctness.

Worked Examples (3) #

Example 1: Embedding lookup as row selection (and the one-hot equivalence) #

Let V = 5 and d = 3. Suppose the embedding matrix is

E =

[ [ 1, 0, 2],

[ 0, 1, 0],

[-1, 1, 1],

[ 2, 2, 2],

[ 0, -1, 3] ]

Token ID i = 2 (0-indexed). Compute e₂ via lookup and via one-hot multiplication.

  1. Lookup definition:

    e₂ = E[2]

    So e₂ = [−1, 1, 1].

  2. Construct one-hot x₂ ∈ ℝ^5:

    x₂ = [0, 0, 1, 0, 0].

  3. Compute x₂ᵀ E:

    x₂ᵀ E = 0·E[0] + 0·E[1] + 1·E[2] + 0·E[3] + 0·E[4]

    = E[2]

    = [−1, 1, 1].

  4. Conclusion:

    The matrix formula eᵢ = xᵢᵀ E is exactly row selection in disguise.

Insight: Thinking in one-hot form explains the math, but thinking in lookup form explains the efficiency: only one row is needed, so gradients and updates also stay sparse over rows.

Example 2: One gradient update to an embedding row #

Toy model: z = Weᵢ (ignore bias). Let d = 2 and W =

[ [2, 0],

[0, 1] ]

Assume token i appears, with current embedding eᵢ = [1, −1]. Suppose backprop gives ∂L/∂z = [3, 4]. Compute ∂L/∂eᵢ and do one gradient step with η = 0.1.

  1. We have z = Weᵢ. By chain rule:

    ∂L/∂eᵢ = Wᵀ (∂L/∂z).

  2. Compute Wᵀ. Here W is diagonal, hence symmetric, so Wᵀ = W:

    Wᵀ =

    [ [2, 0],

    [0, 1] ].

  3. Multiply:

    ∂L/∂eᵢ = Wᵀ [3, 4]ᵀ

    = [ 2·3 + 0·4,

    0·3 + 1·4 ]

    = [6, 4].

  4. Gradient descent update:

    eᵢ ← eᵢ − η (∂L/∂eᵢ)

    = [1, −1] − 0.1·[6, 4]

    = [1 − 0.6, −1 − 0.4]

    = [0.4, −1.4].

  5. Interpretation:

    Only the embedding for token i is updated by this example; other token rows E[j] for j ≠ i are unchanged (for this single-token toy batch).

Insight: Embedding training is ordinary parameter learning; the only special feature is sparsity over vocabulary rows: you update the rows you looked up.

Example 3: Measuring semantic similarity with cosine similarity #

Suppose you have two token embeddings:

a = [2, 0, 1]

b = [1, 1, 0]

Compute dot product and cosine similarity.

  1. Dot product:

    a·b = 2·1 + 0·1 + 1·0 = 2.

  2. Norms:

    ‖a‖ = √(2² + 0² + 1²) = √(4 + 0 + 1) = √5.

    ‖b‖ = √(1² + 1² + 0²) = √2.

  3. Cosine similarity:

    cos(a, b) = (a·b) / (‖a‖‖b‖)

    = 2 / (√5 · √2)

    = 2 / √10

    ≈ 0.632.

Insight: Cosine similarity normalizes away vector length, which is useful because embedding norms can vary for reasons unrelated to meaning (frequency, training dynamics, regularization).

Key Takeaways #

Common Mistakes #

Practice #

easy

You have vocabulary size V = 10,000 and embedding dimensionality d = 256.

  1. How many parameters are in E?

  2. If parameters are stored as 32-bit floats, about how many megabytes does E take (ignore overhead)?

Hint: params = V·d. Memory ≈ params · 4 bytes. 1 MB ≈ 10⁶ bytes (roughly).

Solution:

  1. params(E) = 10,000 · 256 = 2,560,000.

  2. Memory ≈ 2,560,000 · 4 = 10,240,000 bytes ≈ 10.24 MB (about 10 MB).

medium

Let E ∈ ℝ^(4×2) be

E =

[ [ 0, 1],

[ 2, 0],

[−1, 3],

[ 4, −2] ]

A sequence of token IDs is [3, 1, 1, 0]. Write down the corresponding embedding vectors in order, and identify which rows of E are reused.

Hint: Lookup means E[i] is the i-th row. Reuse happens when the same ID appears multiple times.

Solution:

Embeddings:

ID 3 → E[3] = [4, −2]

ID 1 → E[1] = [2, 0]

ID 1 → E[1] = [2, 0]

ID 0 → E[0] = [0, 1]

Row reuse: row 1 is reused (appears twice).

hard

Suppose you are training a model and you notice that very rare tokens have poorly learned embeddings.

Give two strategies (modeling or preprocessing) that can help, and briefly explain why each helps.

Hint: Think about how often a token gets gradient updates and how tokenization affects frequency. Also consider parameter sharing or regularization.

Solution:

Two helpful strategies:

  1. Use subword tokenization (BPE/WordPiece): rare words are decomposed into more frequent pieces, so the model learns embeddings for pieces with more updates, improving generalization to rare words.

  2. Tie embeddings or use pretrained initialization: tying input/output embeddings shares statistical strength; pretrained embeddings (or starting from a pretrained LM) give rare tokens a better starting position in vector space, reducing the amount of task data needed to shape them.

(Other valid ideas include increasing data, using adaptive/hashed embeddings, or regularizing/averaging embeddings for low-frequency tokens.)
