Sequence-to-Sequence Modeling

Sequence-to-Sequence Modeling #

Machine Learning | Difficulty: ★★★★☆ | Depth: 1 | Unlocks: 2

The paradigm of mapping input sequences to output sequences (e.g., translation or summarization), including encoder-decoder architectures and alignment concepts; attention mechanisms are often introduced to improve information flow between encoder and decoder. Familiarity with seq2seq setups clarifies why and how attention is applied.

Core Concepts #

Key Symbols & Notation #

$H = (h_1,\ldots,h_S)$: sequence of encoder hidden states (one vector per source position).

$\alpha_t = (\alpha_{t,1},\ldots,\alpha_{t,S})$: attention weight vector at decoder step $t$ (softmax-normalized).

Essential Relationships #

Prerequisites (1) #

Softmax Function (6 atoms)

Unlocks (1) #

Attention Mechanisms (lvl 5)


Teaching Strategy #

Self-serve tutorial - low prerequisites, straightforward concepts.

Mini-scenario (we’ll keep using it):

You’re building a tiny English→French translator for a five-word sentence.

Source (English, length S=5):

  1. I
  2. eat
  3. green
  4. apples
  5. today

Target (French, length T=5):

  1. Je
  2. mange
  3. des
  4. pommes
  5. vertes

If you translate word-by-word, you’ll get stuck: in French, “green” (vertes) often comes after “apples” (pommes). So when the decoder is producing the last word “vertes”, it must “look back” to the English word “green” at position 3—even though it already produced “pommes” at position 4.

This lesson is about the modeling paradigm that makes that possible: sequence-to-sequence (seq2seq) modeling. We’ll build up from the probability objective P(y₁..y_T | x₁..x_S), then see why plain encoder→decoder has an information bottleneck, and how alignment/attention uses αₜ to let the decoder read from the encoder states H=(h₁,…,h_S) at every output step.

Quick schematic (we’ll refer back to it):

  x₁    x₂    x₃    x₄    x₅
  v     v     v     v     v
 [h₁]  [h₂]  [h₃]  [h₄]  [h₅]   = H (encoder states)
         \     |     /
          \    |    /
      αₜ (softmax over 1..S)
               v
          cₜ (context)
               v
     decoder state sₜ → yₜ

TL;DR:

Seq2seq models learn a conditional distribution over output sequences given an input sequence: P(y₁..y_T | x₁..x_S). An encoder maps the source tokens to hidden states H=(h₁,…,h_S). A decoder generates tokens autoregressively, using previously generated tokens and (optionally) a context vector. Attention/alignment computes a per-step weight vector αₜ over encoder positions, forming a context cₜ=∑ᵢ αₜ,ᵢ hᵢ so the decoder can dynamically “read” the right parts of the input while generating each output token.

What Is Sequence-to-Sequence (Seq2seq) Modeling? #

Why this paradigm exists #

Many real problems are not “predict a single label” but “produce a whole sequence whose length may differ from the input.” Examples include translation, summarization, speech recognition, and image captioning.

The defining feature is variable-length in, variable-length out, and output tokens depend on each other.

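In our running example, the input is the five-token English sentence (S=5) and the output is the five-token French sentence (T=5).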

The goal is to model the conditional probability of the entire output sequence given the input sequence:

$$P(y_1,\ldots,y_T \mid x_1,\ldots,x_S).$$

Autoregressive factorization (the key probabilistic move) #

We don’t predict the whole sequence at once. We factorize it using the chain rule:

$$P(y_1,\ldots,y_T \mid x_{1:S}) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x_{1:S}).$$

Here $y_{<t}$ means $(y_1,\ldots,y_{t-1})$.
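To make the factorization concrete, here is a minimal sketch in plain Python. The per-step probabilities are made-up numbers for illustration, and `sequence_log_prob` is a hypothetical helper name, not part of any library. It shows that summing per-step log-probabilities is the same as taking the log of the product.

```python
import math

def sequence_log_prob(step_probs):
    """Return log P(y_1..y_T | x) from the per-step conditionals P(y_t | y_<t, x)."""
    # The chain rule turns the product over steps into a sum of logs.
    return sum(math.log(p) for p in step_probs)

# Hypothetical conditionals for a 3-token target (illustrative values only).
step_probs = [0.9, 0.8, 0.5]

log_p = sequence_log_prob(step_probs)
seq_prob = math.exp(log_p)  # same as 0.9 * 0.8 * 0.5
print(round(seq_prob, 4))
```

Working in log space is the usual choice in practice because products of many small probabilities underflow quickly.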

This single equation captures the seq2seq contract:

Encoder–decoder: turning the contract into an architecture #

To implement this, we split responsibilities:

  1. Encoder reads the input sequence and produces internal representations.
  2. Decoder generates the output sequence one token at a time.

Historically, the encoder and decoder were RNNs (LSTM/GRU). Today, they are often Transformers—but the encoder/decoder roles persist.

A classic mental model:

The bottleneck that motivates attention #

Early seq2seq used a single vector (often the final encoder hidden state) as a summary of the entire source. This works for very short sentences but degrades on longer ones: the decoder must squeeze everything about $x_{1:S}$ through a fixed-size bottleneck.

Our five-word example is short, but it already hints at a deeper need: when the decoder generates “vertes” (green), it should access information from source position 3. If all information is compressed into a single vector, the decoder has to remember precise token-level details across steps.

This is why alignment/attention matters: it provides a direct path from each output step back to the relevant encoder states.

What you should be able to say after this section #

Core Mechanic 1: Encoder–Decoder Modeling (Without Attention) #

Why understand the no-attention version first? #

Attention can feel like “extra machinery.” But it’s easier to appreciate once you’ve seen what the encoder–decoder is trying to do on its own—and where it struggles.

We’ll describe a standard formulation using hidden states and probability distributions. Even if you later use a Transformer, the ideas map cleanly:

Step 1: Represent tokens as vectors #

Tokens are discrete symbols. Models convert them into continuous vectors.

Step 2: Encoder produces hidden states #

Let the encoder produce hidden states hᵢ (one per source position):

In an RNN encoder, a typical recurrence is:

$$h_i = f_{enc}(h_{i-1}, e(x_i)).$$

Intuition:

Even in a Transformer encoder, you still end up with one vector per token position; you can still call them $h_i$.

Step 3: The simplest bottleneck: a single context vector #

A classic early approach defines a single vector c summarizing the source, for example the last hidden state:

$$c = h_S.$$

(There are other choices: pooling over $H$, or concatenating last states of a bidirectional RNN. But the core idea is “fixed-size summary.”)

Step 4: Decoder state and token generation #

The decoder maintains a hidden state sₜ and emits a distribution over the next token:

  1. Update decoder state:

$$s_t = f_{dec}(s_{t-1}, e(y_{t-1}), c).$$

  2. Produce logits and probabilities:

$$o_t = W_o s_t + b_o$$

$$P(y_t \mid y_{<t}, x_{1:S}) = \text{softmax}(o_t).$$

The softmax is your prerequisite: it turns logits into a probability distribution over the vocabulary.
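As a rough sketch of the output step, the snippet below applies a linear layer to a decoder state and then a softmax. The matrix `W_o`, bias `b_o`, state `s_t`, and the 3-token vocabulary size are all toy values assumed for illustration; a real model learns these parameters and uses a much larger vocabulary.

```python
import math

def softmax(logits):
    """Turn logits into a probability distribution (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(W_o, b_o, s_t):
    """Compute softmax(W_o s_t + b_o) with plain lists."""
    logits = [sum(w * s for w, s in zip(row, s_t)) + b
              for row, b in zip(W_o, b_o)]
    return softmax(logits)

# Toy parameters: 3 vocabulary entries, 2-dimensional decoder state.
W_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_o = [0.0, 0.0, 0.0]
s_t = [0.5, -0.5]

probs = output_distribution(W_o, b_o, s_t)  # logits are [0.5, -0.5, 0.0]
print([round(p, 4) for p in probs])
```

The first vocabulary entry has the largest logit, so it also receives the largest probability.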

Training objective: teacher forcing + cross-entropy #

During training, we usually feed the true previous token rather than the model’s sampled token (teacher forcing). The loss is negative log-likelihood:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t^{*} \mid y_{<t}^{*}, x_{1:S}).$$

Here $y_t^{*}$ is the ground-truth token at step $t$.
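A minimal sketch of teacher forcing, assuming start and end markers named `<s>` and `</s>` (these names, the helper functions, and the probabilities are illustrative assumptions). The decoder input at each step is the true previous token, and the loss sums the negative log-probability of each true next token.

```python
import math

def teacher_forcing_pairs(target, bos="<s>", eos="</s>"):
    """Pair each decoder input (true previous token) with the token to predict."""
    decoder_inputs = [bos] + target
    decoder_targets = target + [eos]
    return list(zip(decoder_inputs, decoder_targets))

def nll(gold_probs):
    """Negative log-likelihood given the probability assigned to each true token."""
    return -sum(math.log(p) for p in gold_probs)

pairs = teacher_forcing_pairs(["Je", "mange"])
print(pairs)

# Hypothetical probabilities the model gave to "Je", "mange", "</s>".
loss = nll([0.70, 0.20, 0.90])
print(round(loss, 4))
```

Note that the model's own predictions never appear on the input side here; that is exactly what distinguishes training with teacher forcing from inference.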

Inference: start/end tokens and decoding strategies #

At inference time:

Decoding choices:

| Strategy | How it chooses $y_t$ | Pros | Cons |
|---|---|---|---|
| Greedy | $\arg\max$ over softmax | Fast | Can miss better global sequences |
| Beam search | Keep top-K partial sequences | Often better translations | Slower; can prefer generic outputs |
| Sampling | Sample from distribution | Diverse outputs | Can be unstable without controls |
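The greedy row can be sketched as a short loop. Everything here is a toy assumption: `TOY_TABLE` stands in for a trained model and conditions only on the last token, and `<s>` / `</s>` are assumed start and end markers.

```python
def greedy_decode(next_token_probs, bos="<s>", eos="</s>", max_len=10):
    """Repeatedly pick the most probable next token until the end marker."""
    prefix = [bos]
    while len(prefix) - 1 < max_len:
        probs = next_token_probs(prefix)   # dict: token -> probability
        token = max(probs, key=probs.get)  # arg max over the distribution
        if token == eos:
            break
        prefix.append(token)
    return prefix[1:]  # drop the start marker

# Toy stand-in for a model: the distribution depends only on the last token.
TOY_TABLE = {
    "<s>":   {"Je": 0.7, "mange": 0.2, "</s>": 0.1},
    "Je":    {"mange": 0.8, "Je": 0.1, "</s>": 0.1},
    "mange": {"</s>": 0.6, "Je": 0.2, "mange": 0.2},
}

def toy_model(prefix):
    return TOY_TABLE[prefix[-1]]

output = greedy_decode(toy_model)
print(output)
```

Beam search generalizes this loop by keeping the top-K prefixes at each step instead of a single one; sampling replaces the arg max with a random draw from `probs`.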

Why the bottleneck is a real problem #

If $c$ is fixed, then every output token must be generated from the same compressed summary.

Imagine generating “vertes” at the end. The decoder must:

With a single $c$, the model can learn this sometimes, but it scales poorly with longer sequences and complex reorderings.

A “breathing room” intuition #

Think of the encoder summary vector $c$ as trying to be a whole paragraph’s worth of meaning squeezed into one sticky note.

You can write a good sticky note.

But if the decoder could instead flip through the original paragraph whenever needed, it would make fewer mistakes.

That “flipping through” is exactly what attention adds.

What you should be able to do after this section #

Core Mechanic 2: Attention/Alignment (Dynamic Reading with αₜ and H) #

Why attention: the decoder needs target-step-specific context #

When producing $y_t$, the decoder doesn’t need the entire source equally.

So instead of a single fixed $c$, we want a different context vector $c_t$ at each decoding step.

The objects (anchored to the node’s symbols) #

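You are given:

  • $H = (h_1,\ldots,h_S)$: the encoder hidden states, one vector per source position.
  • $\alpha_t = (\alpha_{t,1},\ldots,\alpha_{t,S})$: the attention weight vector at decoder step $t$.

Constraints (from the softmax normalization):

  • $\alpha_{t,i} \ge 0$ for every $i$, and $\sum_{i=1}^{S} \alpha_{t,i} = 1$.

Interpretation:

  • $\alpha_{t,i}$ is how strongly decoder step $t$ reads from source position $i$.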

Step 1: Compute attention scores (alignment scores) #

We first compute an unnormalized score for each source position $i$.

A common pattern:

$$e_{t,i} = \text{score}(s_{t-1}, h_i).$$

Here $s_{t-1}$ is the previous decoder state and $h_i$ is the encoder state at source position $i$.

Typical scoring functions (you’ll see these in literature):

| Name | Score function | Notes |
|---|---|---|
| Dot product | $e_{t,i} = s_{t-1}^\top h_i$ | Simple; requires same dimension |
| General | $e_{t,i} = s_{t-1}^\top W h_i$ | Learnable linear map |
| Additive (Bahdanau) | $e_{t,i} = v^\top \tanh(W_s s_{t-1} + W_h h_i)$ | Works well with different dims |

You don’t need to memorize them all right now—the pattern is what matters: compare the decoder’s needs to each encoder state.
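For readers who prefer code to formulas, here is a sketch of the three scoring functions using plain Python lists. The parameter values (`W`, `W_s`, `W_h`, `v`) are identity matrices and ones chosen only so the numbers are easy to check; in a trained model they are learned.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def score_dot(s_prev, h_i):
    """e = s^T h (requires matching dimensions)."""
    return dot(s_prev, h_i)

def score_general(s_prev, h_i, W):
    """e = s^T W h."""
    return dot(s_prev, matvec(W, h_i))

def score_additive(s_prev, h_i, W_s, W_h, v):
    """e = v^T tanh(W_s s + W_h h)."""
    pre = [a + b for a, b in zip(matvec(W_s, s_prev), matvec(W_h, h_i))]
    return dot(v, [math.tanh(z) for z in pre])

# Toy inputs (illustrative values only).
s_prev, h_i = [1.0, 0.0], [1.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]

e_dot = score_dot(s_prev, h_i)
e_general = score_general(s_prev, h_i, identity)
e_additive = score_additive(s_prev, h_i, identity, identity, [1.0, 1.0])
print(e_dot, e_general, round(e_additive, 4))
```

With an identity `W`, the general score reduces to the dot product, which is a handy sanity check when implementing it.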

Step 2: Softmax into attention weights αₜ #

Convert scores into a distribution over positions:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{S} \exp(e_{t,j})}.$$

This is exactly where your softmax prerequisite shows up.

Numerical stability reminder (important in real implementations):

$$\alpha_{t,i} = \frac{\exp(e_{t,i} - m)}{\sum_{j} \exp(e_{t,j} - m)},\quad m = \max_j e_{t,j}.$$
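The following sketch shows why the max shift matters. The scores are deliberately extreme illustrative values: the naive version overflows in floating point, while the shifted version gives the same weights as for scores `[0, 1, 2]`, since softmax is unchanged by adding a constant to every score.

```python
import math

def naive_softmax(scores):
    exps = [math.exp(e) for e in scores]  # overflows for large scores
    z = sum(exps)
    return [x / z for x in exps]

def stable_softmax(scores):
    m = max(scores)                        # subtract the max before exponentiating
    exps = [math.exp(e - m) for e in scores]
    z = sum(exps)
    return [x / z for x in exps]

big_scores = [1000.0, 1001.0, 1002.0]

try:
    naive_softmax(big_scores)
    naive_ok = True
except OverflowError:
    naive_ok = False

alpha = stable_softmax(big_scores)
print(naive_ok, [round(a, 4) for a in alpha])
```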

Step 3: Build the context vector cₜ as a weighted sum of encoder states #

Now we compute:

$$c_t = \sum_{i=1}^{S} \alpha_{t,i}\, h_i.$$

Interpretation:

This makes attention feel like a soft pointer into the source.
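A small sketch of the "soft pointer" idea, with toy encoder states assumed for illustration. When all the weight sits on one position, the context is exactly that encoder state; when the weights are uniform, the context is the plain average.

```python
def context_vector(alpha, H):
    """Weighted sum of encoder states: c = sum_i alpha_i * h_i."""
    dim = len(H[0])
    return [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(dim)]

# Toy encoder states (illustrative values only).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

hard_pointer = context_vector([0.0, 0.0, 1.0], H)  # all weight on position 3
uniform = context_vector([1/3, 1/3, 1/3], H)       # equal weight everywhere
print(hard_pointer, [round(x, 4) for x in uniform])
```

Real attention weights fall between these two extremes, which is why the pointer is described as soft.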

Step 4: Use cₜ to update the decoder and produce yₜ #

A common formulation:

$$s_t = f_{dec}(s_{t-1}, e(y_{t-1}), c_t).$$

Then output distribution:

$$P(y_t \mid y_{<t}, x_{1:S}) = \text{softmax}(W_o s_t + b_o).$$

Sometimes the context is also fed directly into the output layer, e.g. by concatenating $[s_t; c_t]$.

Anchoring back to our 5-word example: what αₜ should look like #

Let’s label source positions:

1:I, 2:eat, 3:green, 4:apples, 5:today

A plausible alignment pattern:

So $\alpha_5$ might heavily weight position 3.

Inline “alignment table” diagram #

Below is a rough alignment matrix (rows are target steps t, columns are source positions i). Darker means larger α.

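| t / i | 1 (I) | 2 (eat) | 3 (green) | 4 (apples) | 5 (today) |
|---|---|---|---|---|---|
| 1 (Je) | ████ | | | | |
| 2 (mange) | | ████ | | | |
| 3 (des) | | | | ███ | |
| 4 (pommes) | | | | ████ | |
| 5 (vertes) | | | ████ | | |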

This is the alignment concept: for each output step, the model forms a distribution over input positions.

Why attention improves information flow (conceptually) #

Without attention, the only path from input token $x_i$ to the output decision at step $t$ is:

$x_i \to$ encoder computations $\to$ single summary $c \to$ decoder state $s_t \to y_t$

With attention:

$x_i \to h_i \to c_t \to s_t \to y_t$

Now each output step has a direct, learnable channel to any encoder state.

A careful note: attention weights are not always “true explanations” #

It’s tempting to say “αₜ tells you exactly which input word caused the output.” Often it correlates with alignment, but:

So treat attention as:

What you should be able to do after this section #

Application/Connection: How Seq2seq + Attention Shows Up in Practice (and Why It Unlocks Cross-Attention) #

From “alignment” to modern architectures #

The attention we described is often called encoder–decoder attention or cross-attention:

Even if you haven’t learned Transformer math yet, the conceptual mapping is straightforward:

This is exactly why understanding seq2seq clarifies attention: it tells you what problem attention is solving.

Real tasks and what changes #

Seq2seq setups vary mainly in:

  1. What counts as the “sequence” on the input side

  2. What output vocabulary/tokenization looks like

  3. How decoding is constrained

Examples:

| Task | Input sequence x | Output sequence y | Special concerns |
|---|---|---|---|
| Translation | tokens | tokens | reordering, morphology |
| Summarization | long tokens | shorter tokens | content selection, hallucination |
| Speech recognition | audio frames | tokens | long inputs, monotonic alignment |
| Image captioning | region features | tokens | attention over image regions |

The encoder can be anything that outputs a sequence of vectors $H$:

As long as you have $H$, the decoder can attend over it.

Decoding details that matter in applications #

Even with a good model, the way you decode changes behavior:

A typical beam objective may look like:

$$\arg\max_{y} \frac{1}{|y|^\gamma} \sum_{t=1}^{|y|} \log P(y_t \mid y_{<t}, x)$$

where $\gamma$ controls the strength of the length penalty.
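To see what the exponent does, here is a sketch that scores two hypothetical hypotheses under the length-normalized objective. The per-step probabilities and the function name are assumptions made for illustration.

```python
import math

def length_normalized_score(step_probs, gamma=1.0):
    """(1 / |y|^gamma) * sum_t log P(y_t | y_<t, x)."""
    total_log_prob = sum(math.log(p) for p in step_probs)
    return total_log_prob / (len(step_probs) ** gamma)

# Two hypothetical candidates: a short one and a longer one with
# higher per-token confidence (illustrative values only).
short_hyp = [0.5, 0.5]
long_hyp = [0.6, 0.6, 0.6, 0.6]

for gamma in (0.0, 1.0):
    print(gamma,
          round(length_normalized_score(short_hyp, gamma), 4),
          round(length_normalized_score(long_hyp, gamma), 4))
```

With `gamma = 0` the score is the raw log-probability, which favors the shorter candidate simply because it has fewer factors. With `gamma = 1` the score becomes a per-token average, and the longer candidate wins.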

Exposure bias (a practical training vs inference gap) #

Training uses teacher forcing: the decoder sees the true $y_{t-1}^{*}$.

Inference uses its own previous prediction $\hat{y}_{t-1}$.

So the model may drift if it makes an early mistake. This issue is called exposure bias.

Mitigations you may hear about:

You don’t need to solve exposure bias now, but you should recognize it as a recurring theme in seq2seq.

Why attention is a stepping stone to Transformers #

In Transformers:

So the conceptual flow becomes:

  1. Encode source into $H$

  2. For each decoding step t, compute attention distribution $\alpha_t$ over $H$

  3. Use the resulting context to predict $y_t$

If you understand that, the jump to “multi-head attention” is mostly an engineering/generalization step: do it several times in parallel with different learned projections.

Returning one last time to the mini-scenario #

Our translator succeeded not because it memorized the entire input in one vector, but because at each step it can ask:

When producing “vertes”, it can attend back to “green” even though that occurred earlier and has already influenced other outputs.

That dynamic reading behavior is the essence of seq2seq alignment.

What you should be able to do after this section #

Worked Examples (3) #

Compute one attention step (αₜ and cₜ) from given scores and encoder states #

Suppose the encoder produced S=5 hidden states h₁..h₅ (each 2D for simplicity):

h₁ = [1, 0]

h₂ = [0, 1]

h₃ = [1, 1]

h₄ = [2, 0]

h₅ = [0, 2]

At decoder step t=5 (trying to produce “vertes”), assume the alignment scores (unnormalized) are:

e₅ = [e₅,₁..e₅,₅] = [-1, 0, 2, 0, -2].

Compute α₅ via softmax and then compute the context vector c₅ = ∑ᵢ α₅,ᵢ hᵢ.

  1. Compute stabilized softmax.

    Let m = max(e₅) = 2.

    Compute shifted scores:

    e' = e₅ - m = [-3, -2, 0, -2, -4].

  2. Exponentiate the shifted scores:

    exp(e') = [exp(-3), exp(-2), exp(0), exp(-2), exp(-4)].

    Numerically (approx):

    exp(-3)≈0.0498

    exp(-2)≈0.1353

    exp(0)=1

    exp(-4)≈0.0183

    So:

    exp(e') ≈ [0.0498, 0.1353, 1, 0.1353, 0.0183].

  3. Normalize to get α₅.

    Sum Z = 0.0498 + 0.1353 + 1 + 0.1353 + 0.0183 = 1.3387 (approx).

    Therefore:

    α₅,₁ ≈ 0.0498 / 1.3387 = 0.0372

    α₅,₂ ≈ 0.1353 / 1.3387 = 0.1011

    α₅,₃ ≈ 1 / 1.3387 = 0.7470

    α₅,₄ ≈ 0.1353 / 1.3387 = 0.1011

    α₅,₅ ≈ 0.0183 / 1.3387 = 0.0137

    Check: they sum to ~1.0001 (rounding error).

  4. Compute context c₅ = ∑ᵢ α₅,ᵢ hᵢ.

    Write it out:

    c₅ = α₅,₁ h₁ + α₅,₂ h₂ + α₅,₃ h₃ + α₅,₄ h₄ + α₅,₅ h₅

    = 0.0372[1,0] + 0.1011[0,1] + 0.7470[1,1] + 0.1011[2,0] + 0.0137[0,2].

  5. Add component-wise.

    x-component:

    = 0.0372(1) + 0.1011(0) + 0.7470(1) + 0.1011(2) + 0.0137(0)

    = 0.0372 + 0 + 0.7470 + 0.2022 + 0

    = 0.9864.

    y-component:

    = 0.0372(0) + 0.1011(1) + 0.7470(1) + 0.1011(0) + 0.0137(2)

    = 0 + 0.1011 + 0.7470 + 0 + 0.0274

    = 0.8755.

    So c₅ ≈ [0.986, 0.875] (the y-component is 0.8754 when computed without intermediate rounding).

Insight: Because e₅,₃ is much larger than the rest, α₅ puts ~75% of its mass on source position 3. The context vector c₅ becomes close to h₃ (with smaller contributions from other positions). This is the “soft pointer” idea in numbers: attention is a learned weighted average over encoder states.
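The hand calculation above can be checked with a few lines of plain Python. The helper names are illustrative; the scores and encoder states are the ones given in this example. Small differences in the last digit come from rounding the intermediate values by hand.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    z = sum(exps)
    return [x / z for x in exps]

def context_vector(alpha, H):
    dim = len(H[0])
    return [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(dim)]

# Encoder states and alignment scores from the worked example.
H = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]]
e5 = [-1, 0, 2, 0, -2]

alpha5 = softmax(e5)
c5 = context_vector(alpha5, H)

print([round(a, 4) for a in alpha5])
print([round(c, 4) for c in c5])
```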

From seq2seq probability to training loss (a concrete negative log-likelihood calculation) #

Assume a tiny vocabulary of 4 tokens for the decoder: {Je, mange, pommes, vertes}. For one training example, the target sequence is y = [Je, mange].

Suppose the model outputs these probabilities:

At t=1 (predicting y₁):

P(y₁=Je | x) = 0.70

At t=2 (predicting y₂ with teacher forcing on y₁=Je):

P(y₂=mange | y₁=Je, x) = 0.20

Compute the negative log-likelihood loss for this example (natural log).

  1. Write the seq2seq factorization for this short target:

    P(y₁,y₂ | x) = P(y₁|x) · P(y₂|y₁,x).

  2. Plug in the given probabilities:

    P(y|x) = 0.70 · 0.20 = 0.14.

  3. Negative log-likelihood (NLL) is:

    L = -log P(y|x) = -log(0.14).

  4. Compute using log rules:

    -log(0.14) = -(log(14) - log(100))

    = log(100) - log(14).

    Numerically:

    log(100)=4.6052

    log(14)=2.6391

    So L ≈ 4.6052 - 2.6391 = 1.9661.

  5. Equivalently, sum token-level losses:

    L = -log 0.70 + -log 0.20

    ≈ 0.3567 + 1.6094

    = 1.9661.

Insight: Training loss decomposes across time steps: you can see exactly which step is hurting you. Here, the second token is much less probable (0.20), dominating the loss. This per-step view is also how gradients flow back through the decoder and (with attention) into the encoder states that were used to build cₜ.

Why fixed-size context can fail: a toy “compression” argument with two different sources #

Consider two different English inputs that share the same last word:

A: [I, eat, green, apples, today]

B: [I, eat, red, apples, today]

Suppose a no-attention encoder summarizes the entire source as c = h_S (the final state). The decoder must produce the correct last French adjective: vertes (green) vs rouges (red).

Explain, at a mechanistic level, why relying only on c makes this harder than using attention over H=(h₁..h₅).

  1. In the no-attention setup, every decision y_t depends on the same fixed context c.

    Formally, s_t = f_dec(s_{t-1}, e(y_{t-1}), c).

  2. The difference between inputs A and B occurs at position 3 (green vs red).

    But by the time the encoder reaches position 5, the final state h_S must contain:

    • the subject and verb information (I/eat)
    • the object (apples)
    • the time adverb (today)
    • and the color attribute (green vs red)

    all in one vector of fixed dimension d_h.

  3. During decoding, when generating the final adjective (vertes/rouges), the model must extract from c the specific attribute that was present at input position 3, potentially many steps earlier in the encoder computation.

  4. With attention, the decoder at the adjective step can compute αₜ that peaks at i=3.

    Then c_t = ∑ᵢ αₜ,ᵢ hᵢ is dominated by h₃, which is directly tied to the color token’s representation.

  5. So the decision boundary between “vertes” and “rouges” can depend more directly on h₃ rather than on whatever compressed trace of “green vs red” survived into h_S.

Insight: Attention reduces the need for perfect long-range compression. Instead of hoping the final encoder state retains every detail, the decoder can retrieve the relevant detail from the specific encoder state where it was encoded. This is especially important when the decisive information is not near the end of the source, or when the output needs to revisit earlier source tokens later in decoding.

Key Takeaways #

Common Mistakes #

Practice #

easy

You have S=3 encoder states h₁,h₂,h₃ (vectors). At some decoder step t, the attention scores are eₜ = [0, 0, 0]. What are the attention weights αₜ? What is cₜ in terms of h₁,h₂,h₃?

Hint: Softmax of equal numbers is uniform. Then cₜ is the average of the vectors.

Solution:

If eₜ = [0,0,0], then softmax gives αₜ = [1/3, 1/3, 1/3].

So:

$$c_t = \sum_{i=1}^{3} \alpha_{t,i} h_i = \tfrac{1}{3}h_1 + \tfrac{1}{3}h_2 + \tfrac{1}{3}h_3.$$

This is the simple mean of the encoder states.

medium

Show the chain-rule factorization for P(y₁,y₂,y₃ | x₁..x_S). Then write the negative log-likelihood loss for a single training pair (x, y) using teacher forcing.

Hint: Use ∏ over t for the probability and ∑ over t for the loss; each term conditions on y_<t and x.

Solution:

Factorization:

$$P(y_1,y_2,y_3 \mid x_{1:S}) = P(y_1 \mid x_{1:S})\,P(y_2 \mid y_1, x_{1:S})\,P(y_3 \mid y_{1:2}, x_{1:S}).$$

Teacher-forcing NLL loss for one pair (x,y) of length T=3:

$$\mathcal{L} = -\sum_{t=1}^{3} \log P(y_t^{*} \mid y_{<t}^{*}, x_{1:S}).$$

medium

At decoder step t, you have attention weights αₜ = [0.1, 0.2, 0.7] over three encoder states h₁=[1,0], h₂=[0,1], h₃=[2,2]. Compute cₜ.

Hint: Compute cₜ = 0.1h₁ + 0.2h₂ + 0.7h₃ component-wise.

Solution:

Compute the weighted sum:

cₜ = 0.1[1,0] + 0.2[0,1] + 0.7[2,2]

First component:

= 0.1·1 + 0.2·0 + 0.7·2 = 0.1 + 0 + 1.4 = 1.5

Second component:

= 0.1·0 + 0.2·1 + 0.7·2 = 0 + 0.2 + 1.4 = 1.6

So cₜ = [1.5, 1.6].

Connections #

Unlocks and next steps:

Related conceptual neighbors you may want in a tech tree:
