Softmax Function


Softmax Function #

Probability & Statistics · Difficulty: ★★★☆☆ · Depth: 0 · Unlocks: 4

A function that converts a vector of real values into a probability distribution by exponentiating and normalizing each entry; commonly used to produce attention weights. Understanding softmax behavior, numerical stability, and temperature scaling is important for interpreting attention scores.


Core Concepts #

Key Symbols & Notation #

- softmax(x)_i: the i-th output of softmax on vector x
- T: temperature scalar

Essential Relationships #

Unlocks (3) #

- Attention Mechanisms (lvl 5)
- Sequence-to-Sequence Modeling (lvl 4)
- Sequence Masking (causal and padding masks) (lvl 4)

Advanced Learning Details

Graph Position #

- Depth Cost: 6
- Fan-Out (ROI): 4
- Bottleneck Score: 1
- Chain Length: 0

Cognitive Load #

- Atomic Elements: 6
- Total Elements: 42
- Percentile Level: L3
- Atomic Level: L4

All Concepts (15) #

Teaching Strategy #

Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.

Whenever a model needs to turn “scores” into “choices”, it needs a bridge from arbitrary real numbers to probabilities. Softmax is that bridge: it takes a vector of real-valued logits and returns a probability distribution—smoothly, differentiably, and with behavior you can control (via shifting for stability and temperature for sharpness).

TL;DR:

Softmax maps a vector x ∈ ℝⁿ to probabilities by exponentiating and normalizing: softmax(x)ᵢ = exp(xᵢ)/∑ⱼ exp(xⱼ). It’s shift-invariant (adding a constant to all logits changes nothing), so we can subtract max(x) for numerical stability. Temperature scaling softmax(x/T) controls how peaked the distribution is: low T → more confident/peaked; high T → flatter/more uniform.

What Is the Softmax Function? #

Why we need it (motivation) #

In many ML systems, we compute scores for several options: which class is present, which token to attend to, which action to take. Those scores often live in ℝ: they can be negative, huge, and not constrained to sum to 1.

But downstream we often want a probability distribution:

Softmax is the standard way to convert a vector of real-valued scores (“logits”) into a probability distribution.

Definition #

Let x = (x₁, x₂, …, xₙ) be a vector of real numbers (logits).

The softmax function returns a vector p = softmax(x) where each component is

$$\operatorname{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}.$$

Intuition: “exponentiate then normalize” #

Softmax does two simple things:

  1. Exponentiate each logit: $x_i \mapsto e^{x_i}$
  2. Normalize by the sum: divide by $\sum_j e^{x_j}$

So softmax turns relative score gaps into relative probability mass.
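The two steps above can be sketched directly in plain Python. This is a minimal illustration rather than a production routine: the function name `softmax` and the example logits are chosen here for demonstration, and this naive form is not yet protected against overflow (the max trick discussed later in this lesson addresses that).

```python
import math

def softmax(logits):
    """Naive softmax: exponentiate each logit, then divide by the sum."""
    exps = [math.exp(x) for x in logits]   # step 1: exponentiate
    total = sum(exps)                       # normalizing constant
    return [e / total for e in exps]        # step 2: normalize

probs = softmax([2.0, 1.0, 0.0])
print([round(p, 3) for p in probs])  # [0.665, 0.245, 0.09]
```

Only the gaps between the logits determine how the probability mass is split; the largest logit receives the largest share.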

A quick sanity check: it’s a probability distribution #

Each output is positive (exponentials are always positive), and summing over all i gives:

$$\sum_{i=1}^n \operatorname{softmax}(\mathbf{x})_i = \sum_{i=1}^n \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}} = \frac{\sum_{i=1}^n e^{x_i}}{\sum_{j=1}^n e^{x_j}} = 1.$$

So softmax(x) lies on the probability simplex (the set of all probability vectors).

Terminology you’ll see #

Softmax is simple to write down, but its behavior (and pitfalls) matter a lot in real models—especially numerical stability and temperature scaling, which we’ll build up next.

Core Mechanic 1: Behavior of Exponentiate-and-Normalize #

Why exponentials? #

Exponentials have two key effects:

  1. Positivity: $e^{x_i}$ is always positive.
  2. Multiplicative amplification: differences in logits turn into ratios.

A crucial identity is the ratio form:

$$\frac{\operatorname{softmax}(\mathbf{x})_i}{\operatorname{softmax}(\mathbf{x})_k} = \frac{e^{x_i}}{e^{x_k}} = e^{x_i - x_k}.$$

This says softmax compares logits via their differences. If $x_i$ exceeds $x_k$ by Δ, then i gets $e^{\Delta}$ times more probability than k.
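As a quick numerical check of the ratio identity, the sketch below compares the ratio of two softmax outputs with the exponential of the logit difference. The helper `softmax` and the example vector are illustrative choices, not part of the original lesson.

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = [2.0, 1.0, 0.0]
p = softmax(x)

ratio = p[0] / p[2]                # ratio of two output probabilities
expected = math.exp(x[0] - x[2])   # e^(x_0 - x_2) = e^2, about 7.39

print(ratio, expected)  # the two values agree up to floating-point rounding
```

The normalizing sum cancels in the ratio, which is why only the difference of the two logits matters.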

Two-class case: softmax becomes sigmoid #

If n = 2 with logits (a, b), then

$$p_1 = \frac{e^a}{e^a + e^b} = \frac{1}{1 + e^{b-a}}.$$

That’s exactly a sigmoid in the logit difference (a − b). This is a nice mental model: softmax is the multi-class generalization of the sigmoid.

Invariance to units? Not quite. #

Softmax is not invariant to scaling of logits. If you multiply logits by a constant c, softmax typically becomes more or less peaked (we’ll formalize this with temperature later).

Peakedness: how “winner-take-most” emerges #

Consider three logits: x = (2, 1, 0).

Compute exponentials: e² ≈ 7.389, e¹ ≈ 2.718, e⁰ = 1.

Sum ≈ 11.11

So probabilities ≈ (0.665, 0.245, 0.090).

A gap of 1 between logits becomes a factor of e ≈ 2.72 in weight; a gap of 2 becomes e² ≈ 7.39. This is why softmax can produce strong preferences even from modest logit gaps.

Geometric view: softmax outputs live on the simplex #

For n = 3, the output probabilities (p₁, p₂, p₃) satisfy p₁ + p₂ + p₃ = 1 and each pᵢ ≥ 0. That set is a 2D triangle (a simplex) embedded in 3D.

Here’s an ASCII simplex diagram to orient you:

          p3=1
           ▲
          / \
         /   \
        /  •  \   • interior points: all p_i in (0,1)
       /       \
      /         \
     /___________\
 p1=1             p2=1

Softmax maps any logits vector x to some point inside this triangle.

Visualization: temperature effect on a 2-option softmax curve #

For two options, softmax probability of option 1 depends on the logit difference d = x₁ − x₂:

$$p_1(d;T) = \frac{1}{1 + e^{-d/T}}.$$

Below is an inline diagram showing how changing T changes the curve. The horizontal axis is d, vertical is p₁.

p1
1.0 |                         ............  T=0.5 (sharper)
    |                    .....
0.8 |               .....
    |           ....
0.6 |        ...                    _________  T=1 (baseline)
    |     ...                 _____
0.5 |-----+-------------------+----------------------------- d
    |     ...             ____
0.4 |        ...      ____                 - - - - - - - -  T=2 (flatter)
    |           ....__
0.2 |               .....
    |                    .....
0.0 |                         ............
      -6   -4   -2    0    2    4    6

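Interpretation: a smaller T makes the curve steeper around d = 0, so small logit differences already produce confident choices; a larger T flattens the curve, so the same difference yields probabilities closer to 0.5. At d = 0 every curve passes through p₁ = 0.5.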

We’ll connect this to attention weights: low temperature makes attention concentrate on a few tokens; high temperature spreads it out.

Practical note: softmax is often applied row-wise #

In attention, you’ll see softmax applied to a vector of scores for a given query over all keys. If you have a matrix of scores, softmax is applied per row (or per last dimension), producing a distribution over positions for each query.
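A row-wise application can be sketched as follows, assuming the score matrix is stored as a list of rows (one row per query). The names `softmax_rows` and `scores` are hypothetical, and the naive per-row softmax is used here only because the example values are small.

```python
import math

def softmax(row):
    # Naive form; adequate for the small scores in this example.
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_rows(scores):
    """Apply softmax independently to each row of a 2D list."""
    return [softmax(row) for row in scores]

scores = [
    [2.0, 1.0, 0.0],   # scores of query 0 over three keys
    [0.0, 0.0, 0.0],   # scores of query 1 over the same keys
]
weights = softmax_rows(scores)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is its own probability distribution, so rows sum to 1 individually while columns generally do not.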

This first mechanic—exponentiate then normalize—gives the core behavior. Next we’ll cover the crucial property that makes softmax usable in real systems: shift invariance and numerical stability.

Core Mechanic 2: Shift-Invariance and Numerical Stability (the max trick) #

Why this matters #

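Exponentials can overflow or underflow:

- Overflow: for large positive x, $e^x$ exceeds the largest representable float and becomes ∞ (roughly x > 88.7 in float32, x > 709.8 in float64).
- Underflow: for large negative x, $e^x$ rounds to 0, which can leave a denominator of 0.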

Yet logits in neural nets can easily reach magnitudes where naive exp() is unsafe. So we need a stable way to compute softmax.

Key property: shift-invariance #

Softmax is unchanged if you add the same constant c to every logit:

$$\operatorname{softmax}(\mathbf{x} + c\mathbf{1}) = \operatorname{softmax}(\mathbf{x}).$$

Derivation (showing work):

Let $y_i = x_i + c$.

$$\operatorname{softmax}(\mathbf{y})_i = \frac{e^{y_i}}{\sum_j e^{y_j}}
= \frac{e^{x_i + c}}{\sum_j e^{x_j + c}}
= \frac{e^c e^{x_i}}{e^c \sum_j e^{x_j}}
= \frac{e^{x_i}}{\sum_j e^{x_j}}
= \operatorname{softmax}(\mathbf{x})_i.$$

So adding a constant doesn’t change the output probabilities.
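The derivation can be checked numerically with a small sketch. The shift value `c = 10.0` and the example logits are arbitrary illustrative choices.

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = [2.0, 1.0, 0.0]
c = 10.0
shifted = [xi + c for xi in x]   # x + c * 1

p = softmax(x)
q = softmax(shifted)

# The outputs agree up to floating-point rounding.
same = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(p, q))
print(same)  # True
```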

The numerical-stability trick: subtract max #

Because of shift-invariance, we can choose c conveniently. The most common choice is

$$c = -\max_i x_i.$$

Define $m = \max_i x_i$ and $z_i = x_i - m$.

Then $\max_i z_i = 0$, so every $z_i \le 0$.

Now compute softmax using z:

$$\operatorname{softmax}(\mathbf{x})_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}.$$

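This is stable because:

- every exponent $z_i \le 0$, so each $e^{z_i}$ lies in (0, 1] and cannot overflow;
- the largest term is $e^0 = 1$, so the denominator is at least 1 and cannot underflow to 0.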

Simple example: stability without changing meaning #

Suppose x = (1000, 1001, 999).

Naively, e^{1001} overflows.

Use the max trick: m = 1001, so the shifted logits are z = (−1, 0, −2).

Exponentials: e⁻¹ ≈ 0.3679, e⁰ = 1, e⁻² ≈ 0.1353.

Sum ≈ 1.5032

Probabilities ≈ (0.2447, 0.6652, 0.0900)

These are perfectly reasonable—no overflow.

Visualization: shifting logits moves nothing on the simplex #

Shifting logits by a constant slides x along the direction 1 = (1,1,1,…). Softmax “forgets” that direction completely.

For n=3, imagine two different logit vectors:

They map to the exact same point (p₁, p₂, p₃) on the simplex triangle.

Here’s a conceptual diagram combining both ideas—shift vs. scale:

Simplex (n=3 probabilities)

          (0,0,1)
             ▲
            / \
           /   \
          /  A  \        A = softmax(x)
         /       \       softmax(x + 10·1) = A  (shift: unchanged)
        /    •    \      softmax(x / T) moves toward vertex or center (scale)
       /___________\
 (1,0,0)           (0,1,0)

- Shift logits: stay at the same point A.
- Scale logits (or change T): slide along a path toward a vertex (peaked) or toward center (uniform).

Implementation note (what you should do in code) #

Always compute softmax as:

  1. $m = \max_i x_i$

  2. $z_i = x_i - m$

  3. $p_i = \exp(z_i) / \sum_j \exp(z_j)$

This gives identical results in exact math, and far better results in floating-point.
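The three steps can be sketched in plain Python and compared against the naive form on the lesson's example x = (1000, 1001, 999). The function names are illustrative. One assumption worth flagging: Python's `math.exp` raises `OverflowError` on overflow, whereas array libraries typically return ∞ and then produce NaN.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(x) for x in logits]   # may raise OverflowError
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    m = max(logits)                               # step 1: find the max
    exps = [math.exp(x - m) for x in logits]      # step 2: shift, then exponentiate
    total = sum(exps)                             # >= 1, the max entry contributes exp(0)
    return [e / total for e in exps]              # step 3: normalize

x = [1000.0, 1001.0, 999.0]

try:
    naive_softmax(x)
    naive_ok = True
except OverflowError:
    naive_ok = False

probs = stable_softmax(x)
print(naive_ok)                        # False
print([round(p, 4) for p in probs])    # [0.2447, 0.6652, 0.09]
```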

Often you want log probabilities (e.g., for cross-entropy). Use:

$$\log \operatorname{softmax}(\mathbf{x})_i = x_i - \log\left(\sum_j e^{x_j}\right).$$

Stably, compute it with the same shift $m = \max_j x_j$:

$$\log \operatorname{softmax}(\mathbf{x})_i = (x_i - m) - \log\left(\sum_j e^{x_j - m}\right).$$

Even if you don’t implement it now, it’s important conceptually: stability is not optional when exponentials are involved.
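A stable log-softmax can be sketched by factoring the maximum out of the log-sum-exp, so that no exponential exceeds 1. The function name `log_softmax` is an illustrative choice here and is not tied to any particular library's API.

```python
import math

def log_softmax(logits):
    """Log-probabilities computed without exponentiating large values."""
    m = max(logits)
    # log(sum_j e^{x_j}) = m + log(sum_j e^{x_j - m})
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

logp = log_softmax([1000.0, 1001.0, 999.0])
print([round(v, 4) for v in logp])

# Exponentiating the log-probabilities recovers a valid distribution.
print(sum(math.exp(v) for v in logp))
```

Working in log space also avoids taking the log of a probability that has underflowed to 0.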

Next we’ll look at temperature scaling, which is like a controlled scaling of logits that changes the softness/hardness of the distribution.

Core Mechanic 3: Temperature Scaling (Controlling Sharpness) #

Why introduce temperature? #

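Sometimes you want probabilities that are:

- sharper (more confident, closer to a one-hot choice), or
- softer (more spread out, closer to uniform).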

Temperature scaling gives a single knob T > 0 that controls this.

Definition #

Given logits x, temperature-scaled softmax is

$$\operatorname{softmax}_T(\mathbf{x})_i = \frac{e^{x_i/T}}{\sum_j e^{x_j/T}}.$$

Equivalent viewpoint: dividing by T is like multiplying logits by $\alpha = 1/T$.
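A temperature-scaled softmax can be sketched by dividing the logits by T before applying the usual max-shifted computation. The function name `softmax_with_temperature` and its `temperature` parameter are hypothetical names used for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]   # same as multiplying by 1/T
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]     # max-shifted for stability
    total = sum(exps)
    return [e / total for e in exps]

x = [2.0, 1.0, 0.0]
results = {}
for T in (2.0, 1.0, 0.5):
    results[T] = [round(p, 3) for p in softmax_with_temperature(x, T)]
    print(T, results[T])
```

Lower T pushes more mass onto the largest logit, while the ordering of the probabilities stays the same for every T > 0.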

Limiting behavior (important intuition) #

Let p(T) = softmax(x/T).

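  1. As T → 0⁺: the distribution concentrates on the largest logit, approaching a one-hot vector at the argmax (when the maximum is unique).
  2. As T → ∞: all scaled logits x/T approach 0, so the distribution approaches uniform (1/n for each entry).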

You can see this via differences: ratios are

$$\frac{p_i}{p_k} = e^{(x_i - x_k)/T}.$$

Temperature in attention #

In dot-product attention, scores often look like

$$s_i = \frac{\mathbf{q} \cdot \mathbf{k}_i}{\sqrt{d}}.$$

Then attention weights are

$$a_i = \operatorname{softmax}(\mathbf{s})_i.$$

The $1/\sqrt{d}$ factor plays a temperature-like role: it prevents dot products from growing too large with dimension d (which would make softmax too peaked too early).

Visual: how T moves you on the simplex (n=3) #

Take logits x = (2, 1, 0). Consider three temperatures.

Compute probabilities:

- T = 2: (0.506, 0.307, 0.186)
- T = 1: (0.665, 0.245, 0.090)
- T = 0.5: (0.867, 0.117, 0.016)

On the simplex triangle, these three points lie along a path from the center-ish region toward the vertex (1,0,0) as T decreases.

Calibration note (probabilities vs confidence) #

Temperature scaling is also used for calibration: you can adjust T (often on a validation set) so predicted probabilities better match empirical accuracy.

This is a big reason softmax is interpreted carefully: the raw logits contain information beyond just the top class.

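At this point you know:

- how exponentiate-and-normalize turns logit gaps into probability ratios;
- why shift-invariance lets you subtract max(x) for numerical stability;
- how temperature T controls how peaked the distribution is.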

Next we connect it directly to attention mechanisms, masking, and how to interpret attention scores.

Application/Connection: Softmax in Attention, Masking, and Interpretation #

Softmax as “attention allocator” #

In attention, you compute a score for each key/value relative to a query. These scores are logits s.

Softmax turns them into weights a that sum to 1:

$$a_i = \operatorname{softmax}(\mathbf{s})_i.$$

Then the attention output is a weighted sum:

$$\text{Attn}(\mathbf{q}) = \sum_i a_i \mathbf{v}_i.$$

So softmax is the mechanism that converts similarities into a convex combination of values.
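For a single query, the whole pipeline (scaled dot-product scores, softmax weights, weighted sum of values) can be sketched in plain Python. The function name `attend` and the toy vectors are assumptions made for illustration; real implementations operate on batched tensors.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query attention: returns (weights, output)."""
    d = len(query)
    # s_i = (q . k_i) / sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Output is the convex combination sum_i a_i * v_i
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

Because the weights are non-negative and sum to 1, each output coordinate stays between the smallest and largest corresponding value coordinate.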

How to read attention weights #

Because $a_i \ge 0$ and $\sum_i a_i = 1$:

But interpret carefully:

Masking: forcing probabilities to ignore some positions #

In sequence models you often must prevent attending to:

- future positions (causal mask);
- padding tokens (padding mask).

The standard technique: add a large negative number (−∞ in math; a big negative constant in practice) to masked logits before softmax.

Let mask mᵢ be 0 for allowed, and −∞ for disallowed. Define

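$$s'_i = s_i + m_i.$$

Then $e^{s'_i} = 0$ for every disallowed position, so those positions get weight exactly 0 and the allowed positions share all of the probability mass.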

This works because softmax only cares about exponentials; setting a logit to −∞ removes it from the sum.

Numerical detail: choose a safe “−∞” #

In floating point, you use something like −1e9 (float32) or a framework-provided mask fill value.
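Additive masking can be sketched as below, using Python's `float("-inf")` as the mask value. The function name `masked_softmax` and the boolean `allowed` list are hypothetical; the explicit error for a fully masked row is a design choice of this sketch, not something the lesson prescribes.

```python
import math

NEG_INF = float("-inf")

def masked_softmax(scores, allowed):
    """allowed[i] is True when position i may receive attention."""
    mask = [0.0 if ok else NEG_INF for ok in allowed]   # m_i: 0 or -inf
    masked = [s + mi for s, mi in zip(scores, mask)]    # s'_i = s_i + m_i
    m = max(masked)
    if m == NEG_INF:
        # Every position is masked: the sum of exponentials would be 0.
        raise ValueError("all positions are masked; softmax is undefined")
    exps = [math.exp(s - m) for s in masked]            # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0, 0.5]
allowed = [True, True, False, False]   # e.g. a causal mask hiding later positions
weights = masked_softmax(scores, allowed)
print([round(w, 3) for w in weights])  # [0.731, 0.269, 0.0, 0.0]
```

Note that position 2 had the highest raw score, yet it receives zero weight once masked. A large finite constant such as −1e9 behaves almost identically in float32 while keeping a fully masked row finite instead of undefined.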

Connection to cross-entropy and learning signals #

Softmax is commonly paired with cross-entropy loss.

If the true class is k and predicted probabilities are pᵢ, then

$$\mathcal{L} = -\log p_k.$$

When the model assigns low probability to the correct class, the loss is large.

A key internal quantity is log-sum-exp:

$$p_k = \frac{e^{x_k}}{\sum_j e^{x_j}} \quad\Rightarrow\quad -\log p_k = -x_k + \log\left(\sum_j e^{x_j}\right).$$

This is one reason stable log-softmax implementations are so common.
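The loss can be computed straight from logits using the log-sum-exp form, without materializing the probabilities first. The function name `cross_entropy_from_logits` and the example targets are illustrative assumptions.

```python
import math

def cross_entropy_from_logits(logits, target):
    """-log p_target, computed as log-sum-exp minus the target logit."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    return lse - logits[target]                                # -x_k + log(sum_j e^{x_j})

logits = [2.0, 1.0, 0.0]
loss_good = cross_entropy_from_logits(logits, target=0)  # true class has the top logit
loss_bad = cross_entropy_from_logits(logits, target=2)   # true class has the lowest logit
print(round(loss_good, 4), round(loss_bad, 4))
```

The two losses differ by exactly the logit gap between the two targets (here 2), since the log-sum-exp term is shared.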

Interpreting logits vs probabilities #

Logits contain “un-normalized evidence.” Softmax converts them to probabilities, but:

Summary of when softmax is the right tool #

Use softmax when you need:

Avoid or reconsider when you need:

With these connections, you’re ready to use softmax as a dependable building block for attention mechanisms, masking, and sequence-to-sequence models.

Worked Examples (3) #

Compute softmax probabilities (and see how gaps matter) #

Let x = (2, 1, 0). Compute softmax(x) exactly as exponentiate-and-normalize.

  1. Write the definition:

    softmax(x)ᵢ = exp(xᵢ) / (exp(2) + exp(1) + exp(0)).

  2. Compute exponentials:

    exp(2) ≈ 7.389,

    exp(1) ≈ 2.718,

    exp(0) = 1.

  3. Sum them:

    S = 7.389 + 2.718 + 1 = 11.107.

  4. Normalize each component:

    p₁ = 7.389 / 11.107 ≈ 0.665,

    p₂ = 2.718 / 11.107 ≈ 0.245,

    p₃ = 1 / 11.107 ≈ 0.090.

  5. Check the distribution sums to 1 (up to rounding):

    0.665 + 0.245 + 0.090 = 1.000.

Insight: Softmax cares about differences: (2 vs 1 vs 0) becomes roughly (0.665, 0.245, 0.090). A 1-point logit gap turns into a factor of e ≈ 2.72 in probability mass before normalization.

Numerical stability: naive softmax overflows, max-shifted softmax works #

Let x = (1000, 1001, 999). Show why naive computation fails and compute softmax stably using the max trick.

  1. Naive approach would require exp(1000), exp(1001), exp(999).

    In float32/float64, all three exponentials overflow to ∞, so the naive result is ∞/∞, which evaluates to NaN.

  2. Use shift-invariance:

    Let m = max(x) = 1001.

    Define zᵢ = xᵢ − m, so z = (-1, 0, -2).

  3. Compute exponentials safely:

    exp(-1) ≈ 0.3679,

    exp(0) = 1,

    exp(-2) ≈ 0.1353.

  4. Sum:

    S = 0.3679 + 1 + 0.1353 = 1.5032.

  5. Normalize:

    p₁ = 0.3679 / 1.5032 ≈ 0.2447,

    p₂ = 1 / 1.5032 ≈ 0.6652,

    p₃ = 0.1353 / 1.5032 ≈ 0.0900.

Insight: Subtracting max(x) doesn’t change softmax outputs, but it caps the largest exponent at 0 (so the largest exponential is exp(0) = 1), preventing overflow and improving precision.

Temperature scaling changes sharpness without changing the argmax #

Let x = (2, 1, 0). Compute softmax(x/T) for T = 2, 1, 0.5 and compare.

  1. Case T = 2:

    x/2 = (1, 0.5, 0).

    exp values: (2.718, 1.649, 1).

    Sum S ≈ 5.367.

    Probabilities: (0.506, 0.307, 0.186).

  2. Case T = 1:

    Already computed: (0.665, 0.245, 0.090).

  3. Case T = 0.5:

    x/0.5 = (4, 2, 0).

    exp values: (54.598, 7.389, 1).

    Sum S ≈ 62.987.

    Probabilities: (0.867, 0.117, 0.016).

  4. Compare:

    As T decreases, p₁ increases and the distribution becomes more peaked.

    The argmax remains index 1 for all T > 0 (since scaling by 1/T preserves order).

Insight: Temperature doesn’t change which logit is largest, but it strongly affects how much probability mass concentrates on the top options—critical for attention sharpness and calibration.

Key Takeaways #

Common Mistakes #

Practice #

easy

Compute softmax(x) for x = (0, 0, 0, 0). What distribution do you get and why?

Hint: All exponentials are equal; normalize by their sum.

Solution:

exp(0)=1 for each entry, sum = 4, so each probability is 1/4. Softmax returns the uniform distribution when all logits are equal.

medium

Show (algebraically) that softmax is shift-invariant: softmax(x + c1) = softmax(x).

Hint: Factor e^c out of numerator and denominator.

Solution:

Let yᵢ = xᵢ + c. Then softmax(y)ᵢ = e^{xᵢ+c}/∑ⱼ e^{xⱼ+c} = (e^c e^{xᵢ})/(e^c ∑ⱼ e^{xⱼ}) = e^{xᵢ}/∑ⱼ e^{xⱼ} = softmax(x)ᵢ.

hard

Let x = (3, 1, -1). Compute softmax(x/T) for T = 1 and T = 2 (use the max trick if you want). Which is more peaked? Explain using ratios.

Hint: Compare p₁/p₂ = exp((x₁-x₂)/T).

Solution:

For T=1: exponentials are (e^3, e^1, e^{-1}) ≈ (20.085, 2.718, 0.368). Sum ≈ 23.171. So p ≈ (0.867, 0.117, 0.016).

For T=2: logits are (1.5, 0.5, -0.5). exponentials ≈ (4.482, 1.649, 0.607). Sum ≈ 6.738. So p ≈ (0.665, 0.245, 0.090).

T=1 is more peaked. Ratio explanation: p₁/p₂ = exp((3-1)/T) = exp(2/T). For T=1 ratio is e^2≈7.39; for T=2 ratio is e^1≈2.72, so the top class dominates more at lower T.

Connections #

Quality: A (4.5/5)
