Loss Functions


Machine Learning · Difficulty: ★★★★☆ · Depth: 9 · Unlocks: 2

Measuring model error. MSE, cross-entropy, hinge loss, custom losses.


Core Concepts #

Key Symbols & Notation #

L(y, ŷ): pointwise loss function (true label y, prediction ŷ).

R̂(θ): empirical risk = (1/n) ∑ᵢ L(yᵢ, f(xᵢ; θ))

Essential Relationships #

Prerequisites (3) #

Machine Learning Introduction (5 atoms), Cross-Entropy (5 atoms), Convex Functions (4 atoms)

Unlocks (2) #

RLHF (lvl 5), Task Discretization (lvl 5)

Referenced by (4) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (4) #

[pipeline (Business)](/business/pipeline/): The OBJECTIVE panel of the pipeline maps directly to loss functions - the formal specification of what the model is trying to minimize.

[Utility Function (Business)](/business/utility-function/): ML's operationalization of a utility function (inverted) - MSE, cross-entropy, and custom losses each encode a different definition of what 'good' means for a model.

[competitive moat (Business)](/business/competitive-moat/): Loss functions are the mathematical formalization of verifiers - they define what 'good' means for a model. The moat-in-verifiers thesis is essentially that defining and computing the right loss (evaluation criteria, domain-specific quality gates) is harder and more defensible than optimizing against it.

[Quality Systems (Business)](/business/quality-systems/): Defining 'quality' for an AI system means choosing a loss function - what errors cost, which failure modes matter more, how to weight precision vs recall. Quality metrics are domain-specific loss functions.

Advanced Learning Details

Graph Position #

Depth Cost: 133

Fan-Out (ROI): 2

Bottleneck Score: 2

Chain Length: 9

Cognitive Load #

Atomic Elements: 7

Total Elements: 47

Percentile Level: L3

Atomic Level: L4

All Concepts (19) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Training a model is mostly the art of turning “what we want” into a single number we can minimize. That single number is built from a loss function.

TL;DR:

A loss function L(y, ŷ) measures error on one example; empirical risk R̂(θ) averages it over data. Good losses encode the right behavior (e.g., calibrated probabilities vs large margins), match output types (real-valued vs probabilities), and are differentiable or subdifferentiable so gradient-based optimization can work. Common choices: MSE for regression, cross-entropy for probabilistic classification, hinge loss for margin-based classifiers, plus custom losses when the task metric isn’t directly optimizable.

What Is a Loss Function? #

Why we need a loss at all #

A learning algorithm needs a feedback signal: given a prediction ŷ and a true target y, how bad was the prediction? In machine learning we usually want something we can:

  1. Compute per example (so we can aggregate over a dataset)
  2. Differentiate (so we can update parameters θ via gradients)
  3. Optimize (ideally with stable, predictable behavior)

A pointwise loss is a scalar function

L(y, ŷ) ∈ ℝ

that measures the error on a single example.

Given a dataset {(xᵢ, yᵢ)}ᵢ₌₁ⁿ and a model f(x; θ) producing ŷᵢ = f(xᵢ; θ), the training objective is typically the empirical risk:

R̂(θ) = (1/n) ∑ᵢ L(yᵢ, f(xᵢ; θ))

This turns learning into an optimization problem:

θ* = argmin_θ R̂(θ)
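
As a minimal sketch of the definitions above, the snippet below computes the empirical risk as a plain average of pointwise losses and then approximates the argmin by a coarse grid search. The function names, the toy dataset, the one-parameter model ŷ = w·x, and the grid of candidates are all illustrative assumptions; real training would use gradient-based optimization rather than a grid.

```python
from typing import Callable, Sequence, Tuple


def empirical_risk(
    loss: Callable[[float, float], float],
    model: Callable[[float], float],
    data: Sequence[Tuple[float, float]],
) -> float:
    """Average the pointwise loss L(y, y_hat) over all (x, y) pairs."""
    return sum(loss(y, model(x)) for x, y in data) / len(data)


def half_squared_error(y: float, y_hat: float) -> float:
    """Pointwise loss (1/2)(y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2


# Hypothetical toy data and a one-parameter model y_hat = w * x.
data = [(1.0, 2.0), (2.0, 3.0)]
w = 1.0
risk = empirical_risk(half_squared_error, lambda x: w * x, data)

# Crude stand-in for argmin over theta: try w in {0.0, 0.1, ..., 3.0}.
candidates = [i / 10 for i in range(31)]
best_w = min(
    candidates,
    key=lambda c: empirical_risk(half_squared_error, lambda x: c * x, data),
)
print(risk, best_w)
```

The grid search is only there to make "argmin" concrete; it does not scale beyond a handful of parameters.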

Loss vs metric (don’t conflate them) #

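A metric is what you care about at evaluation time (accuracy, F1, BLEU, AUC, etc.). A loss is what you minimize during training. They often differ because many metrics are piecewise constant or otherwise non-differentiable in the model parameters (accuracy, for instance, does not change under a small parameter update that flips no predictions), so they provide no useful gradient.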

Loss functions are usually surrogates: smooth (or piecewise smooth) objectives that correlate with the metric but are easier to optimize.

The “shape” of a loss encodes behavior #

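Losses don’t just say “wrong or right”; they encode how wrong. For example, squared error penalizes an error of 10 a hundred times more than an error of 1, absolute error only ten times more, and cross-entropy grows without bound as the probability assigned to the true class approaches 0.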

The choice of loss is therefore a modeling decision, not a minor detail.

Differentiability and subdifferentiability #

Most modern training uses gradients. For that, we want ∂L/∂ŷ to exist (at least almost everywhere). Some useful losses are not differentiable everywhere (e.g., hinge, absolute error). For convex piecewise-linear losses, we can use subgradients.

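A quick mental model: where the loss is smooth, use its gradient; at a kink of a convex loss, any subgradient is a valid choice.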

Even in deep learning, many “non-smooth” losses are fine because the non-differentiable points form a set of measure zero; optimization typically proceeds with valid gradients almost everywhere.

Core Mechanic 1: Pointwise Loss → Empirical Risk (and Gradients) #

Why this decomposition matters #

It’s tempting to see training as “minimize a big function.” In practice, the big function is built from many small ones.

This structure enables mini-batch training: you estimate the full gradient using a small subset of examples.

From loss to gradient: the chain rule pipeline #

Suppose the model output is ŷ = f(x; θ). Then

R̂(θ) = (1/n) ∑ᵢ L(yᵢ, f(xᵢ; θ))

Differentiate w.r.t. θ:

∇θ R̂(θ)

= ∇θ ( (1/n) ∑ᵢ L(yᵢ, f(xᵢ; θ)) )

= (1/n) ∑ᵢ ∇θ L(yᵢ, f(xᵢ; θ))

Now apply the chain rule for each term:

∇θ L(yᵢ, f(xᵢ; θ))

= (∂L/∂ŷᵢ) · ∇θ f(xᵢ; θ)

So losses matter because they determine ∂L/∂ŷ, the “error signal” that gets backpropagated.
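
The chain-rule pipeline can be checked numerically. The sketch below assumes a one-parameter model ŷ = w·x (so ∇θ f is just x) and the half-squared-error loss; the helper names and the toy data are illustrative. It compares the analytic gradient (1/n) ∑ᵢ (∂L/∂ŷᵢ)·xᵢ with a central finite difference of the empirical risk.

```python
def half_sq(y, y_hat):
    """Pointwise loss (1/2)(y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2


def half_sq_grad(y, y_hat):
    """dL/dy_hat for the loss above: the residual y_hat - y."""
    return y_hat - y


def risk(w, data, loss):
    """Empirical risk for the model y_hat = w * x."""
    return sum(loss(y, w * x) for x, y in data) / len(data)


def grad_risk(w, data, dloss_dyhat):
    """(1/n) * sum_i (dL/dy_hat_i) * (d y_hat_i / dw), with d(w*x)/dw = x."""
    return sum(dloss_dyhat(y, w * x) * x for x, y in data) / len(data)


data = [(1.0, 2.0), (2.0, 3.0)]  # illustrative (x, y) pairs
w, eps = 1.0, 1e-6

analytic = grad_risk(w, data, half_sq_grad)
numeric = (risk(w + eps, data, half_sq) - risk(w - eps, data, half_sq)) / (2 * eps)
print(analytic, numeric)
```

Swapping in a different loss only requires changing the two loss functions; the model-side factor x stays the same, which is the point of the decomposition.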

Mini-batches: stochastic empirical risk #

For a mini-batch B of size m, we use:

R̂_B(θ) = (1/m) ∑ᵢ∈B L(yᵢ, f(xᵢ; θ))

and update parameters using ∇θ R̂_B(θ). This is an unbiased estimate of the full gradient when B is sampled uniformly.
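
The unbiasedness claim can be illustrated on a tiny dataset by enumerating every possible mini-batch of a given size: under uniform sampling each batch is equally likely, so the plain average of their gradients is the expected mini-batch gradient. The sketch below assumes the model ŷ = w·x with half-squared error and synthetic data; all names and values are illustrative.

```python
import random
from itertools import combinations


def grad_on(batch, w):
    """Gradient of the mean half-squared error over `batch` for y_hat = w * x."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


random.seed(0)
# Synthetic data: y is roughly 2x plus a little noise (an assumption for the demo).
data = [(float(x), 2.0 * x + random.gauss(0.0, 0.1)) for x in range(1, 9)]
w = 0.5
m = 2  # mini-batch size

full_grad = grad_on(data, w)

# Every size-m subset is equally likely under uniform sampling without replacement.
batch_grads = [grad_on(batch, w) for batch in combinations(data, m)]
expected_batch_grad = sum(batch_grads) / len(batch_grads)

print(full_grad, expected_batch_grad)
```

Individual mini-batch gradients still vary around the full gradient; only their expectation matches it.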

Output type drives loss choice #

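Before picking a loss, identify what ŷ represents: a real value (regression), a probability or probability vector (probabilistic classification), or an unnormalized score such as a logit or margin.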

Loss functions are often defined in terms of probabilities, but implemented with logits for numerical stability.

A comparison table (what you optimize changes what you get) #

| Task / Output | Common loss L | What it encourages | Typical model output |
| --- | --- | --- | --- |
| Regression (real-valued) | MSE: (1/2)(ŷ − y)² | Mean prediction; large errors costly | ŷ ∈ ℝ |
| Regression (robust) | MAE: \|ŷ − y\| | Median prediction; robust to outliers | ŷ ∈ ℝ |
| Binary classification (probabilistic) | Logistic / BCE | Calibrated probabilities | logit z or p̂ |
| Multiclass classification (probabilistic) | Cross-entropy | Correct class probability → 1 | logits z, softmax |
| Margin-based classification | Hinge | Large margin separation | score s or wᵀx |

The same dataset can yield qualitatively different models depending on whether you optimize probability calibration (cross-entropy) or margins (hinge).

Core Mechanic 2: Canonical Losses (MSE, Cross-Entropy, Hinge) and Their Geometry #

1) Mean Squared Error (MSE) #

Why MSE is so common #

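MSE makes sense when you want the model to predict the conditional mean of y and large errors should cost disproportionately more than small ones. It is also the negative log-likelihood under Gaussian noise with fixed variance, up to additive and multiplicative constants.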

Pointwise MSE is often written with a 1/2 for cleaner derivatives:

L(y, ŷ) = (1/2)(ŷ − y)²

Derivative wrt prediction:

∂L/∂ŷ

= ∂/∂ŷ [ (1/2)(ŷ − y)² ]

= (1/2) · 2(ŷ − y)

= (ŷ − y)

This is a very clean error signal: “prediction minus truth.”

For vector-valued regression y ∈ ℝᵈ, you often use:

L(y, ŷ) = (1/2)‖ŷ − y‖²

with gradient:

∇_{ŷ} L = ŷ − y

What can go wrong #

If your data has heavy-tailed noise or outliers, squared error can over-focus on a few extreme points. A robust alternative is MAE or Huber loss.


2) Cross-Entropy (Log Loss) #

You already know cross-entropy conceptually (H(p, q) = H(p) + KL(p‖q)). Here we connect it directly to the training loss and its gradients.

Binary cross-entropy (BCE) #

For y ∈ {0, 1} and predicted probability p̂ ∈ (0, 1):

L(y, p̂) = −[ y log(p̂) + (1 − y) log(1 − p̂) ]

This is the negative log-likelihood of a Bernoulli model.

If the model outputs a logit z ∈ ℝ with p̂ = σ(z) = 1/(1 + e^(−z)), the BCE gradient wrt z has a beautifully simple form:

First, note:

∂L/∂p̂ = −( y/p̂ − (1 − y)/(1 − p̂) )

and

∂p̂/∂z = p̂(1 − p̂)

Chain rule:

∂L/∂z

= (∂L/∂p̂)(∂p̂/∂z)

= −( y/p̂ − (1 − y)/(1 − p̂) ) · p̂(1 − p̂)

= −( y(1 − p̂) − (1 − y)p̂ )

= p̂ − y

So for logistic regression / binary classification:

∂L/∂z = p̂ − y

This mirrors MSE’s “prediction minus truth,” but in probability space.
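
A quick numerical check of this identity, as a sketch: the code below evaluates BCE through the sigmoid and compares the analytic gradient p̂ − y with a central finite difference in z. The helper names are illustrative, and the loss is written in the naive form, which is adequate for moderate logits but not for very large |z|.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(y, z):
    """Naive BCE: fine for moderate z, not numerically safe for large |z|."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_grad_logit(y, z):
    """Analytic gradient dL/dz = p_hat - y."""
    return sigmoid(z) - y


def numeric_grad(y, z, eps=1e-6):
    """Central finite difference of the loss with respect to the logit."""
    return (bce_from_logit(y, z + eps) - bce_from_logit(y, z - eps)) / (2 * eps)


for y, z in [(1, -1.0), (0, -1.0), (1, 2.5), (0, 0.3)]:
    print(y, z, bce_grad_logit(y, z), numeric_grad(y, z))
```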

Multiclass cross-entropy #

Let y be a one-hot vector over K classes, and p̂ = softmax(z), where:

p̂ⱼ = exp(zⱼ) / ∑ₖ exp(zₖ)

Cross-entropy loss:

L(y, p̂) = −∑ⱼ yⱼ log(p̂ⱼ)

If y is one-hot with true class c, this simplifies to:

L = −log(p̂_c)

A key gradient identity (used constantly in deep learning):

∂L/∂zⱼ = p̂ⱼ − yⱼ

That is: softmax + cross-entropy yields a stable gradient of “predicted distribution minus target distribution.”
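
The identity can be verified numerically. The following sketch implements softmax and cross-entropy in plain Python (names are illustrative) and compares p̂ − y against central finite differences on each logit, using example logits chosen only for the demonstration.

```python
import math


def softmax(z):
    a = max(z)  # shift by the max for numerical stability
    exps = [math.exp(v - a) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y, z):
    """L = -sum_j y_j log(p_hat_j), with p_hat = softmax(z)."""
    p = softmax(z)
    return -sum(yj * math.log(pj) for yj, pj in zip(y, p) if yj > 0)


def ce_grad_logits(y, z):
    """Analytic gradient: predicted distribution minus target distribution."""
    return [pj - yj for pj, yj in zip(softmax(z), y)]


def numeric_grad(y, z, eps=1e-6):
    grads = []
    for j in range(len(z)):
        up, down = list(z), list(z)
        up[j] += eps
        down[j] -= eps
        grads.append((cross_entropy(y, up) - cross_entropy(y, down)) / (2 * eps))
    return grads


y = [0.0, 1.0, 0.0]   # one-hot target (true class is the second one)
z = [2.0, 0.0, -1.0]  # example logits
analytic = ce_grad_logits(y, z)
numeric = numeric_grad(y, z)
print(analytic)
print(numeric)
```

Because both p̂ and y sum to 1, the gradient components sum to zero: raising every logit by the same constant does not change the loss.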

Geometry: why it punishes confident mistakes #

If the true class is c but p̂_c is tiny, then −log(p̂_c) is huge. That’s exactly the point: the loss is extremely sensitive to confident wrong predictions.


3) Hinge Loss (Margin-Based) #

Why hinge exists #

Sometimes you don’t care about calibrated probabilities; you care about a decision boundary with a safety buffer. Hinge loss is a convex surrogate for 0–1 classification loss that promotes a margin.

For binary labels y ∈ {−1, +1} and a score s (often s = wᵀx + b):

L(y, s) = max(0, 1 − y s)

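Interpretation: if y s ≥ 1, the example is on the correct side with a margin of at least 1 and the loss is 0; if y s < 1, the loss grows linearly with the margin violation 1 − y s.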

Subgradient #

Hinge is not differentiable at y s = 1, but it is subdifferentiable.

For the score s:

If 1 − y s < 0 ⇒ L = 0 ⇒ ∂L/∂s = 0

If 1 − y s > 0 ⇒ L = 1 − y s ⇒ ∂L/∂s = −y

At 1 − y s = 0 ⇒ subgradient set includes values between 0 and −y

This piecewise behavior is what creates the “support vectors”: only points violating the margin contribute gradients.
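
The case analysis above translates into a few lines of code. This is a sketch with illustrative function names; at the kink y s = 1 it picks 0 from the subgradient set, which is one valid choice among several.

```python
def hinge_loss(y, s):
    """L(y, s) = max(0, 1 - y*s) for y in {-1, +1}."""
    return max(0.0, 1.0 - y * s)


def hinge_subgradient(y, s):
    """One valid subgradient of the hinge loss with respect to s.

    Inside the margin (y*s < 1) the slope is -y; otherwise it is 0.
    At the kink y*s == 1 any value between 0 and -y is valid; this sketch picks 0.
    """
    return -float(y) if y * s < 1.0 else 0.0


examples = [(1, 2.0), (1, 0.2), (-1, 0.3)]  # (label, score) pairs
results = [(hinge_loss(y, s), hinge_subgradient(y, s)) for y, s in examples]
for (y, s), (loss, g) in zip(examples, results):
    print(f"y={y:+d} s={s:.1f} loss={loss:.1f} dL/ds={g:+.0f}")
```

Only the second and third points produce a non-zero error signal; the first is already beyond the margin and is ignored.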


Choosing between cross-entropy and hinge (a practical comparison) #

| Criterion | Cross-entropy | Hinge |
| --- | --- | --- |
| Output interpretation | Probabilities (calibration-friendly) | Scores / margins |
| Loss on very wrong confident predictions | Very large | Linear in violation |
| Optimization landscape | Smooth (with softmax/sigmoid) | Piecewise-linear (subgradients) |
| Typical use | Neural nets, probabilistic models | SVMs, margin-focused classifiers |

Cross-entropy usually wins in deep learning because it works naturally with probabilistic outputs and backprop, but hinge can be useful when margin is the primary concern.

Application/Connection: Designing and Customizing Losses #

Why custom losses are common #

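Real tasks rarely match “plain regression” or “plain classification.” You may have imbalanced classes, outliers, several objectives to trade off, or an evaluation metric that is not differentiable.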

Custom losses are how you encode these realities into optimization.

Weighted losses (cost-sensitive learning) #

Suppose binary classification where positive examples are rare. A standard BCE might be minimized by predicting “negative” too often.

Weighted BCE:

L(y, p̂) = −[ α y log(p̂) + β (1 − y) log(1 − p̂) ]

For multiclass, use class weights w_c:

L = − w_c log(p̂_c)

This is simple and effective, but you must tune weights carefully; extreme weights can destabilize training.
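
A minimal sketch of the weighted BCE formula, where α scales the penalty on positive examples and β the penalty on negatives. The function name, default weights, and the example values are assumptions for illustration; how to choose α and β is task-dependent and not prescribed here.

```python
import math


def weighted_bce(y, p_hat, alpha=1.0, beta=1.0):
    """L = -[alpha * y * log(p_hat) + beta * (1 - y) * log(1 - p_hat)]."""
    return -(alpha * y * math.log(p_hat) + beta * (1 - y) * math.log(1 - p_hat))


# A missed positive (y = 1, p_hat = 0.1) with and without up-weighting positives.
plain_miss = weighted_bce(1, 0.1)
weighted_miss = weighted_bce(1, 0.1, alpha=10.0)  # alpha = 10 is a made-up value

# Negatives are unaffected by alpha.
plain_neg = weighted_bce(0, 0.1)
weighted_neg = weighted_bce(0, 0.1, alpha=10.0)

print(plain_miss, weighted_miss, plain_neg, weighted_neg)
```

With α = 10, one missed positive costs as much as ten would under the unweighted loss, which is how the weighting counteracts the pull toward always predicting “negative.”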

Label smoothing (a small tweak, big effect) #

Instead of one-hot targets, use a softened target distribution:

ỹⱼ = (1 − ε) yⱼ + ε / K

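Then train with cross-entropy to ỹ. This changes the gradient target from y to ỹ: the gradient with respect to the logits becomes p̂ − ỹ, which vanishes at finite logits, so the model is no longer pushed toward extreme logit values and overconfident predictions.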
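
To make the smoothing formula concrete, the sketch below builds the softened target and evaluates cross-entropy against it for two candidate predictions. The helper names, ε = 0.1, and the near-one-hot prediction are illustrative assumptions.

```python
import math


def smooth_labels(y, eps):
    """Softened target: (1 - eps) * y_j + eps / K for each of the K classes."""
    k = len(y)
    return [(1.0 - eps) * yj + eps / k for yj in y]


def soft_cross_entropy(target, p_hat):
    """Cross-entropy -sum_j target_j * log(p_hat_j) against a soft target."""
    return -sum(t * math.log(p) for t, p in zip(target, p_hat))


y = [0.0, 1.0, 0.0]
y_soft = smooth_labels(y, eps=0.1)  # approx [0.0333, 0.9333, 0.0333]

# Loss when the prediction equals the smoothed target vs. when it is near one-hot.
loss_at_target = soft_cross_entropy(y_soft, y_soft)
loss_overconfident = soft_cross_entropy(y_soft, [0.001, 0.998, 0.001])

print(y_soft, loss_at_target, loss_overconfident)
```

The near-one-hot prediction scores worse than the prediction that matches the smoothed target, so the optimum no longer sits at infinite logits.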

Robust regression: Huber loss #

To interpolate between MSE (sensitive) and MAE (robust), use Huber with threshold δ:

Let r = ŷ − y.

L(r) = (1/2) r²  if |r| ≤ δ

L(r) = δ(|r| − δ/2)  if |r| > δ

Derivative wrt ŷ (i.e., wrt r):

∂L/∂ŷ = r  if |r| ≤ δ

∂L/∂ŷ = δ · sign(r)  if |r| > δ

So small errors behave like MSE, but large errors have bounded influence like MAE.
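
A short sketch of this behavior, assuming the standard Huber definition (quadratic (1/2)r² for |r| ≤ δ, linear δ(|r| − δ/2) beyond). Function names and the sample residuals are illustrative.

```python
def huber(r, delta=1.0):
    """Huber loss on the residual r = y_hat - y."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)


def huber_grad(r, delta=1.0):
    """Derivative with respect to r: the residual itself, clipped to [-delta, delta]."""
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta


for r in [0.5, 1.0, 3.0, -3.0, 100.0]:
    print(r, huber(r), huber_grad(r))
```

The gradient never exceeds δ in magnitude, so a residual of 100 pulls on the parameters no harder than a residual of 3, unlike squared error where the pull grows with the residual.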

Multi-task losses: weighted sums #

If your model predicts multiple outputs (say classification + bounding box regression), you may use:

L_total = λ₁ L₁ + λ₂ L₂

This is easy to write, but hard to tune. If λ₂ is too large, the model may ignore task 1.

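A useful check: compare the magnitudes of the weighted terms λ₁L₁ and λ₂L₂ (and of their gradients) during training; if one dominates, the other task is effectively ignored.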

When the metric is non-differentiable #

Many “true” metrics are not differentiable. Common strategies:

  1. Surrogate loss (most common)
  2. Reinforcement learning / policy gradients
  3. Direct search / black-box optimization

Numerical stability: implement losses carefully #

Cross-entropy with softmax can overflow if you compute exp(z) naively. In practice you compute:

log softmax(z_c) = z_c − log(∑ⱼ exp(zⱼ))

using the log-sum-exp trick:

log(∑ⱼ exp(zⱼ))

= a + log(∑ⱼ exp(zⱼ − a))

where a = maxⱼ zⱼ.

Similarly, binary cross-entropy is best computed from logits directly (many libraries provide a “BCEWithLogits” function).
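
Both stability tricks fit in a few lines. The sketch below uses only the standard library; the function names are illustrative and do not correspond to any particular library's API. The logits-based BCE uses the identity L = max(z, 0) − y·z + log(1 + exp(−|z|)), which is algebraically equal to the sigmoid form but never exponentiates a large positive number.

```python
import math


def logsumexp(z):
    """log(sum_j exp(z_j)) computed as a + log(sum_j exp(z_j - a)), a = max_j z_j."""
    a = max(z)
    return a + math.log(sum(math.exp(v - a) for v in z))


def log_softmax(z):
    """log softmax(z)_j = z_j - logsumexp(z)."""
    lse = logsumexp(z)
    return [v - lse for v in z]


def bce_with_logits(y, z):
    """Stable BCE from a logit: max(z, 0) - y*z + log(1 + exp(-|z|))."""
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))


# Large logits that would overflow a naive exp(z) are handled without error.
big = log_softmax([1000.0, 0.0])

# Cross-entropy for logits (2, 0, -1) when the true class is the second one.
ce = -log_softmax([2.0, 0.0, -1.0])[1]

# BCE for y = 1 at a moderate and at an extreme logit.
moderate = bce_with_logits(1, -1.0)
extreme = bce_with_logits(1, -1000.0)

print(big, ce, moderate, extreme)
```

For comparison, calling `math.exp(1000.0)` directly raises `OverflowError`, which is the failure the max-shift avoids.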

Connection to empirical risk minimization (ERM) #

At a high level, “training” is ERM: minimize R̂(θ). The loss is your stand-in for what you actually care about.

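A good loss should be aligned with the behavior you want, matched to the model's output type, and differentiable or subdifferentiable so that gradient-based optimization can work.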

Loss choice is one of the few levers that directly changes what gradients you get, and therefore what model you end up with.

Worked Examples (3) #

MSE for linear regression: compute empirical risk and gradient #

We have a 1D linear model ŷ = w x (no bias for simplicity). Dataset: (x₁, y₁) = (1, 2), (x₂, y₂) = (2, 3). Use pointwise MSE L = (1/2)(ŷ − y)². Compute R̂(w) and dR̂/dw at w = 1.

  1. Model predictions at w = 1:

    ŷ₁ = w x₁ = 1·1 = 1

    ŷ₂ = w x₂ = 1·2 = 2

  2. Pointwise losses:

    L₁ = (1/2)(ŷ₁ − y₁)² = (1/2)(1 − 2)² = (1/2)·1 = 0.5

    L₂ = (1/2)(ŷ₂ − y₂)² = (1/2)(2 − 3)² = (1/2)·1 = 0.5

  3. Empirical risk:

    R̂(w=1) = (1/2)(L₁ + L₂) = (1/2)(0.5 + 0.5) = 0.5

  4. Differentiate R̂(w):

    R̂(w) = (1/n) ∑ᵢ (1/2)(w xᵢ − yᵢ)²

    So

    dR̂/dw = (1/n) ∑ᵢ (1/2)·2(w xᵢ − yᵢ)·xᵢ

    = (1/n) ∑ᵢ (w xᵢ − yᵢ)xᵢ

  5. Evaluate at w = 1:

    dR̂/dw = (1/2)[(1·1 − 2)·1 + (1·2 − 3)·2]

    = (1/2)[(−1)·1 + (−1)·2]

    = (1/2)(−3)

    = −1.5

Insight: The gradient is negative, so increasing w will decrease the loss—exactly what you’d expect since both predictions were too small. Notice how MSE yields a clean residual (ŷ − y) that scales the update.
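
To see the sign of the gradient at work, the sketch below takes a single gradient-descent step from w = 1 on the same two data points. The learning rate of 0.1 is a made-up value for illustration, and the helper names are not from the text.

```python
data = [(1.0, 2.0), (2.0, 3.0)]  # the two (x, y) pairs from this example


def risk(w):
    """Empirical risk with pointwise loss (1/2)(w*x - y)^2."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in data) / len(data)


def grad(w):
    """dR/dw = (1/n) * sum_i (w*x_i - y_i) * x_i."""
    return sum((w * x - y) * x for x, y in data) / len(data)


w = 1.0
lr = 0.1  # hypothetical learning rate
w_new = w - lr * grad(w)  # a negative gradient moves w upward

print(grad(w), w_new, risk(w), risk(w_new))
```

The step increases w and lowers the empirical risk, matching the reading of the negative gradient.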

Binary cross-entropy from logits: compute loss and gradient signal #

Single example with label y = 1. Model outputs a logit z = −1. Compute p̂ = σ(z), BCE loss L(y, p̂), and ∂L/∂z.

  1. Convert logit to probability:

    p̂ = σ(z) = 1/(1 + e^(−z))

    With z = −1:

    p̂ = 1/(1 + e^(1)) ≈ 1/(1 + 2.718) ≈ 0.2689

  2. Binary cross-entropy:

    L = −[ y log(p̂) + (1 − y) log(1 − p̂) ]

    With y = 1:

    L = −log(p̂) = −log(0.2689) ≈ 1.313

  3. Gradient wrt logit uses the identity ∂L/∂z = p̂ − y:

    ∂L/∂z = 0.2689 − 1 = −0.7311

  4. Interpretation of the sign:

    Negative gradient means increasing z will reduce loss.

    Increasing z increases p̂, moving probability toward the correct label y = 1.

Insight: With cross-entropy on logits, the gradient magnitude |p̂ − y| grows as the model assigns less probability to the true class. Here p̂ ≈ 0.27 for the true class, so |∂L/∂z| ≈ 0.73, already close to its maximum of 1.

Hinge loss: identify support vectors and subgradients #

Binary labels y ∈ {−1, +1}. Scores are s = w x (assume w is absorbed into s). We have three examples with (y, s): (1, 2.0), (1, 0.2), (−1, 0.3). Compute hinge losses and ∂L/∂s (subgradient) where applicable.

  1. Recall hinge loss:

    L(y, s) = max(0, 1 − y s)

  2. Example A: (y, s) = (1, 2.0)

    1 − y s = 1 − 1·2.0 = −1.0

    L = max(0, −1.0) = 0

    Gradient signal: ∂L/∂s = 0 (margin satisfied)

  3. Example B: (y, s) = (1, 0.2)

    1 − y s = 1 − 1·0.2 = 0.8

    L = 0.8

    Since 1 − y s > 0, use ∂L/∂s = −y = −1

  4. Example C: (y, s) = (−1, 0.3)

    Compute y s = (−1)·0.3 = −0.3

    1 − y s = 1 − (−0.3) = 1.3

    L = 1.3

    Again 1 − y s > 0, so ∂L/∂s = −y = −(−1) = +1

Insight: Only points inside the margin (or misclassified) produce non-zero gradients. This is the core reason SVMs depend on “support vectors”: many points are ignored once they are comfortably correct.

Key Takeaways #

Common Mistakes #

Practice #

easy

You have a regression target y ∈ ℝ and predictions ŷ. (a) Write the pointwise MSE and MAE losses. (b) For each, compute ∂L/∂ŷ (use subgradient for MAE).

Hint: MSE is quadratic; MAE is absolute value. Remember d|r|/dr = sign(r) for r ≠ 0, and subgradient at 0 is [−1, 1].

Solution:

(a)

MSE: L = (1/2)(ŷ − y)²

MAE: L = |ŷ − y|

(b)

For MSE:

∂L/∂ŷ = (ŷ − y)

For MAE, let r = ŷ − y:

If r > 0: ∂L/∂ŷ = +1

If r < 0: ∂L/∂ŷ = −1

If r = 0: subgradient ∂L/∂ŷ ∈ [−1, 1]

medium

Multiclass cross-entropy: Suppose K = 3, logits z = (2, 0, −1). (a) Compute softmax probabilities p̂. (b) If the true class is c = 2 (1-indexed), compute L = −log(p̂_c).

Hint: Compute exp(zⱼ), divide by the sum. You can factor out max(zⱼ) = 2 for stability: exp(zⱼ − 2).

Solution:

(a)

Let a = max(z) = 2.

Compute exp(z − a):

exp(2−2)=exp(0)=1

exp(0−2)=exp(−2)≈0.1353

exp(−1−2)=exp(−3)≈0.0498

Sum ≈ 1 + 0.1353 + 0.0498 = 1.1851

So

p̂₁ ≈ 1/1.1851 ≈ 0.8438

p̂₂ ≈ 0.1353/1.1851 ≈ 0.1142

p̂₃ ≈ 0.0498/1.1851 ≈ 0.0420

(b)

True class c = 2 ⇒ L = −log(p̂₂)

L ≈ −log(0.1142) ≈ 2.170

medium

Hinge loss and margins: For y ∈ {−1, +1} and s = wᵀx, consider three points with (y, s): (1, 1.2), (−1, −0.4), (−1, 2.0). (a) Compute hinge loss for each. (b) Identify which points contribute non-zero subgradients.

Hint: Compute 1 − y s. If it’s ≤ 0, loss is 0 and gradient is 0.

Solution:

(a) L = max(0, 1 − y s)

Point 1: y s = 1·1.2 = 1.2 ⇒ 1 − y s = −0.2 ⇒ L = 0

Point 2: y s = (−1)·(−0.4) = 0.4 ⇒ 1 − y s = 0.6 ⇒ L = 0.6

Point 3: y s = (−1)·(2.0) = −2.0 ⇒ 1 − y s = 3.0 ⇒ L = 3.0

(b) Non-zero subgradients occur when 1 − y s > 0:

Point 2 and Point 3 contribute; Point 1 does not.

Connections #

Next steps and related nodes:

Useful background refreshers:

Quality: A (4.4/5)
