Logistic Regression



Logistic Regression #

Machine Learning · Difficulty: ★★★☆☆ · Depth: 9 · Unlocks: 12

Binary classification. Sigmoid function, cross-entropy loss.


Core Concepts #

Key Symbols & Notation #

w - parameter (weight) vector (including bias)

Essential Relationships #

Prerequisites (3) #

Machine Learning Introduction (5 atoms)

Maximum Likelihood Estimation (6 atoms)

Gradient Descent (6 atoms)

Unlocks (1) #

Neural Networks (lvl 4)

Referenced by (2) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (2) #

[Credit Utilization (Business)](/business/credit-utilization/)

Credit-scoring models such as FICO-style scorecards are commonly built with logistic regression, with utilization ratio as a key feature; understanding how a continuous ratio maps through a sigmoid to a probability of default clarifies why utilization thresholds (30%, 10%) produce nonlinear score impacts.

[Churn (Business)](/business/churn/)

Churn prediction is a canonical binary classification problem. Logistic regression (sigmoid output = P(churn | features), cross-entropy loss, decision threshold) is the textbook first model taught for churn and remains a production baseline.

Advanced Learning Details

Graph Position #

Depth Cost: 112

Fan-Out (ROI): 12

Bottleneck Score: 5

Chain Length: 9

Cognitive Load #

Atomic Elements: 6

Total Elements: 35

Percentile Level: L2

Atomic Level: L4

All Concepts (14) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Logistic regression is the “hello world” of modern classification: a linear score turned into a probability, trained by a loss that directly matches how Bernoulli (yes/no) data is generated. It’s simple enough to fully understand, but rich enough to connect straight into neural networks.

TL;DR:

Logistic regression models P(y = 1 | x) = σ(w·x + b). Train w (and b) by minimizing binary cross-entropy (negative log-likelihood). The per-example gradient has a clean form: ∇_w ℓ = (ŷ − y)x (with bias gradient ∂ℓ/∂b = ŷ − y), making it easy to optimize with gradient descent.

What Is Logistic Regression? #

The problem it solves #

In binary classification, each example has features x ∈ ℝᵈ and a label y ∈ {0, 1}. You want a model that, given x, outputs a probability that the label is 1:

P(y = 1 | x)

Many models can produce a hard decision (0/1), but logistic regression is designed to output a probability strictly between 0 and 1, which in practice is often reasonably well calibrated.

Why we don’t just use linear regression #

A linear model like w·x + b can be any real number: negative, > 1, etc. That’s not a valid probability.

We need two ingredients:

  1. A linear predictor (score):

z = w·x + b

This is the raw “evidence” for the positive class.

  2. A squashing function that maps ℝ → (0, 1).

Logistic regression chooses the sigmoid (logistic) function:

σ(z) = 1 / (1 + e^(−z))

So the model is:

ŷ = P(y = 1 | x) = σ(w·x + b)
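As a minimal sketch of this model in plain Python (the helper names `sigmoid` and `predict_proba` are illustrative, and the weights are arbitrary example values rather than fitted ones):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)).

    Naive form: fine for moderate z, but math.exp overflows
    once z drops below roughly -709.
    """
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x, b):
    """Return P(y = 1 | x) = sigmoid(w·x + b) for sequences w and x."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)

# Arbitrary example values, assumed for illustration only.
p = predict_proba([0.8, -0.4], [2.0, 1.0], -0.2)
print(round(p, 4))  # 0.7311
```

The score here is z = 1.0, so the output matches σ(1) ≈ 0.7311.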

Intuition: score → probability #

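The score z measures where x lies relative to the hyperplane

w·x + b = 0

Points with z > 0 get ŷ > 0.5, points with z < 0 get ŷ < 0.5, and a larger |z| means a more confident prediction.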

So logistic regression is a linear classifier in geometry, but a probabilistic model in output.

Odds and log-odds (why sigmoid is a natural choice) #

A key reason logistic regression is so standard is that it models the log-odds as linear.

Define odds:

odds = P(y=1|x) / P(y=0|x) = p / (1 − p)

Log-odds (logit):

logit(p) = log(p / (1 − p))

Logistic regression assumes:

log(p / (1 − p)) = w·x + b

Solve for p:

Let z = w·x + b.

p / (1 − p) = e^z

p = e^z (1 − p)

p = e^z − e^z p

p + e^z p = e^z

p(1 + e^z) = e^z

p = e^z / (1 + e^z) = 1 / (1 + e^(−z)) = σ(z)

This shows sigmoid isn’t arbitrary: it’s what you get when you say “log-odds are linear in features.”
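A quick numerical check of this inverse relationship, as a sketch (the function names `sigmoid` and `logit` are illustrative, and the test values of z are arbitrary):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Log-odds of a probability p, assuming 0 < p < 1."""
    return math.log(p / (1.0 - p))

# logit should undo sigmoid, and the odds should equal e^z.
for z in (-3.0, -0.5, 0.0, 2.0):
    p = sigmoid(z)
    assert abs(logit(p) - z) < 1e-9
    assert abs(p / (1.0 - p) - math.exp(z)) < 1e-9
```

If the assertions pass silently, the sigmoid and the logit behave as inverses on these inputs, up to floating-point error.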

Notation note: include bias in w #

Sometimes we fold the bias into the weight vector by adding a constant feature x₀ = 1.

Define the extended feature vector x̃ = [1, x₁, …, x_d] and the extended weight vector w̃ = [b, w₁, …, w_d]. Then:

z = w̃·x̃

This can simplify implementations and derivations.
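A small sketch of the bias trick: prepending a constant 1 to the features and b to the weights gives the same score as the explicit-bias form. The names `w_ext` and `x_ext` and the numeric values are assumptions for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Arbitrary example values.
w, b = [0.8, -0.4], -0.2
x = [2.0, 1.0]

# Fold the bias in: constant feature 1 pairs with the bias weight.
w_ext = [b] + w
x_ext = [1.0] + x

z_plain = dot(w, x) + b
z_ext = dot(w_ext, x_ext)
assert abs(z_plain - z_ext) < 1e-12
```

With this convention a single dot product covers both weights and bias, which is why many derivations drop b from the notation.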

Summary #

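Logistic regression is a linear score z = w·x + b passed through a sigmoid to give ŷ = P(y = 1 | x); equivalently, it is a model whose log-odds are linear in the features.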

Core Mechanic 1: Linear Predictor and the Sigmoid #

Why start with a linear predictor? #

The linear predictor is the simplest way to combine features:

z = w·x + b = ∑ⱼ wⱼ xⱼ + b

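Motivation: each feature contributes additively to the score, the computation is a single dot product, and each weight says how strongly its feature pushes toward the positive class.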

Geometry: a hyperplane decision boundary #

The model is indifferent between the two classes wherever ŷ = 0.5.

Since σ(0) = 0.5, we have:

ŷ = 0.5 ⇔ z = 0 ⇔ w·x + b = 0

That equation describes a hyperplane.

Predicted class often uses a threshold:

predict 1 if ŷ ≥ 0.5 (equivalently z ≥ 0)

Why the sigmoid specifically? #

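We want a function that is smooth (differentiable everywhere), monotonically increasing, and bounded between 0 and 1.

Sigmoid has these properties.

Key values: σ(0) = 0.5, σ(z) → 1 as z → +∞, and σ(z) → 0 as z → −∞.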

Sensitivity: sigmoid derivative #

Training needs gradients. Sigmoid has a famously convenient derivative.

Let p = σ(z) = 1 / (1 + e^(−z)).

Differentiate:

p = (1 + e^(−z))^(−1)

∂p/∂z = −1 · (1 + e^(−z))^(−2) · ∂/∂z (1 + e^(−z))

∂/∂z (1 + e^(−z)) = −e^(−z)

So:

∂p/∂z = (1 + e^(−z))^(−2) · e^(−z)

Now rewrite in terms of p:

p = 1 / (1 + e^(−z))

1 − p = e^(−z) / (1 + e^(−z))

Therefore:

p(1 − p) = [1 / (1 + e^(−z))] · [e^(−z) / (1 + e^(−z))] = e^(−z) / (1 + e^(−z))²

Thus:

∂p/∂z = p(1 − p)

This compact form is one reason logistic regression is so convenient.
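The identity can be sanity-checked numerically with a central finite difference. This is a sketch; the step size `h` and the test points are arbitrary choices, and `sigmoid_grad` is an illustrative name.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    """Analytic derivative p(1 - p) with p = sigmoid(z)."""
    p = sigmoid(z)
    return p * (1.0 - p)

h = 1e-6  # finite-difference step (assumed; small but not tiny)
for z in (-2.0, 0.0, 1.0, 3.0):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
    assert abs(numeric - sigmoid_grad(z)) < 1e-8
```

The derivative peaks at z = 0, where p(1 − p) = 0.25, and shrinks toward 0 as |z| grows.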

Interpretability: weight signs and magnitudes #

Because log-odds are linear:

log(p/(1−p)) = w·x + b

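Each weight wⱼ has a direct interpretation: increasing xⱼ by one unit (holding the other features fixed) adds wⱼ to the log-odds, which multiplies the odds by e^(wⱼ). A positive wⱼ pushes the prediction toward class 1, a negative wⱼ toward class 0.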

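Caution: interpretation depends on feature scaling. If one feature is measured in large units, its weight will tend to be smaller, so compare weight magnitudes only after putting features on a common scale (for example, by standardizing them).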
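To make the log-odds reading of a weight concrete, the sketch below bumps one feature by one unit and checks that the odds are multiplied by e^w for that feature. The weights and inputs are arbitrary illustrative values, and `prob` and `odds` are hypothetical helper names.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def odds(p: float) -> float:
    return p / (1.0 - p)

w, b = [0.8, -0.4], -0.2  # arbitrary example parameters

def prob(x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

x = [2.0, 1.0]
x_bumped = [3.0, 1.0]  # first feature increased by one unit

ratio = odds(prob(x_bumped)) / odds(prob(x))
print(abs(ratio - math.exp(w[0])) < 1e-9)  # True
```

Note that the change in probability itself is not constant; only the multiplicative change in odds is.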

A tiny comparison table #

| Component | Linear regression (for y ∈ ℝ) | Logistic regression (for y ∈ {0,1}) |
| --- | --- | --- |
| Score | w·x + b | w·x + b |
| Output | ŷ = score | ŷ = σ(score) ∈ (0,1) |
| Typical loss | squared error | binary cross-entropy |
| Probabilistic meaning | Gaussian noise assumption | Bernoulli likelihood |

This sets up the next step: choosing a loss that matches Bernoulli labels.

Core Mechanic 2: Binary Cross-Entropy as Negative Log-Likelihood #

Why we need a special loss #

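For classification, we don’t just want “close numeric values.” We want predicted probabilities near 1 for positive examples and near 0 for negative examples, with confident mistakes penalized much more heavily than hesitant ones.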

Binary cross-entropy (BCE) comes directly from maximum likelihood estimation for a Bernoulli model.

Bernoulli model for labels #

Assume for each input x, the label y is drawn as:

P(y = 1 | x) = p

P(y = 0 | x) = 1 − p

with p = σ(z) and z = w·x + b.

The Bernoulli probability mass function can be written compactly as:

P(y | x) = pʸ (1 − p)^(1−y)

because setting y = 1 gives P(y | x) = p, and setting y = 0 gives P(y | x) = 1 − p.

Likelihood for a dataset #

Given N i.i.d. examples {(xᵢ, yᵢ)}:

L(w, b) = ∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1−yᵢ)

where pᵢ = σ(w·xᵢ + b).

Maximizing a product is awkward, so take logs:

log L = ∑ᵢ [ yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ) ]

Maximum likelihood is equivalent to minimizing negative log-likelihood:

J(w, b) = − log L = − ∑ᵢ [ yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ) ]

Often we average over N:

J = (1/N) ∑ᵢ ℓᵢ

with per-example loss:

ℓ = −[ y log p + (1 − y) log(1 − p) ]

That is the binary cross-entropy loss.
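A direct transcription of the per-example and averaged loss, as a sketch. The function names `bce` and `mean_bce` and the sample probabilities are assumptions for illustration; this naive form requires 0 < p < 1 and raises an error at exactly 0 or 1.

```python
import math

def bce(y: int, p: float) -> float:
    """Per-example binary cross-entropy (natural log), assuming 0 < p < 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_bce(ys, ps) -> float:
    """Average BCE over a dataset: J = (1/N) * sum of per-example losses."""
    return sum(bce(y, p) for y, p in zip(ys, ps)) / len(ys)

# Made-up labels and predicted probabilities.
ys = [1, 0, 1]
ps = [0.9, 0.2, 0.6]
print(mean_bce(ys, ps) > 0.0)  # True
```

A prediction of p = 0.5 for every example gives a loss of log 2 ≈ 0.693, a useful reference point for "no better than guessing" on balanced data.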

Why BCE behaves the way we want #

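Consider one example. If y = 1, the loss reduces to ℓ = −log p: it approaches 0 as p → 1 and grows without bound as p → 0. If y = 0, the loss is ℓ = −log(1 − p), which behaves the same way with the roles reversed.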

So BCE strongly penalizes confident mistakes.

Gradient: the clean “prediction minus label” form #

This is the workhorse result that makes training simple.

For one example:

p = σ(z), z = w·x + b

ℓ = −[ y log p + (1 − y) log(1 − p) ]

We compute ∂ℓ/∂z.

Step 1: derivative of ℓ with respect to p:

∂ℓ/∂p = −[ y · (1/p) + (1 − y) · (−1/(1 − p)) ]

∂ℓ/∂p = − y/p + (1 − y)/(1 − p)

Step 2: chain rule with ∂p/∂z = p(1 − p):

∂ℓ/∂z = (∂ℓ/∂p)(∂p/∂z)

∂ℓ/∂z = ( − y/p + (1 − y)/(1 − p) ) · p(1 − p)

Distribute p(1 − p):

∂ℓ/∂z = −y(1 − p) + (1 − y)p

∂ℓ/∂z = −y + yp + p − yp

∂ℓ/∂z = p − y

So the derivative w.r.t. the score is:

∂ℓ/∂z = (p − y) = (ŷ − y)

Now apply z = w·x + b:

∂z/∂w = x

∂z/∂b = 1

Thus:

∇_w ℓ = (ŷ − y)x

∂ℓ/∂b = (ŷ − y)

For the full dataset (averaged):

∇_w J = (1/N) ∑ᵢ (ŷᵢ − yᵢ)xᵢ

∂J/∂b = (1/N) ∑ᵢ (ŷᵢ − yᵢ)

This is the key computational loop: predict p, compute error (p − y), accumulate gradients.
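One way to gain confidence in the "prediction minus label" result is to compare it against finite differences of the loss. The sketch below does this for a single example; the parameter values, the step `h`, and the helper names are all illustrative assumptions.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y) -> float:
    """Per-example BCE as a function of the parameters."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def grad(w, b, x, y):
    """Analytic gradient: ((p - y) * x, p - y)."""
    err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
    return [err * xj for xj in x], err

# Arbitrary test point.
w, b, x, y = [0.3, -0.2], 0.1, [3.0, -1.0], 0
g_w, g_b = grad(w, b, x, y)

h = 1e-6  # central-difference step
num_gb = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
num_gw0 = (loss([w[0] + h, w[1]], b, x, y)
           - loss([w[0] - h, w[1]], b, x, y)) / (2 * h)

assert abs(num_gb - g_b) < 1e-6
assert abs(num_gw0 - g_w[0]) < 1e-6
```

The same check extends to every coordinate of w; only the first is shown to keep the sketch short.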

Convexity (a practical perk) #

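For standard logistic regression (no hidden layers), the BCE objective is convex in (w, b). That means every local minimum is a global minimum, so gradient descent with a suitably small learning rate cannot get trapped in a poor local optimum.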

This is a major difference from neural networks, where the objective is non-convex.

Application/Connection: Training, Decision Thresholds, and the Bridge to Neural Networks #

Training with gradient descent (putting the pieces together) #

A typical training step:

  1. Compute zᵢ = w·xᵢ + b
  2. Compute ŷᵢ = σ(zᵢ)
  3. Compute gradients: g_w = (1/N) ∑ᵢ (ŷᵢ − yᵢ)xᵢ and g_b = (1/N) ∑ᵢ (ŷᵢ − yᵢ)
  4. Update parameters (learning rate η):

w ← w − η g_w

b ← b − η g_b

Because you already know gradient descent, the main learning here is: BCE + sigmoid makes the gradient become “prediction minus label.”
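The training step can be sketched as a full-batch gradient descent loop in plain Python. This is a minimal illustration under stated assumptions: the toy 1-D dataset is made up, and the learning rate, step count, and function names (`train`, `mean_bce`) are arbitrary choices rather than recommendations.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def mean_bce(w, b, xs, ys) -> float:
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def train(xs, ys, eta=0.1, steps=5000):
    """Full-batch gradient descent on averaged BCE, starting from zeros."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        g_w, g_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            # Error term (prediction minus label) drives both gradients.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            for j in range(d):
                g_w[j] += err * x[j] / n
            g_b += err / n
        w = [wj - eta * gj for wj, gj in zip(w, g_w)]
        b -= eta * g_b
    return w, b

# Made-up, non-separable toy data: larger x tends to mean y = 1.
xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
ys = [0, 0, 1, 0, 1, 1]
w, b = train(xs, ys)
print(mean_bce(w, b, xs, ys) < mean_bce([0.0], 0.0, xs, ys))  # True
```

The data is deliberately not linearly separable; on perfectly separable data the unregularized weights keep growing, which is one motivation for regularization.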

Decision thresholds and costs #

The model outputs a probability ŷ. Turning it into a label requires a threshold t: predict 1 if ŷ ≥ t, otherwise predict 0. The default choice is t = 0.5.

But in many real applications, false positives and false negatives have different costs.

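Examples: in medical screening, a missed case (false negative) is usually costlier than a false alarm, which argues for a threshold below 0.5; in spam filtering, losing a legitimate message (false positive) is usually costlier, which argues for a threshold above 0.5.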

So logistic regression naturally supports probability-based decision-making.
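One common way to pick a threshold from costs is to minimize expected cost: predicting 1 costs (1 − p)·c_FP in expectation and predicting 0 costs p·c_FN, so predicting 1 is cheaper when p ≥ c_FP / (c_FP + c_FN). The sketch below assumes correct decisions cost nothing and that the predicted probabilities are well calibrated; the function names and cost values are hypothetical.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold that minimizes expected cost under the assumptions above."""
    return cost_fp / (cost_fp + cost_fn)

def decide(p: float, t: float = 0.5) -> int:
    """Turn a probability into a hard label using threshold t."""
    return 1 if p >= t else 0

# Hypothetical costs: a false negative is 9x as costly as a false positive.
t = cost_threshold(cost_fp=1.0, cost_fn=9.0)

print(decide(0.2))     # 0 with the default threshold of 0.5
print(decide(0.2, t))  # 1 once the cost asymmetry lowers the threshold
```

With equal costs the formula recovers the default t = 0.5.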

Evaluation metrics (quick orientation) #

Accuracy is not always enough, especially with class imbalance.

Common choices: precision, recall, F1 score, ROC AUC, and log loss (BCE on held-out data).

BCE is a natural metric because it evaluates probability quality, not just hard labels.
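A small sketch of why accuracy alone can mislead under class imbalance: a classifier that always predicts 0 scores 90% accuracy on the made-up labels below but has zero recall. The helper names are illustrative, and returning 0.0 when a denominator is empty is a convention chosen here, not a universal one.

```python
def accuracy(y_true, y_pred) -> float:
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up imbalanced labels: 9 negatives, 1 positive.
y_true = [0] * 9 + [1]
y_pred = [0] * 10  # a classifier that always predicts 0

print(accuracy(y_true, y_pred))          # 0.9
print(precision_recall(y_true, y_pred))  # (0.0, 0.0)
```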

Regularization (brief but important) #

To reduce overfitting, add a penalty on w.

L2 regularization (ridge):

J_reg = J + (λ/2)‖w‖²

Gradient adds:

∇_w J_reg = ∇_w J + λw

(Usually the bias b is not regularized.)
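In code, L2 regularization is a small change to the loss and to the weight gradient. The sketch below assumes the unregularized gradient `g_w` has already been computed elsewhere; the function names and numbers are illustrative.

```python
def l2_penalty(w, lam: float) -> float:
    """Penalty term (lam / 2) * ||w||^2; the bias is deliberately excluded."""
    return 0.5 * lam * sum(wj * wj for wj in w)

def add_l2_grad(g_w, w, lam: float):
    """Regularized weight gradient: g_w + lam * w."""
    return [gj + lam * wj for gj, wj in zip(g_w, w)]

# Arbitrary example values.
w = [3.0, 4.0]
g_w = [0.5, -0.5]
lam = 0.1

print(l2_penalty(w, lam))  # 1.25
g_reg = add_l2_grad(g_w, w, lam)
```

The extra term pulls each weight toward zero in proportion to its size, which is why this is sometimes called weight decay.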

L1 regularization (lasso) encourages sparsity, but the penalty is not differentiable at 0, so optimization relies on subgradients or proximal methods and needs more care.

Numerical stability: logits and “BCE with logits” #

Directly computing log(σ(z)) can cause issues when z is very large in magnitude: in floating point, σ(z) rounds to exactly 0 or 1, so log(σ(z)) or log(1 − σ(z)) becomes log(0).

In practice, libraries use a stable form often called binary cross-entropy with logits, where you pass z (the logit) directly.

This is a practical detail, but it matters for robust training.
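One widely used stable rewrite is ℓ = max(z, 0) − z·y + log(1 + e^(−|z|)), which follows from ℓ = softplus(z) − y·z. The sketch below implements it with the standard library; the function name `bce_with_logits` is borrowed as a label for illustration and is not tied to any specific library's implementation.

```python
import math

def bce_with_logits(z: float, y: int) -> float:
    """BCE computed from the logit z without forming sigmoid(z).

    exp is only ever called on a non-positive argument, so it cannot overflow.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Extreme logits that would break the naive log(sigmoid(z)) form.
print(bce_with_logits(1000.0, 0))   # 1000.0
print(bce_with_logits(-1000.0, 1))  # 1000.0
```

For moderate z the result agrees with the naive formula; the two only diverge where the naive one fails.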

Connection to neural networks #

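Logistic regression is a 1-layer neural network: a single linear layer z = w·x + b followed by a sigmoid activation, trained with cross-entropy loss.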

When you later learn neural networks, you’ll generalize the linear layer to multiple layers and nonlinearities. The final layer for binary classification often remains a sigmoid (or a 2-class softmax), and the loss remains cross-entropy.

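So mastering logistic regression means you already understand the linear layer, the sigmoid activation, the cross-entropy loss, and gradient-based training via the chain rule.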

You’re standing right at the entrance to Neural Networks.

Worked Examples (3) #

Compute a prediction and interpret it (score → probability → decision) #

Let w = (0.8, −0.4), b = −0.2. For input x = (2, 1), compute z, ŷ = σ(z), and the predicted class with threshold 0.5.

  1. Compute the linear score:

    z = w·x + b

    = (0.8)(2) + (−0.4)(1) + (−0.2)

    = 1.6 − 0.4 − 0.2

    = 1.0

  2. Map score to probability:

    ŷ = σ(z) = 1 / (1 + e^(−1.0))

    ≈ 1 / (1 + 0.3679)

    ≈ 0.7311

  3. Apply threshold t = 0.5:

    ŷ ≈ 0.7311 ≥ 0.5 ⇒ predict class 1

Insight: The decision boundary is z = 0. Here z = 1 is on the positive side, and sigmoid turns that margin into a probability (about 73%).

Compute binary cross-entropy loss for one example #

Suppose the model predicts ŷ = 0.9 for an example whose true label is y = 1. Then compute the per-example BCE loss. Repeat for a confident wrong prediction ŷ = 0.01 when y = 1.

  1. If y = 1, BCE loss is:

    ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ]

    = −[ 1 · log(0.9) + 0 · log(0.1) ]

    = −log(0.9)

    ≈ 0.1054

  2. For ŷ = 0.01 with y = 1:

    ℓ = −log(0.01)

    ≈ 4.6052

Insight: BCE is gentle when you’re confidently correct, but extremely harsh when you’re confidently wrong—exactly what you want for probabilistic classification.

One gradient descent step on a single example (see the (ŷ − y)x pattern) #

Single training example: x = (3, −1), y = 0. Start with w = (0, 0), b = 0. Use learning rate η = 0.1. Do one gradient descent update.

  1. Compute score:

    z = w·x + b = 0

    So ŷ = σ(0) = 0.5

  2. Compute gradients for one example:

    ∇_w ℓ = (ŷ − y)x

    Here (ŷ − y) = 0.5 − 0 = 0.5

    So:

    ∇_w ℓ = 0.5 · (3, −1) = (1.5, −0.5)

    Bias gradient:

    ∂ℓ/∂b = (ŷ − y) = 0.5

  3. Update parameters:

    w ← w − η ∇_w ℓ

    = (0, 0) − 0.1(1.5, −0.5)

    = (−0.15, 0.05)

    b ← b − η(∂ℓ/∂b)

    = 0 − 0.1(0.5)

    = −0.05

Insight: Because y = 0 but ŷ = 0.5 is too high, (ŷ − y) is positive, so the update moves w and b in a direction that reduces the score z on this example next time.
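The same update can be reproduced in a few lines, which also confirms that the score on this example drops after the step. This is a sketch using the numbers from the example; the variable names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x) -> float:
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Values from the worked example.
w, b = [0.0, 0.0], 0.0
x, y = [3.0, -1.0], 0
eta = 0.1

err = sigmoid(score(w, b, x)) - y                    # 0.5 at the start
w_new = [wj - eta * err * xj for wj, xj in zip(w, x)]  # about (-0.15, 0.05)
b_new = b - eta * err                                # about -0.05

z_old = score(w, b, x)
z_new = score(w_new, b_new, x)
print(z_new < z_old)  # True: the score moved toward the correct label
```

The new score works out to about −0.55, so the predicted probability for this y = 0 example falls below 0.5.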

Key Takeaways #

Common Mistakes #

Practice #

easy

Given w = (1, −2), b = 0.5, and x = (1, 2), compute z, ŷ = σ(z), and the predicted label using threshold 0.5.

Hint: Compute z = 1·1 + (−2)·2 + 0.5, then apply σ(z). If z < 0 then ŷ < 0.5.

Show solution

z = (1)(1) + (−2)(2) + 0.5 = 1 − 4 + 0.5 = −2.5.

ŷ = σ(−2.5) = 1/(1+e^(2.5)) ≈ 1/(1+12.182) ≈ 0.0759.

Since ŷ < 0.5, predict label 0.

medium

Show that for BCE with sigmoid output, the derivative with respect to the logit z is ∂ℓ/∂z = ŷ − y.

Hint: Use ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ], then chain rule: (∂ℓ/∂ŷ)(∂ŷ/∂z). Recall ∂σ/∂z = ŷ(1 − ŷ).

Show solution

Let ŷ = σ(z).

ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ].

Compute:

∂ℓ/∂ŷ = −[ y(1/ŷ) + (1 − y)(−1/(1 − ŷ)) ] = −y/ŷ + (1 − y)/(1 − ŷ).

Also ∂ŷ/∂z = ŷ(1 − ŷ).

So:

∂ℓ/∂z = (−y/ŷ + (1 − y)/(1 − ŷ))·ŷ(1 − ŷ)

= −y(1 − ŷ) + (1 − y)ŷ

= −y + yŷ + ŷ − yŷ

= ŷ − y.

hard

One-step update with two examples (mini-batch): Start w = (0, 0), b = 0, η = 0.2. Examples: (x₁=(1,0), y₁=1) and (x₂=(0,1), y₂=0). Use the average gradient over the two examples to update w and b once.

Hint: With w = 0 and b = 0, both logits are 0 so both predictions are 0.5. Compute (ŷ − y) for each example, then average gradients: (1/N)∑(ŷ − y)x.

Show solution

Initial: z₁ = 0, z₂ = 0 ⇒ ŷ₁ = ŷ₂ = 0.5.

Errors:

(ŷ₁ − y₁) = 0.5 − 1 = −0.5

(ŷ₂ − y₂) = 0.5 − 0 = 0.5

Average weight gradient:

∇_w J = (1/2)[(−0.5)(1,0) + (0.5)(0,1)]

= (1/2)[(−0.5, 0) + (0, 0.5)]

= (−0.25, 0.25)

Average bias gradient:

∂J/∂b = (1/2)[(−0.5) + (0.5)] = 0

Update:

w ← w − η∇_w J = (0,0) − 0.2(−0.25, 0.25) = (0.05, −0.05)

b ← 0 − 0.2(0) = 0.

Connections #

Quality: A (4.5/5)
