Logistic Regression #

Machine LearningDifficulty: ★★★☆☆Depth: 9Unlocks: 12

Binary classification. Sigmoid function, cross-entropy loss.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

-Linear predictor (score): weighted sum of features plus bias, i.e. the model's raw score for an input
-Sigmoid (logistic) function: maps a real-valued score to a probability in [0,1]
-Binary cross-entropy loss: negative log-likelihood for Bernoulli labels (per-example loss)

Key Symbols & Notation #

w - parameter (weight) vector (including bias)

Essential Relationships #

-Predicted probability p is sigmoid of the linear score: p = sigmoid(w dot x)
-Per-example loss is the negative log-likelihood (cross-entropy): L = -[ y*log(p) + (1-y)*log(1-p) ]

Prerequisites (3) #

Machine Learning Introduction5 atoms Maximum Likelihood Estimation6 atoms Gradient Descent6 atoms

Unlocks (1) #

Neural Networkslvl 4

Referenced by (2) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (2) #

[Credit UtilizationBusiness

FICO models are logistic regressions where utilization ratio is a key feature; understanding how a continuous ratio maps through a sigmoid to a probability of default clarifies why utilization thresholds (30%, 10%) produce nonlinear score impacts.](/business/credit-utilization/)[ChurnBusiness

Churn prediction is the canonical binary classification problem. Logistic regression (sigmoid output = P(churn | features), cross-entropy loss, decision threshold) is the textbook first model taught for churn and remains a production baseline.](/business/churn/)

Advanced Learning Details

Graph Position #

112

Depth Cost

Fan-Out (ROI)

Bottleneck Score

Chain Length

Cognitive Load #

Atomic Elements

Total Elements

Percentile Level

Atomic Level

All Concepts (14) #

- Sigmoid (logistic) function as a squashing map σ(z)=1/(1+e^{-z}) that converts a real-valued score to a probability in (0,1)
- Linear predictor (score) z = w^T x + b (or w^T x when bias implicit) that combines inputs with parameters
- Probabilistic binary model p(y=1 | x; w) = σ(w^T x) (Bernoulli conditional model parametrized by w)
- Per-example Bernoulli likelihood for binary labels under the sigmoid model
- Cross-entropy loss (negative log-likelihood) for a single example: ℓ(x,y;w) = -[y log p + (1-y) log(1-p)] where p = σ(w^T x)
- Overall empirical loss / cost J(w) as the average (or sum) of per-example cross-entropies over the dataset
- Decision rule for classification from predicted probability (e.g., predict class 1 if p≥0.5)
- Decision boundary: the geometric set where the model is indifferent (w^T x + b = 0) - a linear separator in input space
- Log-odds (logit): log(p/(1-p)) equals the linear predictor w^T x + b - interpretation linking linear score to odds
- Interpretation of weights: a unit change in a feature shifts the log-odds by the corresponding weight
- Vectorized notation for model predictions and loss using data matrix X and label vector y (e.g., σ(Xw), J(w) = -∑[y log σ(Xw) + (1-y) log(1-σ(Xw))])
- Sigmoid derivative identity σ'(z) = σ(z) (1 - σ(z)), used in computing gradients
- Analytic form of the gradient for one example: ∂ℓ/∂w = (σ(w^T x) - y) x (and analogous vectorized form)
- Convexity of the logistic (cross-entropy) loss with respect to parameters w (so optimization has no non-global local minima)

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Logistic regression is the “hello world” of modern classification: a linear score turned into a probability, trained by a loss that directly matches how Bernoulli (yes/no) data is generated. It’s simple enough to fully understand, but rich enough to connect straight into neural networks.

TL;DR:

Logistic regression models P(y = 1 | x) = σ(w·x + b). Train w (and b) by minimizing binary cross-entropy (negative log-likelihood). The gradient has a clean form: ∇ = (ŷ − y)x (and bias gradient ŷ − y), making it easy to optimize with gradient descent.

What Is Logistic Regression? #

The problem it solves #

In binary classification, each example has features x ∈ ℝᵈ and a label y ∈ {0, 1}. You want a model that, given x, outputs a probability that the label is 1:

•Output near 1 ⇒ “very likely positive”
•Output near 0 ⇒ “very likely negative”

Many models can produce a hard decision (0/1), but logistic regression is designed to produce a calibrated probability in [0, 1].

Why we don’t just use linear regression #

A linear model like w·x + b can be any real number: negative, > 1, etc. That’s not a valid probability.

We need two ingredients:

1)A linear predictor (score):

z = w·x + b

This is the raw “evidence” for the positive class.

2)A squashing function that maps ℝ → [0, 1].

Logistic regression chooses the sigmoid (logistic) function:

σ(z) = 1 / (1 + e^(−z))

So the model is:

ŷ = P(y = 1 | x) = σ(w·x + b)

Intuition: score → probability #

The score z measures where x lies relative to a hyperplane.

•Decision boundary is where z = 0

w·x + b = 0

•If z is large and positive, σ(z) ≈ 1
•If z is large and negative, σ(z) ≈ 0

So logistic regression is a linear classifier in geometry, but a probabilistic model in output.

Odds and log-odds (why sigmoid is a natural choice) #

A key reason logistic regression is so standard is that it models the log-odds as linear.

Define odds:

odds = P(y=1|x) / P(y=0|x) = p / (1 − p)

Log-odds (logit):

logit(p) = log(p / (1 − p))

Logistic regression assumes:

log(p / (1 − p)) = w·x + b

Solve for p:

Let z = w·x + b.

p / (1 − p) = e^z

p = e^z (1 − p)

p = e^z − e^z p

p + e^z p = e^z

p(1 + e^z) = e^z

p = e^z / (1 + e^z) = 1 / (1 + e^(−z)) = σ(z)

This shows sigmoid isn’t arbitrary: it’s what you get when you say “log-odds are linear in features.”

Notation note: include bias in w #

Sometimes we fold the bias into the weight vector by adding a constant feature x₀ = 1.

Define extended feature vector x̃ = [1, x₁, …, x_d] and w̃ = [b, w₁, …, w_d]. Then:

z = w̃·x̃

This can simplify implementations and derivations.

Summary #

Logistic regression is:

•A linear scoring function z = w·x + b
•A probabilistic output ŷ = σ(z)
•A training objective that matches Bernoulli labels via maximum likelihood

Core Mechanic 1: Linear Predictor and the Sigmoid #

Why start with a linear predictor? #

The linear predictor is the simplest way to combine features:

z = w·x + b = ∑ⱼ wⱼ xⱼ + b

Motivation:

•It’s interpretable: each feature contributes additively.
•It’s scalable: works well in high dimensions.
•It’s a strong baseline: many problems are close to linearly separable after good feature engineering.

Geometry: a hyperplane decision boundary #

The set of points where the model is indifferent (predicts 0.5) is where ŷ = 0.5.

Since σ(0) = 0.5, we have:

ŷ = 0.5 ⇔ z = 0 ⇔ w·x + b = 0

That equation describes a hyperplane.

•w is perpendicular (normal) to the hyperplane.
•b shifts the hyperplane.

Predicted class often uses a threshold:

predict 1 if ŷ ≥ 0.5 (equivalently z ≥ 0)

Why the sigmoid specifically? #

We want a function with:

•Output in (0, 1)
•Smooth, differentiable (for gradient descent)
•Monotonic increasing (higher score ⇒ higher probability)

Sigmoid has these properties.

Key values:

•σ(0) = 1/2
•σ(z) → 1 as z → +∞
•σ(z) → 0 as z → −∞

Sensitivity: sigmoid derivative #

Training needs gradients. Sigmoid has a famously convenient derivative.

Let p = σ(z) = 1 / (1 + e^(−z)).

Differentiate:

p = (1 + e^(−z))^(−1)

∂p/∂z = −1 · (1 + e^(−z))^(−2) · ∂/∂z (1 + e^(−z))

∂/∂z (1 + e^(−z)) = −e^(−z)

So:

∂p/∂z = (1 + e^(−z))^(−2) · e^(−z)

Now rewrite in terms of p:

p = 1 / (1 + e^(−z))

1 − p = e^(−z) / (1 + e^(−z))

Therefore:

p(1 − p) = [1 / (1 + e^(−z))] · [e^(−z) / (1 + e^(−z))] = e^(−z) / (1 + e^(−z))²

Thus:

∂p/∂z = p(1 − p)

This compact form is one reason logistic regression is so convenient.

Interpretability: weight signs and magnitudes #

Because log-odds are linear:

log(p/(1−p)) = w·x + b

Each weight wⱼ has a direct interpretation:

•Increasing xⱼ by 1 increases log-odds by wⱼ (holding others fixed).
•If wⱼ > 0, that feature pushes toward class 1.
•If wⱼ < 0, it pushes toward class 0.

Caution: interpretation depends on feature scaling. If one feature is measured in large units, its weight will tend to be smaller.

A tiny comparison table #

Component	Linear regression (for y ∈ ℝ)	Logistic regression (for y ∈ {0,1})
Score	w·x + b	w·x + b
Output	ŷ = score	ŷ = σ(score) ∈ (0,1)
Typical loss	squared error	binary cross-entropy
Probabilistic meaning	Gaussian noise assumption	Bernoulli likelihood

This sets up the next step: choosing a loss that matches Bernoulli labels.

Core Mechanic 2: Binary Cross-Entropy as Negative Log-Likelihood #

Why we need a special loss #

For classification, we don’t just want “close numeric values.” We want:

•confident correct predictions to be rewarded
•confident wrong predictions to be punished strongly
•a probabilistic interpretation (so “0.9” means something)

Binary cross-entropy (BCE) comes directly from maximum likelihood estimation for a Bernoulli model.

Bernoulli model for labels #

Assume for each input x, the label y is drawn as:

P(y = 1 | x) = p

P(y = 0 | x) = 1 − p

with p = σ(z) and z = w·x + b.

The Bernoulli probability mass function can be written compactly as:

P(y | x) = pʸ (1 − p)^(1−y)

because:

•if y = 1 ⇒ p¹(1−p)⁰ = p
•if y = 0 ⇒ p⁰(1−p)¹ = 1−p

Likelihood for a dataset #

Given N i.i.d. examples {(xᵢ, yᵢ)}:

L(w, b) = ∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1−yᵢ)

where pᵢ = σ(w·xᵢ + b).

Maximizing a product is awkward, so take logs:

log L = ∑ᵢ [ yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ) ]

Maximum likelihood is equivalent to minimizing negative log-likelihood:

J(w, b) = − log L = − ∑ᵢ [ yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ) ]

Often we average over N:

J = (1/N) ∑ᵢ ℓᵢ

with per-example loss:

ℓ = −[ y log p + (1 − y) log(1 − p) ]

That is the binary cross-entropy loss.

Why BCE behaves the way we want #

Consider one example.

•If y = 1, the loss is ℓ = −log p
•If p = 0.9 ⇒ ℓ ≈ 0.105
•If p = 0.01 ⇒ ℓ ≈ 4.605 (large penalty)
•If y = 0, the loss is ℓ = −log(1 − p)
•If p = 0.1 ⇒ ℓ ≈ 0.105
•If p = 0.99 ⇒ ℓ ≈ 4.605

So BCE strongly penalizes confident mistakes.

Gradient: the clean “prediction minus label” form #

This is the workhorse result that makes training simple.

For one example:

p = σ(z), z = w·x + b

ℓ = −[ y log p + (1 − y) log(1 − p) ]

We compute ∂ℓ/∂z.

Step 1: derivative of ℓ with respect to p:

∂ℓ/∂p = −[ y · (1/p) + (1 − y) · (−1/(1 − p)) ]

∂ℓ/∂p = − y/p + (1 − y)/(1 − p)

Step 2: chain rule with ∂p/∂z = p(1 − p):

∂ℓ/∂z = (∂ℓ/∂p)(∂p/∂z)

∂ℓ/∂z = ( − y/p + (1 − y)/(1 − p) ) · p(1 − p)

Distribute p(1 − p):

∂ℓ/∂z = −y(1 − p) + (1 − y)p

∂ℓ/∂z = −y + yp + p − yp

∂ℓ/∂z = p − y

So the derivative w.r.t. the score is:

∂ℓ/∂z = (p − y) = (ŷ − y)

Now apply z = w·x + b:

∂z/∂w = x

∂z/∂b = 1

Thus:

∇_w ℓ = (ŷ − y)x

∂ℓ/∂b = (ŷ − y)

For the full dataset (averaged):

∇_w J = (1/N) ∑ᵢ (ŷᵢ − yᵢ)**xᵢ

∂J/∂b = (1/N) ∑ᵢ (ŷᵢ − yᵢ)

This is the key computational loop: predict p, compute error (p − y), accumulate gradients.

Convexity (a practical perk) #

For standard logistic regression (no hidden layers), the BCE objective is convex in (w, b). That means:

•there is a single global minimum
•gradient descent won’t get trapped in bad local minima (though it can still be slow)

This is a major difference from neural networks, where the objective is non-convex.

Application/Connection: Training, Decision Thresholds, and the Bridge to Neural Networks #

Training with gradient descent (putting the pieces together) #

A typical training step:

1)Compute zᵢ = w·xᵢ + b
2)Compute ŷᵢ = σ(zᵢ)
3)Compute gradients:

•g_w = (1/N) ∑ᵢ (ŷᵢ − yᵢ)**xᵢ
•g_b = (1/N) ∑ᵢ (ŷᵢ − yᵢ)

4)Update parameters (learning rate η):

w ← w − η g_w

b ← b − η g_b

Because you already know gradient descent, the main learning here is: BCE + sigmoid makes the gradient become “prediction minus label.”

Decision thresholds and costs #

The model outputs a probability ŷ. Turning it into a label requires a threshold t.

•Default: t = 0.5
•Predict 1 if ŷ ≥ t

But in many real applications, false positives and false negatives have different costs.

Examples:

•Medical screening: prefer fewer false negatives ⇒ lower threshold
•Spam filtering: prefer fewer false positives ⇒ higher threshold

So logistic regression naturally supports probability-based decision-making.

Evaluation metrics (quick orientation) #

Accuracy is not always enough, especially with class imbalance.

Common choices:

•Precision, recall, F1
•ROC curve and AUC
•Log loss (the same BCE, on held-out data)

BCE is a natural metric because it evaluates probability quality, not just hard labels.

Regularization (brief but important) #

To reduce overfitting, add a penalty on w.

L2 regularization (ridge):

J_reg = J + (λ/2)‖w‖²

Gradient adds:

∇_w J_reg = ∇_w J + λw

(Usually the bias b is not regularized.)

L1 regularization (lasso) encourages sparsity, but its gradient uses subgradients and optimization needs more care.

Numerical stability: logits and “BCE with logits” #

Directly computing log(σ(z)) can cause issues when z is very large in magnitude.

In practice, libraries use a stable form often called binary cross-entropy with logits, where you pass z (the logit) directly.

This is a practical detail, but it matters for robust training.

Connection to neural networks #

Logistic regression is a 1-layer neural network:

•Input: x
•Linear layer: z = w·x + b
•Activation: σ(z)
•Loss: BCE

When you later learn neural networks, you’ll generalize the linear layer to multiple layers and nonlinearities. The final layer for binary classification often remains a sigmoid (or a 2-class softmax), and the loss remains cross-entropy.

So mastering logistic regression means you already understand:

•linear layers
•activations
•cross-entropy objectives
•gradient-based training

You’re standing right at the entrance to Neural Networks.

Worked Examples (3) #

Compute a prediction and interpret it (score → probability → decision) #

Let w = (0.8, −0.4), b = −0.2. For input x = (2, 1), compute z, ŷ = σ(z), and the predicted class with threshold 0.5.

Compute the linear score:
z = w·x + b
= (0.8)(2) + (−0.4)(1) + (−0.2)
= 1.6 − 0.4 − 0.2
= 1.0
Map score to probability:
ŷ = σ(z) = 1 / (1 + e^(−1.0))
≈ 1 / (1 + 0.3679)
≈ 0.7311
Apply threshold t = 0.5:
ŷ ≈ 0.7311 ≥ 0.5 ⇒ predict class 1

Insight: The decision boundary is z = 0. Here z = 1 is on the positive side, and sigmoid turns that margin into a probability (about 73%).

Compute binary cross-entropy loss for one example #

Suppose the model predicts ŷ = 0.9 for an example whose true label is y = 1. Then compute the per-example BCE loss. Repeat for a confident wrong prediction ŷ = 0.01 when y = 1.

If y = 1, BCE loss is:
ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ]
= −[ 1 · log(0.9) + 0 · log(0.1) ]
= −log(0.9)
≈ 0.1053
For ŷ = 0.01 with y = 1:
ℓ = −log(0.01)
≈ 4.6052

Insight: BCE is gentle when you’re confidently correct, but extremely harsh when you’re confidently wrong—exactly what you want for probabilistic classification.

One gradient descent step on a single example (see the (ŷ − y)x pattern) #

Single training example: x = (3, −1), y = 0. Start with w = (0, 0), b = 0. Use learning rate η = 0.1. Do one gradient descent update.

Compute score:
z = w·x + b = 0
So ŷ = σ(0) = 0.5
Compute gradients for one example:
∇_w ℓ = (ŷ − y)x
Here (ŷ − y) = 0.5 − 0 = 0.5
So:
∇_w ℓ = 0.5 · (3, −1) = (1.5, −0.5)
Bias gradient:
∂ℓ/∂b = (ŷ − y) = 0.5
Update parameters:
w ← w − η ∇_w ℓ
= (0, 0) − 0.1(1.5, −0.5)
= (−0.15, 0.05)
b ← b − η(∂ℓ/∂b)
= 0 − 0.1(0.5)
= −0.05

Insight: Because y = 0 but ŷ = 0.5 is too high, (ŷ − y) is positive, so the update moves w and b in a direction that reduces the score z on this example next time.

Key Takeaways #

✓
Logistic regression uses a linear score z = w·x + b and converts it to a probability with the sigmoid σ(z).
✓
The decision boundary ŷ = 0.5 corresponds to z = 0, a hyperplane with normal vector w.
✓
The Bernoulli likelihood leads directly to binary cross-entropy: ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ].
✓
A crucial simplification: ∂ℓ/∂z = ŷ − y, giving ∇_w ℓ = (ŷ − y)x and ∂ℓ/∂b = ŷ − y.
✓
Logistic regression’s objective is convex, making optimization more reliable than many non-convex models.
✓
Thresholds can be adjusted away from 0.5 to reflect unequal error costs; probabilities enable this flexibility.
✓
Regularization like (λ/2)‖w‖² is commonly added to reduce overfitting and improves generalization.
✓
Logistic regression is effectively a single-neuron neural network: linear layer + sigmoid + cross-entropy.

Common Mistakes #

✗
Using mean squared error instead of binary cross-entropy, which usually yields worse probabilistic behavior and gradients for classification.
✗
Forgetting the bias term b (or forgetting to include x₀ = 1 when folding bias into w), which can severely limit the decision boundary.
✗
Interpreting weights without considering feature scaling; weights are only comparable when features are on comparable scales.
✗
Computing log(σ(z)) and log(1−σ(z)) naively for large |z|, leading to numerical instability instead of using a stable “BCE with logits” formulation.

Practice #

easy

Given w = (1, −2), b = 0.5, and x = (1, 2), compute z, ŷ = σ(z), and the predicted label using threshold 0.5.

Hint: Compute z = 1·1 + (−2)·2 + 0.5, then apply σ(z). If z < 0 then ŷ < 0.5.

Show solution

z = (1)(1) + (−2)(2) + 0.5 = 1 − 4 + 0.5 = −2.5.

ŷ = σ(−2.5) = 1/(1+e^(2.5)) ≈ 1/(1+12.182) ≈ 0.0759.

Since ŷ < 0.5, predict label 0.

medium

Show that for BCE with sigmoid output, the derivative with respect to the logit z is ∂ℓ/∂z = ŷ − y.

Hint: Use ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ], then chain rule: (∂ℓ/∂ŷ)(∂ŷ/∂z). Recall ∂σ/∂z = ŷ(1 − ŷ).

Show solution

Let ŷ = σ(z).

ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ].

Compute:

∂ℓ/∂ŷ = −[ y(1/ŷ) + (1 − y)(−1/(1 − ŷ)) ] = −y/ŷ + (1 − y)/(1 − ŷ).

Also ∂ŷ/∂z = ŷ(1 − ŷ).

So:

∂ℓ/∂z = (−y/ŷ + (1 − y)/(1 − ŷ))·ŷ(1 − ŷ)

= −y(1 − ŷ) + (1 − y)ŷ

= −y + yŷ + ŷ − yŷ

= ŷ − y.

hard

One-step update with two examples (mini-batch): Start w = (0, 0), b = 0, η = 0.2. Examples: (x₁=(1,0), y₁=1) and (x₂=(0,1), y₂=0). Use the average gradient over the two examples to update w and b once.

Hint: With w = 0 and b = 0, both logits are 0 so both predictions are 0.5. Compute (ŷ − y) for each example, then average gradients: (1/N)∑(ŷ − y)x.

Show solution

Initial: z₁ = 0, z₂ = 0 ⇒ ŷ₁ = ŷ₂ = 0.5.

Errors:

(ŷ₁ − y₁) = 0.5 − 1 = −0.5

(ŷ₂ − y₂) = 0.5 − 0 = 0.5

Average weight gradient:

∇_w J = (1/2)[(−0.5)(1,0) + (0.5)(0,1)]

= (1/2)[(−0.5, 0) + (0, 0.5)]

= (−0.25, 0.25)

Average bias gradient:

∂J/∂b = (1/2)[(−0.5) + (0.5)] = 0

Update:

w ← w − η∇_wJ = (0,0) − 0.2(−0.25, 0.25) = (0.05, −0.05)

b ← 0 − 0.2(0) = 0.

Connections #

•Next: Neural Networks
•Related foundations you used here:
•Maximum Likelihood Estimation
•Gradient Descent
•Often paired with:
•Feature Scaling and Normalization
•Regularization
•Model Evaluation Metrics

Quality: A (4.5/5)

← back to tree browse all →

Logistic Regression #

Interactive Visualization #

Core Concepts #

Key Symbols & Notation #

Essential Relationships #

Prerequisites (3) #

Unlocks (1) #

Referenced by (2) #

From Business (2) #

Graph Position #

Cognitive Load #

All Concepts (14) #

Teaching Strategy #

What Is Logistic Regression? #

The problem it solves #

Why we don’t just use linear regression #

Intuition: score → probability #

Odds and log-odds (why sigmoid is a natural choice) #

Notation note: include bias in w #

Summary #

Core Mechanic 1: Linear Predictor and the Sigmoid #

Why start with a linear predictor? #

Geometry: a hyperplane decision boundary #

Why the sigmoid specifically? #

Sensitivity: sigmoid derivative #

Interpretability: weight signs and magnitudes #

A tiny comparison table #

Core Mechanic 2: Binary Cross-Entropy as Negative Log-Likelihood #

Why we need a special loss #

Bernoulli model for labels #

Likelihood for a dataset #

Why BCE behaves the way we want #

Gradient: the clean “prediction minus label” form #

Convexity (a practical perk) #

Application/Connection: Training, Decision Thresholds, and the Bridge to Neural Networks #

Training with gradient descent (putting the pieces together) #

Decision thresholds and costs #

Evaluation metrics (quick orientation) #

Regularization (brief but important) #

Numerical stability: logits and “BCE with logits” #

Connection to neural networks #

Worked Examples (3) #

Compute a prediction and interpret it (score → probability → decision) #

Compute binary cross-entropy loss for one example #

One gradient descent step on a single example (see the (ŷ − y)**x** pattern) #

Key Takeaways #

Common Mistakes #

Practice #

Connections #

One gradient descent step on a single example (see the (ŷ − y)x pattern) #