Maximum Likelihood Estimation #
Probability & Statistics | Difficulty: ★★★☆☆ | Depth: 6 | Unlocks: 44
Finding parameters that maximize probability of observed data.
Core Concepts #
- Likelihood function: the joint probability (pmf) or density (pdf) of the observed data viewed as a function of the parameter(s) with the data fixed.
- Maximum-likelihood principle: the MLE is the parameter value that maximizes the likelihood function (a point estimate chosen to make the observed data most probable).
- Score / first-order condition: at an interior maximizer, the derivative (gradient) of the log-likelihood with respect to the parameter equals zero (score = 0).
Key Symbols & Notation #
θ - the parameter (scalar or vector) being estimated.
L(θ) - the likelihood function: the joint probability/density of the observed data expressed as a function of θ.
Essential Relationships #
- argmax_θ L(θ) = argmax_θ log L(θ): the logarithm is strictly increasing, so maximizing the log-likelihood gives the same estimator and turns products into sums.
Prerequisites (2) #
- Common Distributions (6 atoms)
- Derivatives (6 atoms)
Unlocks (6) #
- Machine Learning Introduction (lvl 3)
- Bayesian Inference (lvl 4)
- Logistic Regression (lvl 3)
- KL Divergence (lvl 4)
- Confidence Intervals (lvl 3)
- Cross-Validation (lvl 4)
Advanced Learning Details
Graph Position #
- Depth Cost: 68
- Fan-Out (ROI): 44
- Bottleneck Score: 20
- Chain Length: 6
Cognitive Load #
- Atomic Elements: 6
- Total Elements: 32
- Percentile Level: L1
- Atomic Level: L4
All Concepts (13) #
- Likelihood function: viewing the probability or density of observed data as a function of the model parameter(s) (data fixed, parameter variable)
- Log-likelihood: the natural logarithm of the likelihood used to simplify calculations
- Maximum likelihood estimator (MLE): the parameter value(s) that maximize the likelihood/log-likelihood
- IID-product form of the likelihood for independent observations: the joint likelihood is the product of individual densities/probabilities
- Use of the log to convert the product-form likelihood into a sum (computational/numerical simplification)
- Score function: the derivative(s) of the log-likelihood with respect to parameter(s)
- Likelihood equations: setting the score(s) to zero to obtain candidate MLE(s)
- Observed information: the negative second derivative (Hessian) of the log-likelihood at a parameter value (measures curvature)
- Fisher information: the expected information (expected value of the observed information or variance of the score)
- Approximate variance/standard error of an MLE obtained from (observed or Fisher) information
- Invariance property of MLE: the MLE of a function g(θ) is g(θ̂)
- Practical issues for MLE: existence/non-uniqueness, boundary solutions, and need to check second-order conditions for maxima
- Distinction in role between 'likelihood as a function of parameters' and 'probability/density as a function of data'
Teaching Strategy #
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
You’ve collected data. You believe it came from a distribution with some unknown parameter θ. Maximum Likelihood Estimation (MLE) is the idea of choosing θ so that, under your model, the data you actually saw would be as probable as possible.
TL;DR:
Fix the observed data and view the joint pmf/pdf as a function of θ: L(θ). The MLE is θ̂ = argmax_θ L(θ). In practice we maximize the log-likelihood ℓ(θ) = log L(θ), and interior optima satisfy the score equation ∇_θ ℓ(θ) = 0 (plus a second-order/endpoint check).
What Is Maximum Likelihood Estimation? #
The problem MLE is trying to solve (why) #
In statistics and machine learning, we often start with a model family—a distribution we think could plausibly generate our data, but with unknown parameter(s). Examples:
- •Bernoulli(p) for binary outcomes (unknown p)
- •Poisson(λ) for counts (unknown λ)
- •Normal(μ, σ²) for measurements (unknown μ and/or σ²)
You then observe data: x₁, x₂, …, xₙ. The central question is:
Which parameter value θ makes these observations most plausible under the model?
MLE answers: choose the θ that maximizes the probability (for discrete) or density (for continuous) of what you observed.
Likelihood vs probability (intuition) #
This is the most important mental switch:
- •Probability/density: treat θ as fixed, data as random.
- •Likelihood: treat the observed data as fixed, and treat θ as the variable.
Formally, suppose the data are generated i.i.d. from a distribution with pmf/pdf f(x | θ). After you observe x₁, …, xₙ, define the likelihood function
L(θ) = ∏ᵢ f(xᵢ | θ)
This is not “the probability of θ”. It is a function that scores different θ values by how well they explain the observed data.
A concrete picture #
Imagine coin flips: xᵢ ∈ {0, 1}, where 1 = heads. If you see 9 heads out of 10 flips, then:
- •p = 0.9 should give a high likelihood
- •p = 0.1 should give a very low likelihood
The MLE picks the p that yields the highest L(p).
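To make the picture concrete, here is a minimal sketch that evaluates the Bernoulli likelihood for the 9-heads-in-10-flips data at a few candidate values of p. The function name `bernoulli_likelihood` is just an illustrative choice.

```python
def bernoulli_likelihood(p, heads, flips):
    """L(p) = p^heads * (1 - p)^(flips - heads) for one fixed sequence of flips."""
    return p ** heads * (1 - p) ** (flips - heads)

heads, flips = 9, 10  # the observed data from the example above
for p in (0.1, 0.5, 0.9):
    print(f"p = {p}: L(p) = {bernoulli_likelihood(p, heads, flips):.3e}")
```

Among these three candidates, p = 0.9 scores highest and p = 0.1 lowest by many orders of magnitude.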
Definition: maximum-likelihood estimator #
The maximum likelihood estimator (MLE) is
θ̂ = argmax_θ L(θ)
Often θ is a scalar, but in many ML models it is a vector θ ∈ ℝᵈ (for example, a weight vector). The definition is unchanged; the maximization simply ranges over ℝᵈ:
θ̂ = argmax_{θ ∈ ℝᵈ} L(θ)
Why we often maximize log-likelihood instead #
L(θ) is a product of many terms. Products can be:
- •numerically tiny (underflow)
- •algebraically messy
Because log is strictly increasing, maximizing L(θ) is equivalent to maximizing the log-likelihood:
ℓ(θ) = log L(θ) = log(∏ᵢ f(xᵢ | θ))
Use log rules:
ℓ(θ)
= log(∏ᵢ f(xᵢ | θ))
= ∑ᵢ log f(xᵢ | θ)
So MLE becomes:
θ̂ = argmax_θ ℓ(θ)
This “sum of per-example contributions” is one reason likelihood-based methods scale well to large datasets.
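The underflow issue is easy to reproduce. The sketch below uses made-up data (2000 observations, each with probability 0.5 under the model) and compares the raw product with the sum of logs in ordinary double-precision arithmetic.

```python
import math

# Hypothetical data: 2000 observations that each have probability 0.5 under the model.
probs = [0.5] * 2000

likelihood = 1.0
for q in probs:
    likelihood *= q          # the running product shrinks toward zero

log_likelihood = sum(math.log(q) for q in probs)

print(likelihood)       # the product underflows to 0.0 in double precision
print(log_likelihood)   # the sum of logs stays a finite negative number
```

Once the product has collapsed to zero, every parameter value looks equally bad, whereas the log-likelihood can still be compared across parameter values.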
When MLE is a modeling commitment #
MLE is only as good as the model family f(x | θ). If the model is wrong (e.g., assuming normality for heavy-tailed data), the MLE still returns the best-fitting θ within that family—but it may not be a good description of reality. This is not a flaw of calculus; it’s the consequence of the assumptions.
Core Mechanic 1: Building the Likelihood (and Log-Likelihood) #
Start from a generative story (why) #
To write down a likelihood, you need a story for how the data are generated. The story is the distribution f(x | θ).
Typical assumptions:
- 1)Independence: xᵢ’s do not influence each other given θ.
- 2)Identical distribution: all xᵢ share the same parameter θ.
These are simplifying assumptions, but they give the clean factorization:
L(θ) = f(x₁, …, xₙ | θ) = ∏ᵢ f(xᵢ | θ)
If independence is not justified (time series, spatial data), the likelihood changes form—but the MLE principle is the same.
Discrete vs continuous: pmf vs pdf #
- •If x is discrete (Bernoulli, Poisson), f(x | θ) is a pmf and L(θ) is a true probability.
- •If x is continuous (Normal), f(x | θ) is a pdf and L(θ) is a density value (not a probability of an exact point). You can still maximize it.
Likelihood is not bounded by 1 #
A common surprise: densities can exceed 1, so L(θ) can exceed 1. That’s fine. Only probabilities must be ≤ 1.
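A quick check, assuming a Normal model with a small standard deviation (σ = 0.1 is an arbitrary illustrative value): the density at the mean is well above 1, so a likelihood built from such terms can be too.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

density = normal_pdf(0.0, mu=0.0, sigma=0.1)
print(density)   # roughly 3.99, a valid density value even though it exceeds 1
```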
The log-likelihood decomposes nicely #
With i.i.d. data:
ℓ(θ) = ∑ᵢ log f(xᵢ | θ)
This gives you:
- •easier differentiation
- •easier numerical optimization
- •an “average loss” viewpoint: (1/n)ℓ(θ)
In machine learning, we often minimize negative log-likelihood (NLL):
NLL(θ) = −ℓ(θ) = −∑ᵢ log f(xᵢ | θ)
A helpful comparison table #
| Object | Notation | Data treated as | Parameter treated as | Typical use |
|---|---|---|---|---|
| pmf/pdf | f(x \| θ) | random | fixed | describing how data are generated |
| likelihood | L(θ) = ∏ᵢ f(xᵢ \| θ) | fixed (observed) | variable | scoring θ values against the observed data |
| log-likelihood | ℓ(θ) = ∑ᵢ log f(xᵢ \| θ) | fixed | variable | differentiation and numerical optimization |
| negative log-likelihood | −ℓ(θ) | fixed | variable | ML loss minimization |
A note on vector parameters #
If θ is a vector, θ ∈ ℝᵈ, then:
- •the likelihood L(θ) is a function of d variables
- •so is the log-likelihood ℓ(θ)
- •derivatives become gradients ∇_{θ} ℓ(θ)
The geometry matters: you’re maximizing a surface over ℝᵈ, not a curve over ℝ.
Core Mechanic 2: Maximizing the Log-Likelihood (Score and Conditions) #
Why calculus enters (why) #
Once you have ℓ(θ), estimation becomes an optimization problem. For many classical distributions, you can solve it analytically by taking derivatives and setting them to zero.
The key idea: an interior maximum of a differentiable function has derivative 0.
The score function (first-order condition) #
Define the score as the derivative (or gradient) of the log-likelihood:
- •Scalar θ: s(θ) = dℓ(θ)/dθ
- •Vector θ: s(θ) = ∇_{θ} ℓ(θ)
First-order condition (FOC) for an interior optimum:
s(θ̂) = 0
(for vector θ, 0 denotes the zero vector: every partial derivative of ℓ vanishes at θ̂)
This gives candidate solutions.
Second-order condition (is it a max?) #
Setting the derivative to zero finds critical points: maxima, minima, or saddle points.
- •Scalar θ: check d²ℓ(θ)/dθ² < 0 at θ̂ for a local maximum.
- •Vector θ: check the Hessian H(θ) = ∇²_{θ} ℓ(θ). A sufficient condition for a strict local maximum is that H(θ̂) is negative definite.
Boundary solutions matter #
Sometimes the maximum occurs at the boundary of the parameter space.
Example: Bernoulli p must satisfy 0 ≤ p ≤ 1. If all outcomes are 1, the likelihood increases as p → 1, so the MLE is p̂ = 1 (a boundary point). In that case, the derivative-based interior condition may not apply.
Why log-likelihood often yields simple equations #
Because log turns products into sums, differentiation typically yields sums you can simplify.
A pattern you’ll see repeatedly:
- 1)Write ℓ(θ) = ∑ᵢ log f(xᵢ | θ)
- 2)Differentiate term-by-term
- 3)Set the resulting expression to 0
- 4)Solve for θ̂
In many ML models (logistic regression, neural nets), ℓ(θ) is differentiable but does not yield an algebraic closed-form solution.
Then MLE becomes numerical optimization:
- •gradient ascent on ℓ(θ)
- •gradient descent on −ℓ(θ)
- •Newton / quasi-Newton methods using curvature information
Even when we can’t solve the score equation analytically, the score still guides algorithms.
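As a minimal sketch of the numerical route, the code below runs gradient ascent on a Bernoulli log-likelihood. It is only an illustration: the data (3 successes in 10 trials), the step size, and the reparameterization p = sigmoid(z), used so that updates never leave (0, 1), are all assumptions made here rather than anything prescribed above. With this parameterization the score with respect to z is k − n·sigmoid(z).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

n, k = 10, 3          # hypothetical data: 3 successes in 10 Bernoulli trials
z = 0.0               # unconstrained parameter; p = sigmoid(z) stays inside (0, 1)
step_size = 1.0       # hypothetical choice, applied to the average score

for _ in range(500):
    # Score of the log-likelihood with respect to z is k - n * sigmoid(z);
    # dividing by n gives the gradient of the average log-likelihood.
    avg_score = k / n - sigmoid(z)
    z += step_size * avg_score      # gradient ascent step

p_hat = sigmoid(z)
print(p_hat)          # approaches the closed-form MLE k / n
```

The iteration stops moving exactly when the score is zero, which is the first-order condition reached numerically instead of algebraically.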
A small but powerful conceptual link #
Maximizing log-likelihood is equivalent to minimizing average surprise:
(1/n)NLL(θ) = −(1/n)∑ᵢ log f(xᵢ | θ)
So MLE chooses the parameter that makes the observed data as unsurprising as possible under the model.
Application/Connection: How MLE Shows Up in Machine Learning and Statistics #
MLE as the engine behind many ML losses (why) #
A large fraction of “standard losses” in machine learning are just negative log-likelihoods for some probabilistic model.
- •Linear regression with Gaussian noise ⇒ squared error loss
- •Logistic regression ⇒ Bernoulli likelihood ⇒ cross-entropy loss
- •Softmax classification ⇒ categorical likelihood ⇒ multinomial cross-entropy
So MLE isn’t just a statistics technique—it’s a unifying design principle for objective functions.
Example: Gaussian likelihood ⇒ squared error #
Assume:
yᵢ = μ(xᵢ; w) + εᵢ, εᵢ ∼ Normal(0, σ²)
Then:
f(yᵢ | xᵢ, w) = (1/√(2πσ²)) exp(−(yᵢ − μ(xᵢ; w))² / (2σ²))
Log-likelihood (dropping constants not depending on w):
ℓ(w)
= ∑ᵢ [ −(yᵢ − μ(xᵢ; w))² / (2σ²) ] + const
Since σ² > 0 does not depend on w, maximizing ℓ(w) ⇔ minimizing ∑ᵢ (yᵢ − μ(xᵢ; w))²
That is least squares.
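A small numerical check of this equivalence, assuming a one-parameter model μ(x; w) = w·x, made-up data, and a known σ = 1 (all hypothetical choices): the slope that minimizes the Gaussian negative log-likelihood over a grid is the same slope that minimizes the sum of squared errors.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]     # hypothetical inputs
ys = [2.1, 3.9, 6.2, 7.8]     # hypothetical targets
sigma = 1.0                    # assumed known noise standard deviation

def sse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def gaussian_nll(w):
    n = len(xs)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse(w) / (2 * sigma ** 2)

# Closed-form least-squares slope for the model mu(x; w) = w * x.
w_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

grid = [i / 1000 for i in range(1000, 3001)]   # candidate slopes 1.000 .. 3.000
w_nll = min(grid, key=gaussian_nll)
w_sse = min(grid, key=sse)
print(w_ls, w_nll, w_sse)
```

The constant term and the 1/(2σ²) factor shift and rescale the objective but do not move its minimizer.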
Example: Bernoulli likelihood ⇒ cross-entropy #
If yᵢ ∈ {0,1} and model predicts pᵢ = σ(wᵀxᵢ), then
f(yᵢ | xᵢ, w) = pᵢ^{yᵢ} (1 − pᵢ)^{1−yᵢ}
NLL is
−ℓ(w) = −∑ᵢ [ yᵢ log pᵢ + (1−yᵢ) log(1−pᵢ) ]
That is binary cross-entropy.
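The identity can be checked directly. In this sketch the labels and predicted probabilities are made-up numbers; the point is only that the cross-entropy sum equals the negative log of the product of Bernoulli pmf values.

```python
import math

ys = [1, 0, 1, 1]            # hypothetical labels
ps = [0.9, 0.2, 0.7, 0.6]    # hypothetical predicted probabilities p_i

# Likelihood: product of p_i when y_i = 1 and (1 - p_i) when y_i = 0.
likelihood = math.prod(p if y == 1 else 1 - p for y, p in zip(ys, ps))

# Binary cross-entropy written as a sum over examples.
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(ys, ps))

print(bce, -math.log(likelihood))   # the two values agree up to rounding
```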
Statistical properties you’ll later connect to #
MLE is popular because, under regularity conditions and for large n:
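- •Consistency: θ̂ → θ in probability, where θ is the true parameter value
- •Asymptotic normality: √n(θ̂ − θ) ≈ Normal(0, I(θ)⁻¹), where I(θ) is the Fisher information of a single observation
- •Efficiency: the asymptotic variance I(θ)⁻¹/n matches the Cramér–Rao lower bound on the variance of unbiased estimators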
You don’t need these proofs yet, but they explain why MLE is often the default.
MLE vs Bayesian inference (a preview) #
MLE chooses a single best θ.
Bayesian inference treats θ as random and updates a prior p(θ) to a posterior:
p(θ | data) ∝ p(data | θ) p(θ)
Notice p(data | θ) is exactly the likelihood (up to notation). So MLE and Bayes share the same core ingredient; Bayes adds a prior.
A conceptual bridge to KL divergence #
In many settings, maximizing expected log-likelihood is equivalent to minimizing KL divergence between the true data-generating distribution and your model family. This is one reason KL divergence shows up everywhere in ML.
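The reasoning behind that statement fits in one line. Writing p* for the true data-generating distribution (a symbol introduced here for illustration) and f_θ for the model, the KL divergence splits into a term that does not involve θ and the expected log-likelihood:

```latex
\mathrm{KL}\big(p^{*} \,\|\, f_{\theta}\big)
  = \underbrace{\mathbb{E}_{x \sim p^{*}}\!\left[\log p^{*}(x)\right]}_{\text{does not depend on } \theta}
  \;-\; \mathbb{E}_{x \sim p^{*}}\!\left[\log f(x \mid \theta)\right]
\quad\Longrightarrow\quad
\arg\min_{\theta} \mathrm{KL}\big(p^{*} \,\|\, f_{\theta}\big)
  = \arg\max_{\theta} \mathbb{E}_{x \sim p^{*}}\!\left[\log f(x \mid \theta)\right]
```

The average log-likelihood (1/n)ℓ(θ) is the sample version of that expectation, which is why maximizing it approximately minimizes the KL divergence when n is large.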
Connecting to confidence intervals #
Once you have θ̂, you often want uncertainty. Many confidence interval methods start from the curvature of ℓ(θ) near θ̂ (observed Fisher information / Hessian). So MLE is a gateway to inference, not just point estimation.
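As one hedged illustration of the curvature idea, the sketch below computes a standard error for a Bernoulli MLE from the observed information and forms an approximate 95% interval of the form θ̂ ± 1.96·SE. The data (30 successes in 100 trials) and the choice of this particular interval construction are assumptions made for the example.

```python
import math

n, k = 100, 30                      # hypothetical data: 30 successes in 100 trials
p_hat = k / n

# Observed information: negative second derivative of the log-likelihood at p_hat.
observed_info = k / p_hat ** 2 + (n - k) / (1 - p_hat) ** 2
std_error = 1 / math.sqrt(observed_info)

# Approximate 95% interval using the normal quantile 1.96.
lower, upper = p_hat - 1.96 * std_error, p_hat + 1.96 * std_error
print(p_hat, std_error, (lower, upper))
```

For the Bernoulli model the observed information at p̂ simplifies to n / (p̂(1 − p̂)), so the standard error is the familiar √(p̂(1 − p̂)/n): sharper curvature means a tighter interval.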
Worked Examples (3) #
Bernoulli MLE: estimating a coin’s bias #
Let x₁,…,xₙ be i.i.d. Bernoulli(p), where xᵢ ∈ {0,1}. We observe k = ∑ᵢ xᵢ heads (1s). Find the MLE p̂.
Write the pmf for one observation:
f(xᵢ | p) = p^{xᵢ}(1−p)^{1−xᵢ}
Write the likelihood (independence ⇒ product):
L(p) = ∏ᵢ p^{xᵢ}(1−p)^{1−xᵢ}
Collect exponents using ∑ᵢ xᵢ = k and ∑ᵢ (1−xᵢ) = n−k:
L(p) = p^k (1−p)^{n−k}
Take logs to simplify:
ℓ(p) = log L(p)
= log(p^k (1−p)^{n−k})
= k log p + (n−k) log(1−p)
Differentiate (score) and set to zero (interior solution):
dℓ/dp = k·(1/p) + (n−k)·(−1/(1−p))
= k/p − (n−k)/(1−p)
Set dℓ/dp = 0:
k/p = (n−k)/(1−p)
Solve for p:
k(1−p) = p(n−k)
k − kp = pn − pk
k = pn
p̂ = k/n
Check it is a maximum (second derivative):
d²ℓ/dp² = −k/p² − (n−k)/(1−p)² < 0 for p ∈ (0,1)
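Because the second derivative is negative on all of (0,1), ℓ is strictly concave there, so for 0 < k < n the critical point p̂ = k/n is the unique global maximum. (If k = 0 or k = n, the maximum sits on the boundary, and p̂ = k/n still gives the right answer.)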
Insight: For Bernoulli data, the MLE equals the sample mean: p̂ = (1/n)∑ᵢ xᵢ. This is a recurring theme: MLE often matches intuitive “frequency” estimators.
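A brute-force sanity check of the result, using made-up data (7 heads in 10 flips): evaluate ℓ(p) on a grid and confirm the best grid point is k/n.

```python
import math

n, k = 10, 7    # hypothetical data: 7 heads in 10 flips

def log_likelihood(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

grid = [i / 100 for i in range(1, 100)]     # p = 0.01, ..., 0.99
best_p = max(grid, key=log_likelihood)
print(best_p, k / n)
```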
Poisson MLE: estimating a rate from counts #
Let x₁,…,xₙ be i.i.d. Poisson(λ), with λ > 0. Find the MLE λ̂.
Write the pmf for one observation:
f(xᵢ | λ) = e^{−λ} λ^{xᵢ} / xᵢ!
Likelihood:
L(λ) = ∏ᵢ [ e^{−λ} λ^{xᵢ} / xᵢ! ]
Simplify the product:
L(λ) = (∏ᵢ e^{−λ}) (∏ᵢ λ^{xᵢ}) / (∏ᵢ xᵢ!)
= e^{−nλ} λ^{∑ᵢ xᵢ} / (∏ᵢ xᵢ!)
Log-likelihood (the last term does not depend on λ, so it drops out when we differentiate):
ℓ(λ) = log L(λ)
= (−nλ) + (∑ᵢ xᵢ) log λ − ∑ᵢ log(xᵢ!)
Differentiate and set to zero:
dℓ/dλ = −n + (∑ᵢ xᵢ)/λ
Set dℓ/dλ = 0:
−n + (∑ᵢ xᵢ)/λ = 0
(∑ᵢ xᵢ)/λ = n
Solve:
λ̂ = (1/n)∑ᵢ xᵢ
Second derivative check:
d²ℓ/dλ² = −(∑ᵢ xᵢ)/λ² < 0 for λ > 0 (assuming not all xᵢ are 0)
So it’s a maximum.
Insight: Again, the MLE matches the sample mean. For Poisson, the mean equals λ, so the MLE is the natural plug-in estimator.
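The same kind of check for the Poisson case, with hypothetical counts: the log-likelihood at the sample mean is higher than at nearby rates. `math.lgamma(x + 1)` is used for log(x!).

```python
import math

counts = [2, 0, 3, 1, 4]    # hypothetical count data

def log_likelihood(lam):
    # lgamma(x + 1) equals log(x!)
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in counts)

lam_hat = sum(counts) / len(counts)     # closed-form MLE: the sample mean
for lam in (lam_hat - 0.5, lam_hat, lam_hat + 0.5):
    print(f"lambda = {lam:.1f}: log-likelihood = {log_likelihood(lam):.4f}")
```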
Normal MLE (μ known, σ² unknown): estimating variance carefully #
Let x₁,…,xₙ be i.i.d. Normal(μ, σ²). Assume μ is known. Find the MLE for σ².
Write the pdf:
f(xᵢ | σ²) = (1/√(2πσ²)) exp(−(xᵢ−μ)² / (2σ²))
Likelihood:
L(σ²) = ∏ᵢ (1/√(2πσ²)) exp(−(xᵢ−μ)² / (2σ²))
Log-likelihood:
ℓ(σ²)
= ∑ᵢ [ −(1/2)log(2πσ²) − (xᵢ−μ)²/(2σ²) ]
= −(n/2)log(2πσ²) − (1/(2σ²))∑ᵢ (xᵢ−μ)²
Differentiate w.r.t. σ²:
dℓ/d(σ²)
= −(n/2)·(1/σ²) + (1/2)(∑ᵢ (xᵢ−μ)²)·(1/(σ²)²)
Explanation: derivative of −(1/(2σ²))S is +(1/2)S·(1/(σ²)²), where S = ∑ᵢ (xᵢ−μ)²
Set derivative to zero:
−(n/2)(1/σ²) + (1/2)S(1/(σ²)²) = 0
Multiply both sides by 2(σ²)² to clear fractions:
−nσ² + S = 0
Solve:
σ̂²_MLE = S/n = (1/n)∑ᵢ (xᵢ−μ)²
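Second derivative check: d²ℓ/d(σ²)² = n/(2(σ²)²) − S/(σ²)³. At σ² = S/n this equals −n/(2(σ̂²)²) < 0, so the critical point is a maximum (assuming S > 0).
Insight: The MLE for σ² divides by n, not n−1. With μ known, S/n is in fact unbiased. The 1/(n−1) correction only enters when μ is unknown and replaced by the sample mean x̄: the MLE is then (1/n)∑ᵢ (xᵢ−x̄)², which is biased downward, while the 1/(n−1) version is the unbiased sample variance. MLE prioritizes likelihood maximization, not unbiasedness.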
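A short numerical check of the variance result, assuming a known mean μ = 5 and a handful of made-up measurements: the log-likelihood, viewed as a function of σ², peaks at S/n.

```python
import math

mu = 5.0                         # known mean (assumed for this sketch)
xs = [4.0, 6.0, 5.5, 4.5]        # hypothetical measurements

def log_likelihood(var):
    n = len(xs)
    s = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * var) - s / (2 * var)

var_hat = sum((x - mu) ** 2 for x in xs) / len(xs)   # S / n
for var in (0.5, var_hat, 0.75):
    print(f"sigma^2 = {var:.3f}: log-likelihood = {log_likelihood(var):.4f}")
```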
Key Takeaways #
✓
The likelihood L(θ) is the joint pmf/pdf of the observed data, viewed as a function of θ with the data fixed.
✓
The MLE is θ̂ = argmax_θ L(θ); in practice we maximize ℓ(θ) = log L(θ) because it turns products into sums.
✓
For i.i.d. data, ℓ(θ) = ∑ᵢ log f(xᵢ | θ), which is computationally and conceptually convenient.
✓
Interior optima satisfy the score equation: ∇_θ ℓ(θ̂) = 0; then you must verify it’s a maximum (curvature) or check boundaries.
✓
Many familiar estimators are MLEs (e.g., Bernoulli p̂ = sample mean; Poisson λ̂ = sample mean).
✓
Many ML loss functions are negative log-likelihoods (cross-entropy, squared error under Gaussian noise).
✓
MLE depends on the assumed model family; it returns the best fit within that family, even if the family is misspecified.
Common Mistakes #
✗
Treating L(θ) as a probability distribution over θ (it is not); only in Bayesian inference do we form p(θ | data).
✗
Forgetting parameter constraints (e.g., p ∈ [0,1], σ² > 0) and missing boundary maxima.
✗
Setting the score to zero and stopping—without checking whether the critical point is a maximum (second derivative/Hessian) or whether multiple maxima exist.
✗
Confusing the MLE variance formula for a Normal with unknown mean (divide by n) with the unbiased sample variance (divide by n−1).
Practice #
easy
Uniform(0, θ) MLE: Suppose x₁,…,xₙ are i.i.d. Uniform(0, θ) with θ > 0. Derive the MLE θ̂.
Hint: Write f(x|θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise. The likelihood is zero if θ is smaller than any observed value.
Solution:
For one observation: f(xᵢ|θ) = 1/θ if 0 ≤ xᵢ ≤ θ, else 0.
Likelihood:
L(θ) = ∏ᵢ (1/θ) · 𝟙{xᵢ ≤ θ}
= θ^{−n} · 𝟙{maxᵢ xᵢ ≤ θ}.
If θ < max xᵢ, then L(θ)=0. For θ ≥ m where m = max xᵢ, L(θ)=θ^{−n}, which decreases as θ increases.
So the maximum occurs at the smallest feasible θ, i.e. θ̂ = maxᵢ xᵢ.
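The shape of this likelihood is easy to see numerically. In the sketch below (hypothetical observations), the likelihood is zero for any θ below the largest observation and strictly decreasing above it, so no derivative condition is involved.

```python
xs = [0.8, 2.5, 1.7]     # hypothetical observations

def likelihood(theta):
    # theta^(-n) when every observation fits inside [0, theta], otherwise 0.
    if theta <= 0 or max(xs) > theta:
        return 0.0
    return theta ** (-len(xs))

theta_hat = max(xs)
for theta in (2.4, theta_hat, 2.6, 4.0):
    print(theta, likelihood(theta))
```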
medium
Normal(μ, σ²) MLE for μ when σ² is known: Given x₁,…,xₙ i.i.d. Normal(μ, σ²) with σ² known, derive μ̂.
Hint: Write ℓ(μ) and differentiate. The exponent contains ∑ᵢ (xᵢ−μ)².
Solution:
Log-likelihood (dropping constants not involving μ):
ℓ(μ) = −(1/(2σ²))∑ᵢ (xᵢ−μ)².
Differentiate:
dℓ/dμ = −(1/(2σ²))∑ᵢ 2(xᵢ−μ)(−1)
= (1/σ²)∑ᵢ (xᵢ−μ)
Set to zero:
∑ᵢ (xᵢ−μ) = 0
∑ᵢ xᵢ − nμ = 0
μ̂ = (1/n)∑ᵢ xᵢ.
Second derivative is −n/σ² < 0, so it’s a maximum.
medium
Boundary case for Bernoulli: You observe x₁,…,xₙ all equal to 1 (all successes). What is the MLE for p? Explain why the score equation approach can be misleading here.
Hint: Write ℓ(p) = n log p when k = n, and remember p must be in [0,1].
Solution:
If all xᵢ = 1, then k = n.
Likelihood: L(p)=p^n.
Log-likelihood: ℓ(p)=n log p, which increases as p increases on (0,1].
Thus the MLE is the boundary point p̂ = 1.
Why score can mislead: dℓ/dp = n/p, which never equals 0 for p ∈ (0,1]. The maximum is not an interior critical point; it occurs at the boundary, so the score=0 condition does not apply.
Connections #
Next nodes you can unlock and why they connect:
- •Machine Learning Introduction: Many ML algorithms are posed as maximizing likelihood or minimizing negative log-likelihood.
- •Bayesian Inference: Bayes’ rule uses the likelihood p(data | θ) as a core component; MLE is a useful baseline, and it coincides with the posterior mode (MAP estimate) when the prior is flat.
- •Logistic Regression: Logistic regression is typically fit by MLE; its cross-entropy objective is the Bernoulli negative log-likelihood.
- •KL Divergence: Expected negative log-likelihood relates to cross-entropy and KL; MLE can be viewed as choosing parameters that minimize KL to the true distribution (under conditions).
- •Confidence Intervals: Curvature of the log-likelihood around θ̂ underpins standard errors and interval estimates.