Second-Order Optimization

←Back to Tech Tree

inventorycoverage

Second-Order Optimization #

OptimizationDifficulty: ★★★★★Depth: 9Unlocks: 0

Newton's method, BFGS, L-BFGS. Using Hessian information.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

H - Hessian matrix (matrix of second derivatives)B - inverse-Hessian approximation (matrix used by quasi-Newton methods)

Essential Relationships #

Prerequisites (2) #

Matrix Calculus6 atomsConvex Optimization5 atoms

Advanced Learning Details

Graph Position #

94

Depth Cost

0

Fan-Out (ROI)

0

Bottleneck Score

9

Chain Length

Cognitive Load #

6

Atomic Elements

69

Total Elements

L4

Percentile Level

L4

Atomic Level

All Concepts (30) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Gradient methods know which way is downhill. Second-order methods also know how the ground curves—so they can choose steps that are both directional and properly scaled.

TL;DR:

Second-order optimization uses curvature information (the Hessian H or an approximation to H⁻¹) to build a local quadratic model and compute a step that is often far more scale-aware than plain gradient descent. Newton’s method uses H directly (solve H p = −∇f). Quasi-Newton methods (BFGS, L-BFGS) avoid forming H by updating an inverse-Hessian approximation B using gradient differences, giving many of Newton’s benefits at much lower cost.

What Is Second-Order Optimization? #

Why you should care (motivation before formulas) #

First-order methods (gradient descent, momentum, Adam) use ∇f(x) to pick a direction of improvement. But they do not know how the objective bends. If the landscape is stretched—steep in one direction and flat in another—then a single global learning rate is a compromise: small enough not to explode in steep directions, but then painfully slow in flat directions.

Second-order methods solve this by using curvature: they estimate how f changes for a whole step vector p, not just for an infinitesimal move. This usually yields steps that are automatically scaled per-direction. In well-behaved problems, this can turn long, zig-zagging trajectories into short, decisive moves.

The central idea: local quadratic modeling #

Second-order methods are built on the second-order Taylor approximation around a point x:

f(x + p) ≈ f(x) + ∇f(x)ᵀ p + ½ pᵀ H(x) p

This approximation says: “near x, the function looks like a quadratic bowl (or saddle).” If we trust that local quadratic, we can choose p by minimizing the quadratic model.

Minimizing the quadratic model gives the Newton step #

Define the quadratic model:

m(p) = f(x) + ∇f(x)ᵀ p + ½ pᵀ H(x) p

To minimize m(p), differentiate with respect to p and set to zero:

∇ₚ m(p) = ∇f(x) + H(x) p = 0

So the step solves:

H(x) p = −∇f(x)

If H is invertible:

p = −H(x)⁻¹ ∇f(x)

This is the Newton step. It is not “just a better direction”; it is a direction and a scale determined by curvature.

When Newton is amazing (and when it isn’t) #

Newton can converge extremely fast near a minimizer—often quadratically: the number of correct digits can roughly double each iteration in ideal settings. That’s why Newton-like methods are central in classical optimization.

But Newton also has costs and pitfalls:

Quasi-Newton methods (BFGS, L-BFGS) keep the good idea—use curvature to scale steps—while avoiding explicit Hessians.

Key symbols in this node #

In practice, many algorithms compute the step as:

p = −B ∇f(x)

with B updated cheaply from gradient information.

Core Mechanic 1: Newton’s Method via the Second-Order Taylor Model #

Why Newton’s method is the “natural” second-order step #

If you accept the local quadratic approximation as your model, then the Newton step is simply the optimizer of that model. There’s no extra heuristic—Newton is what happens when you take the quadratic seriously.

Derivation with breathing room #

Start from the second-order expansion at x:

f(x + p) ≈ f(x) + ∇f(x)ᵀ p + ½ pᵀ H(x) p

Let g = ∇f(x) and H = H(x) for short. Then:

m(p) = f(x) + gᵀ p + ½ pᵀ H p

Take the gradient with respect to p:

∇ₚ m(p) = g + H p

Set to zero at optimum:

g + H p = 0

Solve:

H p = −g

So the update is:

xₖ₊₁ = xₖ + p

Newton as “whitened” gradient descent #

Gradient descent uses p = −α g. Newton uses p = −H⁻¹ g.

You can interpret H⁻¹ as a preconditioner: it rescales the gradient by curvature. If the function is steep in one direction, H has large eigenvalues there, and H⁻¹ shrinks the step along that direction. If it is flat, H⁻¹ expands the step.

More explicitly, if H = Q Λ Qᵀ (eigendecomposition with orthonormal Q), then:

H⁻¹ = Q Λ⁻¹ Qᵀ

and

p = −Q Λ⁻¹ Qᵀ g

So each eigen-direction is scaled by 1/λᵢ.

Descent conditions: when is the Newton step safe? #

A direction p is a descent direction if gᵀ p < 0.

For Newton:

p = −H⁻¹ g

Compute gᵀ p:

gᵀ p = gᵀ (−H⁻¹ g) = −gᵀ H⁻¹ g

If H is symmetric positive definite (SPD), then H⁻¹ is SPD and gᵀ H⁻¹ g > 0 for any nonzero g. Therefore gᵀ p < 0: Newton is a descent direction.

If H is indefinite (has negative eigenvalues), then gᵀ H⁻¹ g can be negative, and the Newton step can ascend.

Damping: line search with Newton directions #

Even with SPD H, the quadratic model may not be accurate far from x. A standard fix is to take:

xₖ₊₁ = xₖ + αₖ p

where pₖ solves Hₖ pₖ = −gₖ, and αₖ is found via line search (e.g., Armijo/Wolfe conditions). This is often called damped Newton.

Trust regions: constrain the step instead of shrinking it #

Another fix is to trust the quadratic model only within a region ‖p‖ ≤ Δ:

minimize m(p) subject to ‖p‖ ≤ Δ

This changes the step when Newton would take something too large or unstable. Trust-region Newton methods are especially robust when H is indefinite.

Computational reality: you rarely invert H #

You never compute H⁻¹ explicitly. You solve the linear system:

H p = −g

using:

The solve is the “real cost” of Newton.

Newton vs gradient descent (clarity table) #

AspectGradient descentNewton’s method
Uses∇f∇f and H
Stepp = −α ∇fSolve H p = −∇f
Sensitivity to scalingHighLow (often scale-invariant under affine transforms)
Per-iteration costLowHigh
Convergence near optimumLinear (typical)Quadratic (ideal conditions)
Failure modesSlow in ill-conditioned valleysNon-descent if H indefinite; expensive

Where Newton shines #

In large-scale ML (millions of parameters), full Newton is typically too expensive—this is where quasi-Newton, L-BFGS, and approximations enter.

Core Mechanic 2: Quasi-Newton Methods (BFGS and L-BFGS) as Low-Cost Curvature Learning #

Why quasi-Newton exists #

Newton’s method asks for H and a linear solve. Quasi-Newton methods try to approximate the effect of H⁻¹ without ever forming H.

The guiding question is:

Can we learn a good linear map B that turns gradients into near-Newton steps?

That is, we want:

p = −B g

with B ≈ H⁻¹.

The secant condition (the core constraint) #

Let xₖ₊₁ = xₖ + sₖ where sₖ = xₖ₊₁ − xₖ.

Let gₖ = ∇f(xₖ) and gₖ₊₁ = ∇f(xₖ₊₁). Define:

yₖ = gₖ₊₁ − gₖ

For twice-differentiable f, for small steps:

yₖ ≈ Hₖ s

A quasi-Newton method chooses an approximation (either to H or to H⁻¹) that satisfies the secant equation:

H̃ₖ₊₁ sₖ = y

or in inverse form:

Bₖ₊₁ yₖ = s

This is the multi-dimensional analogue of matching slopes in 1D.

BFGS: updating the inverse-Hessian approximation #

BFGS maintains Bₖ ≈ Hₖ⁻¹ and updates it using s and y.

Define ρ = 1 / (ys).

Assuming ys > 0 (curvature condition), the BFGS inverse update is:

Bₖ₊₁ = (I − ρ s yᵀ) Bₖ (I − ρ y sᵀ) + ρ s s

Key properties:

Intuition: what BFGS is really doing #

B is a learned preconditioner.

BFGS accumulates this information over iterations, gradually sculpting B to match the inverse curvature.

Why the curvature condition matters (and how it is enforced) #

The condition ys > 0 ensures positive curvature along s. With line searches satisfying the (strong) Wolfe conditions, one can guarantee ys > 0 for smooth functions.

If ys ≤ 0, BFGS may lose SPD and produce non-descent steps. Practical implementations may:

Cost of full BFGS #

B is n×n. Storing and updating it costs O(n²) memory and O(n²) time per iteration—often too expensive for large n.

L-BFGS: limited-memory BFGS #

L-BFGS avoids storing B. Instead, it stores only the last m pairs {(sᵢ, yᵢ)} for i = k−m+1,…,k.

To compute p = −Bₖ gₖ, L-BFGS uses the two-loop recursion, which applies the implicit Bₖ to a vector in O(mn) time and O(mn) memory.

Two-loop recursion (conceptual form) #

Given current gradient g, set q = g.

Loop backwards over stored pairs:

Apply an initial scaling H₀ (often γ I) to get r:

Loop forwards:

Then the search direction is:

p = −r

Even though no n×n matrix is stored, the resulting p behaves like a quasi-Newton direction.

Choosing the initial scaling H₀ #

A common choice is:

γₖ = (sₖ₋₁ᵀ yₖ₋₁) / (yₖ₋₁ᵀ yₖ₋₁)

and H₀ = γₖ I.

This roughly matches average curvature magnitude.

Comparing Newton / BFGS / L-BFGS #

MethodUses H explicitly?Stores n×n matrix?Typical per-iter costStrengthsWeaknesses
NewtonYesOften yes (or factorization)High (solve)Fast local convergence, accurate stepsExpensive; needs robust globalization
BFGSNoYes (B)O(n²)Very effective for smooth medium-scale problemsMemory/time blow up for large n
L-BFGSNoNo (stores m pairs)O(mn)Scales to large n; strong practical defaultNeeds line search; can be sensitive/noisy gradients

Practical note for ML #

Quasi-Newton methods typically assume deterministic or low-noise gradients. In deep learning with mini-batch noise, L-BFGS can still be used but often requires care (larger batches, stronger line search heuristics, or stochastic quasi-Newton variants).

The conceptual win remains: approximate H⁻¹ well enough to get curvature-aware steps without paying Newton’s full price.

Application & Connection: Globalization (Line Search / Trust Region), Indefinite Hessians, and Large-Scale Practice #

Why “globalization” is necessary #

Second-order methods are fundamentally local: they rely on the Taylor approximation being accurate near the current iterate. Far from the solution, that approximation can be wrong in two ways:

  1. 1)The quadratic model predicts improvement where the true function increases.
  2. 2)The Hessian may be indefinite, reflecting saddles or nonconvex curvature.

Globalization strategies are the standard tools that make Newton-like methods robust in practice.

Line search with (quasi-)Newton directions #

Given a direction p (Newton, BFGS, L-BFGS), pick α to ensure sufficient decrease:

xₖ₊₁ = xₖ + α p

A common sufficient decrease condition (Armijo) is:

f(x + α p) ≤ f(x) + c₁ α ∇f(x)ᵀ p

with 0 < c₁ < 1.

To also ensure curvature information behaves well (especially for BFGS), enforce Wolfe (or strong Wolfe) conditions, which include:

∇f(x + α p)ᵀ p ≥ c₂ ∇f(x)ᵀ p

with c₁ < c₂ < 1.

These conditions help guarantee ys > 0, preserving SPD updates in BFGS/L-BFGS.

Trust region viewpoint (often more robust for Newton) #

Instead of searching along a line, trust region methods solve:

minimize gᵀ p + ½ pᵀ H p

subject to ‖p‖ ≤ Δ

If the Newton step lies inside the region, take it. If not, take a boundary step that balances descent and stability.

In nonconvex problems, trust regions naturally handle indefinite H by restricting step size and allowing directions of negative curvature to be exploited safely.

Indefinite Hessians and modified Newton steps #

When H is not SPD, you may:

Damping connects to the idea that you want a model that is convex enough locally to provide a descent direction.

Large-scale settings: Hessian-vector products and Newton-CG #

Even if you cannot store H, you may be able to compute H v efficiently (automatic differentiation can do this). Then you can approximately solve:

H p = −g

using conjugate gradient (CG), which needs only matrix-vector products.

This yields Newton-CG (also called truncated Newton). It can be seen as a spectrum:

When to use what (rule-of-thumb table) #

ScenarioRecommended toolWhy
Small n, high accuracy, smooth convexNewton + line search / trust regionFast convergence, stable
Medium n (10³–10⁵), smooth deterministicL-BFGSExcellent cost/benefit
Large n, can do H v, need high precisionNewton-CG / trust region Newton-CGUses curvature without storing H
Noisy mini-batch gradientsFirst-order (Adam/SGD) or specialized stochastic quasi-NewtonClassical line searches can struggle

Conceptual connection: second-order methods as “learning geometry” #

A deep way to view these methods is that they choose steps using a local metric:

That perspective connects second-order optimization to:

You now have the two atomic concepts for this node:

  1. second-order quadratic modeling (Taylor) ⇒ Newton step via solving H p = −g

  2. quasi-Newton approximation B ≈ H⁻¹ learned via secant information (s, y) ⇒ BFGS/L-BFGS

From here, you can build toward robust implementations (line search, trust region) and scalable solvers (CG, Hessian-vector products).

Worked Examples (3) #

Newton step on a quadratic: one iteration to the minimizer #

Minimize f(x) = ½ xᵀ A xbx, where A is SPD.

Let A = [[4, 0],[0, 1]] and b = [8, 2]ᵀ. Start at x₀ = [0, 0]ᵀ.

Compute one Newton step.

  1. Compute gradient and Hessian:

    ∇f(x) = A xb

    H = ∇²f(x) = A

  2. At x₀ = 0:

    g₀ = ∇f(x₀) = A 0b = −b = [−8, −2]ᵀ

  3. Newton step solves:

    H p = −g₀

    A p = −(−b) = b

  4. Solve A p = b:

    [[4,0],[0,1]] [p₁,p₂]ᵀ = [8,2]ᵀ

    So:

    4 p₁ = 8 ⇒ p₁ = 2

    1 p₂ = 2 ⇒ p₂ = 2

    Therefore p = [2,2]ᵀ

  5. Update:

    x₁ = x₀ + p = [0,0]ᵀ + [2,2]ᵀ = [2,2]ᵀ

  6. Verify optimality (for a quadratic with SPD A, optimum is A x⋆ = b):

    A x₁ = [[4,0],[0,1]] [2,2]ᵀ = [8,2]ᵀ = b

    So x₁ = x

Insight: For strictly convex quadratics, Newton’s method reaches the minimizer in one step from any start because the second-order Taylor model is exact (the function really is quadratic).

Newton on a non-quadratic 1D function: see the Taylor model drive the update #

Minimize f(x) = x⁴. Start at x₀ = 1. Compute two Newton iterations.

(Here ∇f = f′ and H = f″.)

  1. Compute derivatives:

    f′(x) = 4x³

    f″(x) = 12x²

  2. Newton update in 1D:

    xₖ₊₁ = xₖ − f′(xₖ)/f″(xₖ)

  3. At x₀ = 1:

    f′(1) = 4

    f″(1) = 12

    So:

    x₁ = 1 − 4/12 = 1 − 1/3 = 2/3

  4. At x₁ = 2/3:

    f′(2/3) = 4 (2/3)³ = 4 (8/27) = 32/27

    f″(2/3) = 12 (2/3)² = 12 (4/9) = 16/3

  5. Compute step:

    x₂ = x₁ − (32/27)/(16/3)

    = 2/3 − (32/27)·(3/16)

    = 2/3 − (96)/(432)

    = 2/3 − 2/9

    = 4/9

Insight: Newton automatically reduces step sizes near flat curvature: as x → 0, f″(x) → 0 too, so the ratio f′/f″ behaves like (4x³)/(12x²) = x/3, giving xₖ₊₁ = (2/3) xₖ. Here convergence is linear because the minimizer is not strongly convex (H at 0 is 0).

One BFGS inverse update from observed step and gradient change #

Suppose at iteration k you have Bₖ = I (2×2 identity).

You took s = [1, 0]ᵀ and observed gradient change y = [2, 0]ᵀ.

Compute Bₖ₊₁ using the BFGS inverse update.

  1. Compute ρ:

    ρ = 1/(ys) = 1/([2,0]·[1,0]) = 1/2

  2. Write the update:

    Bₖ₊₁ = (I − ρ s yᵀ) Bₖ (I − ρ y sᵀ) + ρ s s

  3. Compute matrices:

    s yᵀ = [[1],[0]] [2,0] = [[2,0],[0,0]]

    So:

    I − ρ s yᵀ = I − (1/2) [[2,0],[0,0]] = [[0,0],[0,1]]

  4. y sᵀ = [[2],[0]] [1,0] = [[2,0],[0,0]]

    So:

    I − ρ y sᵀ = [[0,0],[0,1]] as well

  5. Since Bₖ = I:

    (I − ρ s yᵀ) Bₖ (I − ρ y sᵀ)

    = [[0,0],[0,1]] I [[0,0],[0,1]]

    = [[0,0],[0,1]]

  6. Compute final term:

    ρ s sᵀ = (1/2) [[1],[0]] [1,0] = (1/2) [[1,0],[0,0]] = [[1/2,0],[0,0]]

  7. Add them:

    Bₖ₊₁ = [[0,0],[0,1]] + [[1/2,0],[0,0]] = [[1/2,0],[0,1]]

Insight: This update learned that curvature along the first coordinate is about 2 (since y ≈ H s), so the inverse curvature along that direction should be about 1/2. The second coordinate was untouched, so it stayed at 1.

Key Takeaways #

Common Mistakes #

Practice #

easy

For f(x) = ½ xᵀ A x with A = [[10,0],[0,1]], compare the gradient descent direction −g and the Newton direction −H⁻¹ g at a general point x. What is the qualitative difference in how they scale the two coordinates?

Hint: Compute g = A x and H = A. Then −H⁻¹ g simplifies a lot.

Show solution

We have g = ∇f(x) = A x and H = A.

Gradient descent direction (ignoring step size α):

−g = −A x = [−10 x₁, −1 x₂]ᵀ

So it is 10× larger in magnitude along coordinate 1.

Newton direction:

−H⁻¹ g = −A⁻¹ (A x) = −x = [−x₁, −x₂]ᵀ

So Newton cancels the anisotropic scaling of A and treats both coordinates equally (invariant to that scaling).

medium

Show that if H is SPD, then the Newton direction p solves gᵀ p < 0 whenever g ≠ 0.

Hint: Use p = −H⁻¹ g and the fact that vᵀ M v > 0 for SPD M and v ≠ 0.

Show solution

If H is SPD, then H⁻¹ is also SPD.

Newton direction: p = −H⁻¹ g.

Then:

gᵀ p = gᵀ (−H⁻¹ g) = −gᵀ H⁻¹ g.

Since H⁻¹ is SPD and g ≠ 0, we have gᵀ H⁻¹ g > 0.

Therefore gᵀ p < 0, so p is a descent direction.

hard

Given Bₖ = I, s = [1,1]ᵀ, y = [3,1]ᵀ, compute ρ = 1/(ys) and verify the secant condition Bₖ₊₁ y = s for the BFGS inverse update (you may compute Bₖ₊₁ explicitly or reason using known properties).

Hint: First compute ys. Then either plug into the BFGS formula or use the theorem that the BFGS inverse update satisfies the secant equation by construction (provided ys ≠ 0).

Show solution

Compute ys = [3,1]·[1,1] = 4, so ρ = 1/4.

For BFGS inverse update:

Bₖ₊₁ = (I − ρ s yᵀ) Bₖ (I − ρ y sᵀ) + ρ s sᵀ.

With Bₖ = I, the update is well-defined since ys = 4 ≠ 0.

A defining property of the BFGS inverse update is that it satisfies the secant condition:

Bₖ₊₁ y = s.

Therefore, for this (s, y) pair, Bₖ₊₁ y must equal [1,1]ᵀ.

(If you compute Bₖ₊₁ explicitly, you will indeed find a symmetric matrix that maps y to s, confirming the condition.)

Connections #

Next nodes you can study:

Quality: A (4.5/5)

← back to treebrowse all →