Gradients #

CalculusDifficulty: ★★★☆☆Depth: 5Unlocks: 66

Vector of partial derivatives. Direction of steepest ascent.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

-Gradient is the vector of partial derivatives of a scalar function (one component per variable).
-Gradient points in the direction of steepest ascent; its magnitude equals the maximal rate of increase.
-Gradient is orthogonal (normal) to the function's level sets (contours/surfaces).

Key Symbols & Notation #

nabla (del) operator written as 'nabla f' (aka 'grad f')

Essential Relationships #

-Directional derivative in unit direction u equals the dot product of the gradient with u: D_u f = (grad f) dot u.

Prerequisites (2) #

Multivariable Calculus6 atoms Vectors Introduction6 atoms

Unlocks (7) #

Optimization Introductionlvl 3 Gradient Descentlvl 3 Convex Functionslvl 3 Jacobianlvl 3 Matrix Calculuslvl 4 Lagrange Multiplierslvl 4 Multivariable Chain Rulelvl 3

Advanced Learning Details

Graph Position #

Depth Cost

Fan-Out (ROI)

Bottleneck Score

Chain Length

Cognitive Load #

Atomic Elements

Total Elements

Percentile Level

Atomic Level

All Concepts (10) #

- Gradient as a single object: the gradient ∇f of a scalar multivariable function is a vector formed from its partial derivatives
- Gradient as a vector field: at each point in the domain the gradient assigns a vector (depends on position)
- Direction of steepest ascent: the gradient vector at a point indicates the direction in which the function increases most rapidly
- Magnitude as maximal rate: the length (norm) of the gradient equals the maximal instantaneous rate of increase of the function
- Steepest descent: the negative of the gradient gives the direction of steepest decrease
- Gradient orthogonality to level sets: the gradient at a point is perpendicular (normal) to the level set (contour) of the function through that point
- Directional derivative via gradient: directional derivatives can be computed using the gradient and a direction vector
- Stationary/critical points via gradient: points where the gradient equals the zero vector are candidates for local extrema or saddle points
- First-order linear approximation (differential) using the gradient: the gradient defines the best linear approximation to the function at a point
- Use in iterative optimization: the gradient provides the update direction in gradient-based optimization methods (e.g., gradient descent/ascent)

Teaching Strategy #

Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.

When you change a single-variable function f(x), the derivative f′(x) tells you “which way is uphill” and “how steep.” For a function of many variables f(x, y, z, …), the gradient ∇f plays that role: it’s the single object that tells you the uphill direction in the full space.

TL;DR:

The gradient ∇f(x) is the vector of partial derivatives. It points in the direction of steepest increase of f, its magnitude ‖∇f‖ is the maximum directional rate of increase, and it is perpendicular (normal) to level sets f(x) = c.

What Is Gradient? (∇f) — Definition with Intuition #

Why we need a new “derivative” in many dimensions #

In 1D, the derivative f′(x) answers two questions at once:

Direction: If f′(x) > 0, moving right increases f; if f′(x) < 0, moving right decreases f.
Rate: |f′(x)| tells how quickly f changes per unit step.

In multiple dimensions, you can move in infinitely many directions. A single number can’t encode “uphill direction” anymore. We need a vector that:

•has one component per input variable
•changes predictably with direction
•recovers the ordinary derivative when we move along a line

That vector is the gradient.

Definition #

Let f: ℝⁿ → ℝ be a differentiable scalar function, and let x = (x₁, …, xₙ). The gradient of f at x is

∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ).

We will write gradients as vectors (bold): ∇f(x) is a vector in ℝⁿ.

What each component means #

The i-th component ∂f/∂xᵢ is the slope of f if you move only along the xᵢ axis, holding all other variables constant.

So the gradient collects all “axis-aligned slopes” into a single object. But the real power is that it also tells you what happens when you move in any direction, not just along coordinate axes.

A concrete picture (2D) #

For f(x, y):

•∂f/∂x tells how f changes if you nudge x while keeping y fixed.
•∂f/∂y tells how f changes if you nudge y while keeping x fixed.

Then

∇f(x, y) = (∂f/∂x, ∂f/∂y).

At a point (x, y), this vector points “uphill” on the surface z = f(x, y). But to make that statement precise, we need directional derivatives.

Notation: nabla (∇) #

The symbol ∇ (“nabla” or “del”) behaves like a vector of partial-derivative operators:

∇ = (∂/∂x₁, …, ∂/∂xₙ).

Applying it to a scalar function f gives ∇f.

Units sanity check #

If f has units (say, meters) and xᵢ has units (say, seconds), then ∂f/∂xᵢ has units meters/second. So ∇f is a vector of “rates per coordinate unit.” This becomes important when different coordinates have different scales (a common source of mistakes in optimization).

Core Mechanic 1: Directional Derivatives and “Steepest Ascent” #

Why directional derivatives matter #

Partial derivatives tell you slopes along coordinate axes. But most interesting motion is not axis-aligned: you might move diagonally in (x, y), or along an arbitrary direction in ℝⁿ.

The directional derivative formalizes “rate of change if I move in direction u.”

Directional derivative definition #

Let u be a unit vector (‖u‖ = 1). The directional derivative of f at x along u is

D_u f(x) = lim_{h→0} [f(x + hu) − f(x)] / h.

This is just the ordinary derivative of the 1D function g(h) = f(x + hu) at h = 0.

The key identity: directional derivative is a dot product #

If f is differentiable, then

D_u f(x) = ∇f(x) · u.

This is the bridge between “vector of partial derivatives” and “change in any direction.”

Showing the work (derivation sketch in ℝ²) #

Let f(x, y) be differentiable and u = (u₁, u₂) with ‖u‖ = 1. Consider

g(h) = f(x + h u₁, y + h u₂).

By the chain rule,

g′(h) = (∂f/∂x)(x + h u₁, y + h u₂)·u₁ + (∂f/∂y)(x + h u₁, y + h u₂)·u₂.

At h = 0,

D_u f(x, y) = g′(0)

= (∂f/∂x)(x, y)·u₁ + (∂f/∂y)(x, y)·u₂

= (∂f/∂x, ∂f/∂y) · (u₁, u₂)

= ∇f(x, y) · u.

The same reasoning extends to ℝⁿ.

Steepest ascent: why the gradient points uphill #

Now that directional change is a dot product, we can ask:

Among all unit directions u (‖u‖ = 1), which maximizes D_u f(x) = ∇f(x) · u?

Use the Cauchy–Schwarz inequality:

∇f · u ≤ ‖∇f‖‖u‖ = ‖∇f‖.

Because ‖u‖ = 1, we get

D_u f(x) ≤ ‖∇f(x)‖.

Equality holds exactly when u points in the same direction as ∇f:

u = ∇f / ‖∇f‖ (when ∇f ≠ 0).

So:

•Direction of steepest ascent: ∇f(x)
•Maximum rate of increase: ‖∇f(x)‖
•Direction of steepest descent: −∇f(x)

What about critical points? #

If ∇f(x) = 0, then every directional derivative is 0:

D_u f(x) = ∇f(x) · u = 0.

That doesn’t automatically mean a maximum or minimum (it could be a saddle). But it does mean the function has no first-order preference for any direction at that point.

Local linear approximation (the “first-order model”) #

Differentiability implies a local linear model:

f(x + h) ≈ f(x) + ∇f(x) · h (for small h).

This is the multivariable analog of

f(x + h) ≈ f(x) + f′(x)h.

This approximation is what makes gradients so useful in optimization: you can predict how f changes under a small step h, then choose h to increase or decrease f.

Core Mechanic 2: Gradients Are Orthogonal to Level Sets #

Why level sets are the right geometric object #

For f: ℝ² → ℝ, you often visualize the function using contours (level curves):

f(x, y) = c.

For f: ℝ³ → ℝ, you get level surfaces:

f(x, y, z) = c.

These describe “where the function value stays constant.” If you walk along a contour, f does not change.

So the key question becomes:

If I move tangent to a level set, why does the gradient end up perpendicular to that motion?

Tangent directions have zero directional derivative #

Suppose you are at a point x on the level set f(x) = c. Consider a small step t that stays on the level set (to first order). Along such a step, f doesn’t change, so the directional derivative in that tangent direction should be 0.

Using the dot-product identity:

0 = D_u f(x) = ∇f(x) · u

for any unit tangent direction u.

A vector whose dot product with every tangent direction is 0 must be normal to the level set. Therefore:

∇f(x) is orthogonal to the level set f(x) = c at x (when ∇f(x) ≠ 0).

Showing the work with an implicit curve (ℝ²) #

Let the level curve be given implicitly by f(x, y) = c. Suppose it can be parameterized as (x(t), y(t)) staying on the curve. Then

f(x(t), y(t)) = c.

Differentiate both sides with respect to t:

d/dt f(x(t), y(t)) = 0.

Apply the chain rule:

(∂f/∂x)·x′(t) + (∂f/∂y)·y′(t) = 0.

Rewrite in dot-product form:

(∂f/∂x, ∂f/∂y) · (x′(t), y′(t)) = 0

So:

∇f · r′(t) = 0,

meaning ∇f is perpendicular to the tangent vector r′(t).

Practical consequence: contour plots + gradient arrows #

If you see a contour map of f(x, y):

•The gradient points across the contours, not along them.
•Where contours are close together, ‖∇f‖ is large (steeper change).

This is a powerful intuition for optimization: to decrease f fastest, move opposite the gradient, cutting across contours as directly as possible.

The gradient as a normal vector for constraints #

This orthogonality is also the reason gradients appear in constrained optimization. If you constrain x to live on a level set g(x) = 0, then ∇g(x) is normal to the constraint surface. At an optimum, the objective’s gradient must align with the constraint’s normal direction (Lagrange multipliers formalize this).

Summary of the geometric picture #

You can hold three ideas in your head simultaneously:

∇f is built from partial derivatives.
∇f gives first-order change: f(x + h) ≈ f(x) + ∇f · h.
Level sets are perpendicular to ∇f, and the steepest-ascent direction is ∇f itself.

These are not separate facts—they are the same fact seen through different lenses.

Application / Connection: Optimization, ML, and Beyond #

Why gradients are the workhorse of optimization #

Most optimization problems can be phrased as

minimize f(x) or maximize f(x).

The gradient tells you the direction that increases f most. So to decrease f, you go the other way.

A basic iterative method is gradient descent:

xₖ₊₁ = xₖ − α ∇f(xₖ),

where α > 0 is the step size (learning rate).

This update is directly motivated by the linear approximation:

f(x + h) ≈ f(x) + ∇f(x) · h.

If you pick h = −α ∇f, then

∇f · h = ∇f · (−α ∇f) = −α ‖∇f‖² ≤ 0,

so the first-order prediction says the objective should go down (for small enough α).

Gradients in machine learning #

In supervised learning, you often minimize a loss function like

L(w) = (1/m) ∑ᵢ ℓ(ŷᵢ(w), yᵢ),

where w are model parameters.

The gradient ∇L(w) tells you how to change parameters to reduce loss. Backpropagation is essentially an efficient way to compute these partial derivatives for neural networks.

Relationship to Jacobian and Hessian #

Gradients are the “scalar-output” case of more general derivative objects.

Object	Maps	Output	Meaning
Gradient ∇f	f: ℝⁿ → ℝ	vector in ℝⁿ	slopes of scalar function
Jacobian J	F: ℝⁿ → ℝᵐ	m×n matrix	linear map of first-order changes
Hessian ∇²f	f: ℝⁿ → ℝ	n×n matrix	curvature / second derivatives

If F has components Fⱼ, then the Jacobian rows are gradients:

J(x) has row j equal to (∇Fⱼ(x))ᵀ.

The Hessian shows up when the gradient alone is not enough (e.g., Newton’s method uses curvature to choose better steps).

Coordinate scaling and geometry (important in practice) #

A subtle but practical point: the gradient depends on your coordinate system and scaling.

If one feature is measured in kilometers and another in millimeters, a unit step in each coordinate means wildly different real-world changes. Then “steepest” under the usual Euclidean norm might not align with what you intend.

This is why feature scaling (standardization) matters in ML: it changes the geometry so gradient-based methods behave well.

Level sets, constraints, and Lagrange multipliers (preview) #

If you maximize f(x) subject to g(x) = 0, the feasible directions are tangent to the level set g(x) = 0. At an optimum, moving in any tangent direction can’t improve f, so ∇f must be orthogonal to the tangent space—i.e., it must be parallel to ∇g.

This becomes the condition

∇f(x*) = λ ∇g(x*),

which is the heart of Lagrange multipliers.

Mental model to carry forward #

When you see ∇f in later topics, interpret it as:

•the best local linear summary of a scalar function’s behavior
•a direction (where to move)
•a normal vector (what surfaces look like)
•the engine for optimization algorithms

Worked Examples (3) #

Compute a gradient and use it to predict change #

Let f(x, y) = x²y + 3y. Compute ∇f(x, y). Then at (x, y) = (2, 1), predict the change in f for a small step h = (0.1, −0.05) using the linear approximation.

Compute partial derivatives:
∂f/∂x = ∂/∂x (x²y + 3y)
= 2xy.
∂f/∂y = ∂/∂y (x²y + 3y)
= x² + 3.
Assemble the gradient:
∇f(x, y) = (2xy, x² + 3).
Evaluate at (2, 1):
∇f(2, 1) = (2·2·1, 2² + 3)
= (4, 7).
Use the first-order approximation:
Δf ≈ ∇f(2, 1) · h
= (4, 7) · (0.1, −0.05)
= 4(0.1) + 7(−0.05)
= 0.4 − 0.35
= 0.05.

Insight: The gradient converts a small vector step h into an approximate scalar change via a dot product. The sign comes from alignment: the step had a small component against the y-gradient (−0.05 versus +7), nearly canceling the x increase.

Steepest ascent direction and maximum increase rate #

Let f(x, y) = x eʸ. At the point (0, 0), find (1) the unit direction u of steepest ascent and (2) the maximum directional derivative value.

Compute the gradient:
∂f/∂x = eʸ.
∂f/∂y = x eʸ.
So ∇f(x, y) = (eʸ, x eʸ).
Evaluate at (0, 0):
∇f(0, 0) = (e⁰, 0·e⁰)
= (1, 0).
Direction of steepest ascent is the normalized gradient:
u = ∇f / ‖∇f‖.
Here ‖∇f(0,0)‖ = √(1² + 0²) = 1.
So u = (1, 0).
Maximum directional derivative equals ‖∇f‖:
max_{‖u‖=1} D_u f(0,0) = ‖∇f(0,0)‖ = 1.

Insight: At (0,0), changing y does nothing to first order because the y-slope is proportional to x. The gradient correctly captures that the only immediate increase comes from moving in +x.

Gradient is orthogonal to a level set (circle example) #

Consider f(x, y) = x² + y². Show that ∇f is perpendicular to the level set f(x, y) = 1 at a point (x, y) on the circle.

Compute the gradient:
∇f(x, y) = (∂f/∂x, ∂f/∂y)
= (2x, 2y).
Parameterize the level set f(x, y) = 1 by
r(t) = (cos t, sin t).
Then r′(t) = (−sin t, cos t), which is tangent to the circle.
Evaluate ∇f on the circle:
At r(t), ∇f = (2cos t, 2sin t).
Check orthogonality using a dot product:
∇f · r′(t)
= (2cos t, 2sin t) · (−sin t, cos t)
= 2cos t(−sin t) + 2sin t(cos t)
= −2sin t cos t + 2sin t cos t
= 0.

Insight: For circles centered at the origin, ∇f points radially outward while tangents go around the circle. “Radial” and “tangent” directions are perpendicular—this is the level-set orthogonality principle in a familiar shape.

Key Takeaways #

✓
∇f(x) is the vector of partial derivatives: one component per input variable.
✓
Directional derivatives satisfy D_u f(x) = ∇f(x) · u for any unit direction u.
✓
The steepest-ascent direction is ∇f/‖∇f‖ (when ∇f ≠ 0), and the maximal increase rate is ‖∇f‖.
✓
The first-order approximation is f(x + h) ≈ f(x) + ∇f(x) · h for small h.
✓
∇f is orthogonal to level sets f(x) = c (it is a normal vector to contours/surfaces).
✓
If ∇f(x) = 0, then all first-order directional changes vanish; this identifies critical points (not necessarily minima).
✓
In optimization and ML, gradients drive iterative updates like x ← x − α∇f(x).

Common Mistakes #

✗
Forgetting to evaluate the gradient at the point of interest (carrying around ∇f(x, y) but never plugging in (x₀, y₀)).
✗
Using a non-unit direction in a directional derivative without accounting for its length (D_u assumes ‖u‖ = 1 when interpreting “rate per unit distance”).
✗
Confusing “orthogonal to level sets” with “points toward the origin” (true for x² + y² but not general).
✗
Assuming ∇f = 0 implies a minimum; it could be a maximum or saddle without second-order analysis.

Practice #

easy

Let f(x, y) = 3x² − 2xy + y². Compute ∇f(x, y) and evaluate it at (1, 2).

Hint: Differentiate with respect to x holding y constant, then with respect to y holding x constant.

Show solution

∂f/∂x = 6x − 2y.

∂f/∂y = −2x + 2y.

So ∇f(x, y) = (6x − 2y, −2x + 2y).

At (1, 2): ∇f(1, 2) = (6·1 − 4, −2·1 + 4) = (2, 2).

medium

For f(x, y) = x² + 4y², find the unit direction u at (1, 1) that gives the fastest decrease, and compute the corresponding directional derivative value.

Hint: Fastest decrease is in direction −∇f/‖∇f‖. The minimum directional derivative over unit vectors equals −‖∇f‖.

Show solution

∇f(x, y) = (2x, 8y). So ∇f(1, 1) = (2, 8).

‖∇f(1,1)‖ = √(2² + 8²) = √(4 + 64) = √68 = 2√17.

Fastest decrease unit direction:

u = −(2, 8)/√68 = (−2/√68, −8/√68).

The corresponding directional derivative:

D_u f = ∇f · u = −‖∇f‖ = −2√17.

medium

Let f(x, y) = x + y. Consider the level set f(x, y) = 5 (a line). Show that ∇f is orthogonal to this level set by computing a tangent direction and taking a dot product.

Hint: A level set x + y = 5 can be parameterized by (t, 5 − t).

Show solution

∇f(x, y) = (1, 1) everywhere.

Parameterize the level set: r(t) = (t, 5 − t).

Then r′(t) = (1, −1), which is tangent.

Dot product: ∇f · r′(t) = (1, 1) · (1, −1) = 1 − 1 = 0.

So ∇f is orthogonal to the level set.

Connections #

Optimization Introduction

Quality: A (4.5/5)

← back to tree browse all →