Meta-Learning

Meta-Learning #

Machine Learning · Difficulty: ★★★★★ · Depth: 13 · Unlocks: 0

Learning to learn. Few-shot learning, MAML.

Core Concepts #

Key Symbols & Notation #

theta (θ): meta-parameters / initialization
theta_prime (θ′): adapted task-specific parameters after inner update

Essential Relationships #

Prerequisites (2) #

Deep Learning (6 atoms), Stochastic Gradient Descent (5 atoms)

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

A normal ML pipeline learns one task at a time: it starts random (or pretrained), sees lots of data, and slowly becomes good. Meta-learning flips the question: can we train a system so that seeing just a few examples from a brand-new task is enough to adapt immediately?

TL;DR:

Meta-learning (“learning to learn”) trains over a distribution of tasks. In gradient-based meta-learning (e.g., MAML), we learn meta-parameters θ (often an initialization) so that a small inner-loop update on a new task produces adapted parameters θ′ with low task loss. The outer loop optimizes θ by differentiating through the inner-loop adaptation across many tasks.

What Is Meta-Learning? #

Why meta-learning exists (motivation) #

In standard supervised learning, we assume a single task: one input space, one label space, one dataset, and one loss. If you want to solve a new but related task, you often retrain or fine-tune. That works, but it’s wasteful and slow when:

Humans seem to have a prior over tasks: you can learn a new character in a foreign alphabet from a couple of examples because you’ve learned how learning tends to work in that domain. Meta-learning aims to build that capability into ML systems.

The key shift: from “one dataset” to “a distribution of tasks” #

Meta-learning is not defined by a specific model class (you can meta-learn neural nets, linear models, optimizers). It’s defined by the training setup:

A common formalization:

The meta-learner uses Dₛ(𝒯) to adapt quickly, then is evaluated on D_q(𝒯). Meta-training adjusts shared structure so that this procedure works well for tasks drawn from p(𝒯).

What does “learning to learn” mean concretely? #

A useful way to think about it is as a two-level optimization:

  1. Inner loop (fast adaptation): For a specific task 𝒯, update task-specific parameters using a small amount of data.
  2. Outer loop (meta-learning): Update shared meta-parameters so that the inner-loop adaptation yields low loss on new data for that task.

In gradient-based meta-learning, the shared parameters are often an initialization θ. Given a new task, you start from θ and take one or a few gradient steps to obtain θ′ (task-adapted parameters).

When meta-learning is the right tool #

Meta-learning is strongest when tasks are related but not identical.

Examples:

If tasks are unrelated, no method can reliably transfer. If tasks are identical, ordinary training already works.

A small vocabulary (to keep us aligned) #

| Term | Meaning | Typical symbol |
| --- | --- | --- |
| Task distribution | How tasks are generated | p(𝒯) |
| Task | A specific problem instance with its own loss | 𝒯 |
| Support set | Few-shot data used for adaptation | Dₛ(𝒯) |
| Query set | Data used to evaluate meta-objective | D_q(𝒯) |
| Meta-parameters | Shared parameters across tasks | θ |
| Adapted parameters | Parameters after inner update for task 𝒯 | θ′ |

The rest of the lesson focuses on a canonical approach: MAML (Model-Agnostic Meta-Learning), which cleanly illustrates the inner/outer loop idea and the role of θ and θ′.

Core Mechanic 1: Task Distributions and Episodic Training #

Why episodic training matters #

If the goal is: “perform well after seeing only a few examples from a new task,” then the training procedure should match that goal.

Episodic training (a.k.a. meta-training) simulates test-time conditions repeatedly:

This forces the model to practice adapting under data scarcity.

A more formal view: what is a task? #

A task 𝒯 typically specifies:

For supervised classification, a common choice is cross-entropy over D; for regression, mean-squared error.

The meta-learning assumption is:

You can imagine a hidden variable z controlling each task, e.g., “which classes are chosen,” “which sinusoid parameters,” or “which environment layout.” Even if we don’t explicitly model z, meta-learning tries to learn parameters that work well across the induced distribution.

Support vs query: the train/eval split inside each task #

Within each task 𝒯, we split data into:

This is subtle but crucial:

This resembles cross-validation, but nested inside a task distribution.

The meta-objective in words #

We want θ such that:

So the outer objective is an expectation over tasks:

MetaLoss(θ) = 𝔼_{𝒯 ∼ p(𝒯)} [ ℒ_𝒯(θ′(𝒯); D_q(𝒯)) ]

But θ′(𝒯) is itself produced by a learning rule (inner loop), usually gradient descent.

Typical few-shot classification episode #

A standard benchmark structure is N-way K-shot:

You run many episodes, each with different class subsets.
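As a rough illustration of episode construction, the sketch below samples one N-way K-shot episode under the usual convention: N classes per episode and K labeled support examples per class. The dataset format (a list of `(x, label)` pairs), the `n_query` parameter, and the function name `sample_episode` are assumptions made for this example, not part of the lesson.

```python
import random
from collections import defaultdict

def sample_episode(examples, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode.

    examples: list of (x, label) pairs (hypothetical dataset format).
    Returns (support, query), each a list of (x, episode_label) pairs,
    where episode_label is an index in range(n_way).
    """
    by_class = defaultdict(list)
    for x, label in examples:
        by_class[label].append(x)

    # Only classes with enough examples for both support and query qualify.
    eligible = [c for c, xs in by_class.items() if len(xs) >= k_shot + n_query]
    classes = rng.sample(eligible, n_way)

    support, query = [], []
    for episode_label, c in enumerate(classes):
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(x, episode_label) for x in picked[:k_shot]]
        query += [(x, episode_label) for x in picked[k_shot:]]
    return support, query

# Toy dataset: 10 classes, 20 examples each; x is just a (class, index) tag.
toy = [((c, i), c) for c in range(10) for i in range(20)]
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
print(len(support), len(query))  # 5 15
```

Relabeling classes to `0..N-1` inside each episode means the model cannot memorize a fixed class-to-output mapping; it has to read the labels off the support set.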

What changes compared to “normal” training? #

Normal training:

Meta-training (episodic):

A practical comparison:

| Aspect | Standard supervised learning | Meta-learning (episodic) |
| --- | --- | --- |
| Unit of sampling | Example (x, y) | Task 𝒯 (support + query) |
| Objective | Low loss on dataset | Low loss after adaptation |
| Generalization target | New examples from same task | New tasks from p(𝒯) |
| Overfitting risk | Overfit dataset | Overfit task distribution |

The “model of learning” perspective #

Meta-learning is sometimes described as learning a model of learning: the algorithm itself is trained.

Concretely, you are no longer just learning a function f_θ(x) → y.

You are learning parameters θ such that the procedure

θ → (adapt using Dₛ(𝒯)) → θ′(𝒯) → predictions on D_q(𝒯)

works well across tasks.

This sets up the next mechanic: the fast inner loop and how θ′ is computed.

Core Mechanic 2: Fast Adaptation (Inner Loop) and MAML’s Outer Loop #

Why gradient-based meta-learning is appealing #

If you already know SGD and backprop, a natural idea is: “Can we meta-learn an initialization that fine-tunes quickly?”

This is the core of MAML:

Inner loop: from θ to θ′ #

For a task 𝒯 with support set Dₛ, define the support loss:

ℒₛ(θ) = ℒ_𝒯(θ; Dₛ(𝒯))

A single gradient step with step size α gives:

θ′ = θ − α ∇_θ ℒₛ(θ)

This is the fast adaptation step. With multiple inner steps, you iterate:

θ⁽⁰⁾ = θ

θ⁽i+1⁾ = θ⁽i⁾ − α ∇_{θ⁽i⁾} ℒ_𝒯(θ⁽i⁾; Dₛ)

and set θ′ = θ⁽m⁾.
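The multi-step inner loop is a plain gradient-descent loop that starts at θ. Here is a minimal sketch with a scalar θ; the support loss ℒₛ(θ) = (θ − 3)² and the names `adapt` and `support_grad` are hypothetical, chosen only so the numbers are easy to follow.

```python
def adapt(theta, grad_fn, alpha, steps):
    """Run `steps` gradient steps from theta and return theta' = theta^(m)."""
    theta_i = theta                      # theta^(0) = theta
    for _ in range(steps):
        theta_i = theta_i - alpha * grad_fn(theta_i)
    return theta_i

def support_grad(theta):
    """Gradient of the hypothetical support loss L_s(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

one_step = adapt(0.0, support_grad, alpha=0.25, steps=1)
three_steps = adapt(0.0, support_grad, alpha=0.25, steps=3)
print(one_step, three_steps)  # 1.5 2.625
```

With α = 0.25 each step halves the distance to the task optimum at 3, so more inner steps land closer, at the cost of a longer computation to differentiate through.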

Outer loop: optimize θ for post-adaptation performance #

Now evaluate on the query set:

ℒ_q(θ′) = ℒ_𝒯(θ′; D_q(𝒯))

The meta-objective across tasks is:

min_θ 𝔼_{𝒯 ∼ p(𝒯)} [ ℒ_𝒯(θ′(θ, 𝒯); D_q(𝒯)) ]

The key is that θ′ depends on θ. Therefore, when we compute the meta-gradient ∇_θ ℒ_q(θ′), we must differentiate through the inner update.

The chain rule moment (where MAML becomes “meta”) #

For one inner step:

θ′(θ) = θ − α ∇_θ ℒₛ(θ)

The meta-gradient is:

∇_θ ℒ_q(θ′(θ))

Using the chain rule:

∇_θ ℒ_q(θ′) = (∂θ′/∂θ)ᵀ ∇_{θ′} ℒ_q(θ′)

Compute ∂θ′/∂θ:

θ′ = θ − α ∇_θ ℒₛ(θ)

Differentiate w.r.t. θ:

∂θ′/∂θ = I − α ∂(∇_θ ℒₛ(θ))/∂θ

But ∂(∇_θ ℒₛ)/∂θ is the Hessian:

∂(∇_θ ℒₛ(θ))/∂θ = ∇²_θ ℒₛ(θ)

So:

∂θ′/∂θ = I − α ∇²_θ ℒₛ(θ)

Therefore:

∇_θ ℒ_q(θ′) = (I − α ∇²_θ ℒₛ(θ))ᵀ ∇_{θ′} ℒ_q(θ′)

The Hessian is symmetric whenever ℒₛ is twice continuously differentiable (the usual case), so the transpose doesn’t change it:

∇_θ ℒ_q(θ′) = (I − α ∇²_θ ℒₛ(θ)) ∇_{θ′} ℒ_q(θ′)

This is why MAML is considered “second-order”: it involves Hessian-vector products.
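The formula only ever needs the Hessian multiplied by a vector, so the full Hessian never has to be built. The sketch below illustrates this with a two-parameter model f(x) = w·x + b and mean-squared error. It approximates the Hessian-vector product by a central difference of the gradient, which is a stand-in chosen for this example: autodiff frameworks compute the product exactly. The model, data, and function names are all assumptions.

```python
def loss(theta, data):
    """Mean-squared error for the model f(x) = w*x + b, theta = [w, b]."""
    w, b = theta
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    """Analytic gradient of `loss` with respect to [w, b]."""
    w, b = theta
    n = len(data)
    gw = (2.0 / n) * sum((w * x + b - y) * x for x, y in data)
    gb = (2.0 / n) * sum((w * x + b - y) for x, y in data)
    return [gw, gb]

def hvp(theta, data, v, eps=1e-4):
    """Approximate (Hessian of loss at theta) @ v by differencing the gradient."""
    plus = grad([t + eps * vi for t, vi in zip(theta, v)], data)
    minus = grad([t - eps * vi for t, vi in zip(theta, v)], data)
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

def adapt(theta, support, alpha):
    """One inner step: theta' = theta - alpha * grad L_s(theta)."""
    g = grad(theta, support)
    return [t - alpha * gi for t, gi in zip(theta, g)]

def maml_grad(theta, support, query, alpha):
    """Meta-gradient (I - alpha * H_s) @ g_q, using only a Hessian-vector product."""
    g_q = grad(adapt(theta, support, alpha), query)
    h_times_gq = hvp(theta, support, g_q)
    return [gq - alpha * h for gq, h in zip(g_q, h_times_gq)]

# Hypothetical task data.
theta = [0.3, -0.2]
support = [(1.0, 2.0), (2.0, 3.5)]
query = [(0.5, 1.2), (3.0, 5.0)]
alpha = 0.05

meta_grad = maml_grad(theta, support, query, alpha)
first_order = grad(adapt(theta, support, alpha), query)  # drops the Hessian term
print(meta_grad, first_order)
```

The two printed vectors differ by exactly the α·(Hessian × query gradient) term, which is the second-order part of the meta-gradient.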

First-Order MAML (FOMAML) and why people use it #

Computing the exact meta-gradient can be expensive. A common approximation is to ignore the Hessian term:

∂θ′/∂θ ≈ I

Then:

∇_θ ℒ_q(θ′) ≈ ∇_{θ′} ℒ_q(θ′)

This is FOMAML. It often works surprisingly well, trading some accuracy for speed and memory.

Reptile (related intuition) #

Another popular first-order method is Reptile, which repeatedly:

Update:

θ ← θ + ε(θ′ − θ)

Reptile can be derived as optimizing a meta-objective that encourages within-task generalization. It’s simpler (no second-order terms) and sometimes competitive.
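A minimal sketch of the Reptile update on a toy problem: sample a task, take a few gradient steps on it to get θ′, then move θ a fraction ε of the way toward θ′. The task family (1D regression y = a·x with slope a drawn uniformly from [1, 3]), the fixed inputs, and all hyperparameter values are assumptions chosen for illustration.

```python
import random

def mse_grad(theta, data):
    """Gradient of mean-squared error for the model f(x) = theta * x."""
    return (2.0 / len(data)) * sum((theta * x - y) * x for x, y in data)

def reptile(theta, alpha, epsilon, inner_steps, meta_iters, rng):
    for _ in range(meta_iters):
        a = rng.uniform(1.0, 3.0)                 # sample a task (hypothetical slope family)
        data = [(x, a * x) for x in (1.0, 2.0)]   # that task's training data
        theta_prime = theta
        for _ in range(inner_steps):              # a few gradient steps on the task
            theta_prime -= alpha * mse_grad(theta_prime, data)
        theta += epsilon * (theta_prime - theta)  # theta <- theta + eps * (theta' - theta)
    return theta

theta = reptile(0.0, alpha=0.1, epsilon=0.05, inner_steps=3,
                meta_iters=1000, rng=random.Random(0))
print(theta)  # settles near the middle of the slope range
```

No gradient flows through the inner steps: the outer update only uses the difference θ′ − θ, which is why Reptile needs no second-order terms.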

What is actually being learned? #

It’s tempting to say: “MAML learns an initialization.” That’s true, but incomplete.

MAML learns θ such that:

In geometric terms, θ is placed in parameter space near many task-specific optima, so that a small step from θ moves close to each one.

Inner-loop hyperparameters are part of the story #

The adaptation rule includes choices:

These strongly affect performance. Sometimes α is itself meta-learned.

Practical training loop (conceptual) #

For each meta-iteration:

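  1. Sample a batch of tasks {𝒯ᵢ}.
  2. For each 𝒯ᵢ: adapt on the support set, θ′ᵢ = θ − α ∇_θ ℒ_{𝒯ᵢ}(θ; Dₛ(𝒯ᵢ)), then compute the query loss ℒ_{𝒯ᵢ}(θ′ᵢ; D_q(𝒯ᵢ)).
  3. Meta-update: θ ← θ − β ∇_θ ∑ᵢ ℒ_{𝒯ᵢ}(θ′ᵢ; D_q(𝒯ᵢ))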

Here β is the outer learning rate.

At meta-test time:

This completes the central mechanism: θ is trained so that θ → θ′ quickly yields good performance.
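A minimal end-to-end sketch of this meta-training loop on a toy task family may help make it concrete. It assumes 1D regression tasks y = a·x with slope a drawn uniformly from [1, 3], the model f_θ(x) = θ·x, one inner step, and a meta-gradient averaged (rather than summed) over the task batch. For this model the Hessian is a scalar that can be written by hand, so no autodiff library is needed. All names and hyperparameters are hypothetical.

```python
import random

def mse_loss(theta, data):
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def mse_grad(theta, data):
    return (2.0 / len(data)) * sum((theta * x - y) * x for x, y in data)

def mse_hess(data):
    """Second derivative of the MSE in theta (constant for this model)."""
    return (2.0 / len(data)) * sum(x * x for x, _ in data)

def sample_task(rng, k_support=3, k_query=5):
    """Draw a slope a, then a support set and a query set from y = a * x."""
    a = rng.uniform(1.0, 3.0)
    def draw(k):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        return [(x, a * x) for x in xs]
    return draw(k_support), draw(k_query)

def maml_train(theta, alpha, beta, meta_iters, tasks_per_batch, rng):
    for _ in range(meta_iters):
        meta_grad = 0.0
        for _ in range(tasks_per_batch):
            support, query = sample_task(rng)
            theta_prime = theta - alpha * mse_grad(theta, support)  # inner step
            jac = 1.0 - alpha * mse_hess(support)                   # d theta' / d theta
            meta_grad += jac * mse_grad(theta_prime, query)         # chain rule
        theta -= beta * meta_grad / tasks_per_batch                 # outer step (batch mean)
    return theta

rng = random.Random(0)
alpha = 0.1
theta = maml_train(0.0, alpha=alpha, beta=0.1, meta_iters=500,
                   tasks_per_batch=4, rng=rng)

# Meta-test on a fresh task: adapt on its support set, evaluate on its query set.
support, query = sample_task(rng)
theta_prime = theta - alpha * mse_grad(theta, support)
loss_before = mse_loss(theta, query)
loss_after = mse_loss(theta_prime, query)
print(theta, loss_before, loss_after)
```

The meta-test block at the end reuses the same inner-loop rule with θ frozen: only θ′ changes per task.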

Application/Connection: Few-Shot Learning, Meta-Overfitting, and Practical Variants #

Few-shot learning: where MAML is often introduced #

In few-shot classification, the model must build a classifier for novel classes from very few labeled examples.

Two broad families of approaches:

| Family | Core idea | Examples |
| --- | --- | --- |
| Metric-based | Learn an embedding where nearest-neighbor works | Prototypical Networks, Matching Networks |
| Optimization-based | Learn parameters/initialization to optimize quickly | MAML, FOMAML, Reptile |

MAML’s advantage is flexibility: it can adapt the whole network, not just a linear head. Its disadvantage is computational cost.

Regression and RL: why “model-agnostic” matters #

Because MAML only assumes differentiability, it applies to:

In RL, tasks might be different environments. The inner loop becomes one or a few policy-gradient updates; the query loss is the negative expected return after adaptation.

The hidden danger: meta-overfitting #

Meta-learning can overfit in two ways:

  1. Within-task overfitting: adapting too strongly to the small support set.
  2. Across-task overfitting (meta-overfitting): learning θ that works well on meta-training tasks but not on meta-test tasks.

Meta-overfitting is especially likely if:

Practical mitigations:

Computation and memory: why second-order is hard #

Exact MAML requires differentiating through inner-loop computation graphs.

Costs:

Common workarounds:

What you get from meta-learning (and what you don’t) #

Meta-learning is not magic. It leverages task similarity. You should expect:

Interpreting θ and θ′ in practice #

It helps to make θ and θ′ concrete:

The quality of meta-learning is measured by how good θ′ becomes given a tiny Dₛ.

Connections to other ideas #

A simple mental model #

If you imagine each task has its own optimal parameters θ*(𝒯), then MAML tries to find θ such that:

This is why, during meta-training, you must repeatedly practice: adapt on support, evaluate on query.

When to choose MAML vs metric-based methods #

| Criterion | Metric-based (Prototypical) | MAML |
| --- | --- | --- |
| Speed at meta-test | Very fast | Requires inner optimization |
| Flexibility | Often assumes class structure | Works for many losses/settings |
| Implementation complexity | Moderate | High (unrolling, stability) |
| Best when | Embedding is sufficient | Task requires deeper adaptation |

If your tasks differ mainly in labels/classes but share representation, metric methods are strong. If tasks require changing internal features or dynamics, optimization-based meta-learning can shine.

This closes the loop: meta-learning is a training paradigm over tasks, with a fast inner adaptation and a meta-objective optimizing θ so that θ′ performs well after few-shot updates.

Worked Examples (3) #

Worked Example 1: One-Step MAML Meta-Gradient in 1D (Scalar θ) #

Consider a simple 1D parameter θ ∈ ℝ. For a sampled task 𝒯, define the support loss ℒₛ(θ) and query loss ℒ_q(θ). We do one inner step: θ′ = θ − α dℒₛ/dθ. We want dℒ_q(θ′)/dθ.

  1. Inner update (one step):

    θ′(θ) = θ − α (dℒₛ(θ)/dθ)

  2. Differentiate θ′ w.r.t. θ:

    dθ′/dθ = 1 − α d/dθ (dℒₛ/dθ)

    = 1 − α (d²ℒₛ/dθ²)

  3. Apply chain rule to the meta-objective:

    d/dθ ℒ_q(θ′(θ)) = (dℒ_q/dθ′) · (dθ′/dθ)

  4. Substitute the expression for dθ′/dθ:

    dℒ_q(θ′)/dθ = (dℒ_q/dθ′) · (1 − α d²ℒₛ/dθ²)

Insight: Even in 1D, the meta-gradient includes a curvature term from the support loss. MAML is optimizing θ not just for low loss, but for producing useful gradients from few examples.
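To see the curvature term with actual numbers, the sketch below plugs in two hypothetical losses, ℒₛ(θ) = θ⁴ and ℒ_q(θ) = (θ − 1)², and compares the formula from step 4 against a finite-difference derivative of ℒ_q(θ′(θ)). The losses, α = 0.1, and θ = 0.5 are assumptions picked only to keep the arithmetic readable.

```python
alpha = 0.1

def support_grad(theta):
    """d/dtheta of the hypothetical support loss L_s(theta) = theta^4."""
    return 4.0 * theta ** 3

def support_curv(theta):
    """Second derivative of L_s."""
    return 12.0 * theta ** 2

def query_loss(theta):
    """Hypothetical query loss L_q(theta) = (theta - 1)^2."""
    return (theta - 1.0) ** 2

def query_grad(theta):
    return 2.0 * (theta - 1.0)

def adapt(theta):
    """One inner step: theta' = theta - alpha * dL_s/dtheta."""
    return theta - alpha * support_grad(theta)

def meta_grad(theta):
    """Step 4: (dL_q/dtheta') * (1 - alpha * d^2 L_s / dtheta^2)."""
    return query_grad(adapt(theta)) * (1.0 - alpha * support_curv(theta))

theta = 0.5
analytic = meta_grad(theta)          # theta' = 0.45, so about (-1.1) * 0.7 = -0.77
eps = 1e-6
numeric = (query_loss(adapt(theta + eps)) - query_loss(adapt(theta - eps))) / (2 * eps)
print(analytic, numeric)
```

Dropping the curvature factor would give about −1.1 instead of −0.77, so in this instance the second-order term changes the meta-gradient by roughly 30%.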

Worked Example 2: Linear Regression Task Family and “Good Initialization” Geometry #

Suppose tasks are 1D linear regression with parameter a (task-specific slope). For each task 𝒯, data is y = a x with small noise. The model is f_θ(x) = θ x. Support set has a few (x, y) pairs. Show how one gradient step moves θ toward a.

  1. Define mean-squared error on support set Dₛ:

    ℒₛ(θ) = (1/|Dₛ|) ∑(θ xᵢ − yᵢ)²

  2. Use yᵢ = a xᵢ (ignore noise for clarity):

    θ xᵢ − yᵢ = θ xᵢ − a xᵢ = (θ − a) xᵢ

  3. Rewrite the loss:

    ℒₛ(θ) = (1/|Dₛ|) ∑ ((θ − a) xᵢ)²

    = (θ − a)² · (1/|Dₛ|) ∑ xᵢ²

  4. Compute the gradient:

    dℒₛ/dθ = 2(θ − a) · (1/|Dₛ|) ∑ xᵢ²

  5. One inner gradient step:

    θ′ = θ − α · 2(θ − a) · (1/|Dₛ|) ∑ xᵢ²

  6. Factor the update:

    θ′ = θ − c(θ − a) where c = 2α(1/|Dₛ|) ∑ xᵢ²

    So:

    θ′ = (1 − c) θ + c a

Insight: For this family, one step is a convex combination of θ and the task slope a (if 0 < c < 1). Meta-learning θ amounts to choosing an initialization that is close (on average) to task-specific optima so that one step lands near a.
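A quick numerical check of steps 5 and 6: the sketch takes one explicit gradient step on a noise-free support set and compares it with the closed form θ′ = (1 − c)θ + c·a. The slope a = 3, the inputs x = 1 and x = 2, and α = 0.1 are hypothetical values.

```python
def one_step(theta, support, alpha):
    """Explicit inner step on the mean-squared error for f(x) = theta * x."""
    n = len(support)
    grad = (2.0 / n) * sum((theta * x - y) * x for x, y in support)
    return theta - alpha * grad

def closed_form(theta, a, xs, alpha):
    """theta' = (1 - c) * theta + c * a, with c = 2 * alpha * mean(x^2)."""
    c = 2.0 * alpha * sum(x * x for x in xs) / len(xs)
    return (1.0 - c) * theta + c * a, c

a, theta, alpha = 3.0, 0.0, 0.1
xs = [1.0, 2.0]
support = [(x, a * x) for x in xs]   # noise-free data from y = a * x

theta_step = one_step(theta, support, alpha)
theta_formula, c = closed_form(theta, a, xs, alpha)
print(theta_step, theta_formula, c)
```

Here mean(x²) = 2.5, so c = 0.5 and a single step moves θ halfway from 0 to the task slope 3.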

Worked Example 3: FOMAML Approximation vs Full MAML (What You Drop) #

For one inner step: θ′ = θ − α ∇_θ ℒₛ(θ). Full MAML uses ∇_θ ℒ_q(θ′(θ)). FOMAML approximates this gradient. Write both explicitly to see the difference.

  1. Full MAML meta-gradient:

    ∇_θ ℒ_q(θ′) = (∂θ′/∂θ)ᵀ ∇_{θ′} ℒ_q(θ′)

  2. Compute the Jacobian:

    ∂θ′/∂θ = I − α ∇²_θ ℒₛ(θ)

  3. So full MAML is:

    ∇_θ ℒ_q(θ′) = (I − α ∇²_θ ℒₛ(θ))ᵀ ∇_{θ′} ℒ_q(θ′)

  4. FOMAML approximation sets:

    ∂θ′/∂θ ≈ I

    Therefore:

    ∇_θ ℒ_q(θ′) ≈ ∇_{θ′} ℒ_q(θ′)

Insight: FOMAML treats the Jacobian ∂θ′/∂θ as the identity: it computes the query gradient at θ′ and applies it to θ unchanged. You keep the benefit of adapting in the inner loop, but you ignore how changing θ changes the adaptation trajectory.
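The two gradients can be compared directly on a small instance. The sketch below assumes the 1D regression model f_θ(x) = θ·x with mean-squared error, where the Hessian is the scalar 2·mean(x²); the support and query points (drawn from y = 3x) and α = 0.1 are hypothetical.

```python
def mse_grad(theta, data):
    """Gradient of mean-squared error for the model f(x) = theta * x."""
    return (2.0 / len(data)) * sum((theta * x - y) * x for x, y in data)

def mse_hess(data):
    """Second derivative of the MSE in theta: 2 * mean(x^2)."""
    return (2.0 / len(data)) * sum(x * x for x, _ in data)

alpha, theta = 0.1, 0.0
support = [(1.0, 3.0), (2.0, 6.0)]   # points from y = 3x
query = [(1.0, 3.0), (3.0, 9.0)]     # more points from the same task

theta_prime = theta - alpha * mse_grad(theta, support)   # inner step
g_q = mse_grad(theta_prime, query)                       # query gradient at theta'

fomaml_grad = g_q                                        # Jacobian treated as 1
maml_grad = (1.0 - alpha * mse_hess(support)) * g_q      # full chain rule
print(theta_prime, fomaml_grad, maml_grad)
```

In this instance both gradients point the same way and differ only by the factor 1 − α·ℒₛ″ = 0.5; with vector parameters the Hessian term can also rotate the direction, not just rescale it.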

Key Takeaways #

Common Mistakes #

Practice #

easy

You have tasks 𝒯 ∼ p(𝒯). For each task you compute θ′ = θ − α ∇_θ ℒ_𝒯(θ; Dₛ). Write the meta-objective using a query set D_q and describe (in one sentence) what it encourages.

Hint: Use an expectation over tasks and evaluate loss at θ′ on D_q.

Solution:

MetaLoss(θ) = 𝔼_{𝒯 ∼ p(𝒯)} [ ℒ_𝒯(θ′; D_q(𝒯)) ], where θ′ = θ − α ∇_θ ℒ_𝒯(θ; Dₛ(𝒯)). It encourages choosing θ so that a small gradient-based adaptation using Dₛ yields parameters that generalize well to new data D_q from the same task.

medium

Derive ∂θ′/∂θ for one inner step θ′ = θ − α ∇_θ ℒₛ(θ), and identify where the Hessian appears.

Hint: Differentiate both sides w.r.t. θ; derivative of a gradient is a Hessian.

Solution:

Differentiate: ∂θ′/∂θ = ∂/∂θ [θ − α ∇_θ ℒₛ(θ)] = I − α ∂(∇_θ ℒₛ(θ))/∂θ = I − α ∇²_θ ℒₛ(θ). The Hessian appears as the Jacobian of the gradient.

hard

In the linear regression family y = a x (task slope a), suppose one inner step yields θ′ = (1 − c)θ + c a for some 0 < c < 1 (as derived in the lesson). If tasks have slopes a with mean 𝔼[a] = μ, what initialization θ minimizes expected squared error 𝔼[(θ − a)²] before adaptation? What does that suggest about a reasonable meta-initialization when only one small step is allowed?

Hint: Minimizing 𝔼[(θ − a)²] over θ gives θ = 𝔼[a].

Solution:

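We minimize J(θ) = 𝔼[(θ − a)²]. Differentiate: dJ/dθ = 𝔼[2(θ − a)] = 2(θ − 𝔼[a]). Setting this to 0 gives θ* = 𝔼[a] = μ. If c is the same for every task, the same θ is also optimal after adaptation: θ′ − a = (1 − c)(θ − a), so 𝔼[(θ′ − a)²] = (1 − c)² 𝔼[(θ − a)²]. This suggests that when only a limited adaptation is possible, a good meta-initialization is near the average task optimum; the inner step then nudges θ toward each specific a.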
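As a numerical sanity check, the sketch below evaluates the expected squared error on a grid of candidate initializations, both before adaptation and after one step θ′ = (1 − c)θ + c·a. The four task slopes, the value c = 0.5, and the assumption that c is identical across tasks are all hypothetical choices for this example.

```python
slopes = [1.0, 2.0, 3.0, 6.0]   # hypothetical task slopes, mean 3.0
c = 0.5                          # assumed identical for every task

def pre_error(theta):
    """Mean of (theta - a)^2 over tasks, before adaptation."""
    return sum((theta - a) ** 2 for a in slopes) / len(slopes)

def post_error(theta):
    """Mean of (theta' - a)^2 over tasks, with theta' = (1 - c) * theta + c * a."""
    return sum(((1 - c) * theta + c * a - a) ** 2 for a in slopes) / len(slopes)

grid = [i * 0.5 for i in range(13)]      # 0.0, 0.5, ..., 6.0
best_pre = min(grid, key=pre_error)
best_post = min(grid, key=post_error)
mu = sum(slopes) / len(slopes)
print(mu, best_pre, best_post)  # 3.0 3.0 3.0
```

Adaptation shrinks every task's error by the same factor (1 − c)², so it changes the value of the objective but not where its minimum sits.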

Connections #

Related nodes:

Quality: A (4.4/5)
