Machine Learning Introduction

←Back to Tech Tree

inventorycoverage

Machine Learning Introduction #

Machine LearningDifficulty: ★★★☆☆Depth: 8Unlocks: 26

Learning patterns from data. Supervised vs unsupervised learning.

Interactive Visualization #

⏮◀◀▶▶STEP0.25x1xZOOM

t=0s

Core Concepts #

Key Symbols & Notation #

f(x; theta) - parameterized model (inputs x, parameters theta)

Essential Relationships #

Prerequisites (2) #

Maximum Likelihood Estimation6 atomsGradient Descent6 atoms

Unlocks (6) #

Logistic Regressionlvl 3Bias-Variance Tradeofflvl 4Loss Functionslvl 4Decision Treeslvl 4Linear Regressionlvl 3Clusteringlvl 4

Referenced by (1) #

Where this concept shows up in the operating-finance and personal-finance graphs.

From Business (1) #

[pipelineBusiness

The DATA → MODEL → OBJECTIVE pipeline is the canonical supervised learning workflow introduced here: input data, a learnable model, and a loss to optimize](/business/pipeline/)

Advanced Learning Details

Graph Position #

106

Depth Cost

26

Fan-Out (ROI)

11

Bottleneck Score

8

Chain Length

Cognitive Load #

5

Atomic Elements

50

Total Elements

L4

Percentile Level

L3

Atomic Level

All Concepts (22) #

Teaching Strategy #

Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.

Machine learning is the idea that instead of hand-writing rules (“if the email contains ‘free money’, mark as spam”), we let a program learn a pattern from examples. The central move is surprisingly simple: pick a flexible function f(x; θ), define what it means to be wrong with a loss, and use data to find θ that makes the loss small.

TL;DR:

Machine learning = choosing a parameterized model f(x; θ) and fitting parameters θ by minimizing an empirical risk (average loss on a dataset). In supervised learning, data includes labels y so the loss measures prediction error; in unsupervised learning, there are no labels, so the objective captures structure (e.g., clustering, density). Training is optimization (often gradient descent), and generalization is the goal (performing well on new data).

What Is Machine Learning Introduction? #

Why “learning” at all? #

In many problems, writing explicit rules is either too hard or too brittle.

Rules struggle because the mapping from inputs to outputs is complex and full of edge cases. Machine learning changes the workflow:

  1. 1)Collect data that represents the problem.
  2. 2)Choose a model family (a set of functions) that could plausibly represent the relationship.
  3. 3)Define an objective that measures how well a model fits the data.
  4. 4)Optimize the objective to find parameters.
  5. 5)Evaluate on new (held-out) data to check generalization.

The core abstraction: a parameterized model #

A machine learning model is a parameterized function:

We write this as:

f(x; θ)

Examples of what f might represent:

The key is that f is not a single function until you choose θ. The learning problem is: find θ from data.

Data as a dataset, not a single example #

Typically we have n examples. In supervised learning we write:

D = { (xᵢ, yᵢ) }ᵢ₌₁ⁿ

In unsupervised learning we usually have only inputs:

D = { xᵢ }ᵢ₌₁ⁿ

Learning as “fit on data” but “perform in the world” #

A common confusion: if we minimize error on the training set, aren’t we done?

Not quite. We care about performance on new data from the same underlying process. So we distinguish:

A model can fit training data extremely well yet generalize poorly (overfitting). This lesson sets up the objects we need to discuss that later (e.g., in Bias-Variance Tradeoff).

A unifying view #

Machine learning problems often look different on the surface (regression vs classification vs clustering), but underneath they share a template:

  1. 1)Choose f(x; θ)
  2. 2)Choose an objective J(θ)
  3. 3)Compute θ̂ = argmin_θ J(θ)

The rest of the field is largely about better choices for f, better objectives, and better optimization/evaluation procedures.

Core Mechanic 1: Parameterized Models f(x; θ) #

Why start with parameterized models? #

We need a way to turn “patterns in data” into a concrete computational object. A parameterized model does exactly that: it defines a space of candidate functions indexed by θ.

Think of θ as coordinates in a “function space.” Each θ picks a specific function f(·; θ).

Inputs as vectors, parameters as vectors #

In many ML settings:

We often use boldface for vectors: x, w, θ.

A classic example is a linear score function:

s(x; w, b) = wx + b

You can fold b into w by augmenting the input:

Let x′ = [x; 1] and w′ = [w; b]

Then:

s(x; w, b) = (w′)ᵀ x′

This “feature augmentation” is simple, but it’s a good reminder: the choice of representation can simplify everything.

What does the model output mean? #

Different tasks interpret f(x; θ) differently.

Task typeOutput f(x; θ)Typical interpretation
Regressionreal number ŷ ∈ ℝpredicted value
Binary classificationprobability p ∈ [0, 1]P(y = 1
Multi-class classificationvector p ∈ [0,1]ᵏ, ∑ pⱼ = 1class distribution
Representation learningembedding z ∈ ℝᵐlearned features

The same “score” s(x; θ) might be post-processed:

Model capacity (informal) #

A parameterized model family can be:

You don’t need a formal definition of capacity yet, but you should feel the tradeoff: richer f can fit more, but may generalize worse without regularization or enough data.

A geometric intuition #

For linear models, learning often becomes geometry.

Even for non-linear models, optimization still adjusts θ to reshape decision boundaries or function curves.

Model ≠ algorithm #

A subtle but crucial distinction:

Different training algorithms can fit the same model family. And the same training algorithm can be used for different model families.

This separation is what makes ML modular.

Core Mechanic 2: Empirical Risk (Training Objective) #

Why we need an objective #

To “learn,” we need a quantitative notion of goodness. That means a scalar objective function we can minimize.

In supervised learning, the objective is usually based on a loss function ℓ(ŷ, y) that measures how bad a prediction is for one example.

Then we aggregate loss over the dataset.

Per-example loss → dataset objective #

Given D = { (xᵢ, yᵢ) }ᵢ₌₁ⁿ and predictions ŷᵢ = f(xᵢ; θ), define empirical risk:

R̂(θ) = (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ; θ), yᵢ)

Training typically means:

θ̂ = argmin_θ R̂(θ)

This is called Empirical Risk Minimization (ERM).

Why “empirical”? #

Because it is computed on the empirical sample you observed, not on the full (unknown) data-generating distribution.

If the true distribution of (x, y) is P, then the population (true) risk is:

R(θ) = E₍x,y₎∼P[ ℓ(f(x; θ), y) ]

But we cannot compute R(θ) exactly because P is unknown. We approximate it with R̂(θ).

This is the bridge between data and generalization:

Connection to Maximum Likelihood Estimation (MLE) #

You already know MLE. Many ML objectives are MLE in disguise.

Suppose your model defines a conditional distribution p(y | x; θ). The likelihood on data is:

L(θ) = ∏ᵢ₌₁ⁿ p(yᵢ | xᵢ; θ)

MLE chooses:

θ̂ = argmax_θ ∏ᵢ p(yᵢ | xᵢ; θ)

It is more convenient to maximize log-likelihood:

log L(θ) = ∑ᵢ log p(yᵢ | xᵢ; θ)

Maximizing log-likelihood is equivalent to minimizing negative log-likelihood (NLL):

θ̂ = argmin_θ −∑ᵢ log p(yᵢ | xᵢ; θ)

Now compare with ERM:

R̂(θ) = (1/n) ∑ᵢ ℓ(f(xᵢ; θ), yᵢ)

If we set:

ℓᵢ(θ) = −log p(yᵢ | xᵢ; θ)

then minimizing empirical risk equals MLE (up to the constant factor 1/n).

So one major story of ML is:

Choose a probabilistic model p(y|x;θ) ⇒ derive a loss (NLL) ⇒ minimize it.

Regularization: adding a preference over θ #

Often we train not just by fitting data, but also by discouraging overly complex parameter settings.

A common regularized objective is:

J(θ) = (1/n) ∑ᵢ ℓ(f(xᵢ; θ), yᵢ) + λ Ω(θ)

Where:

From a probabilistic view, regularization often corresponds to a prior on θ (MAP estimation), but you don’t need that machinery to use it.

Optimization: where gradient descent enters #

Given J(θ), gradient descent performs updates:

θ ← θ − α ∇θ J(θ)

Because you already know gradient descent, the new idea here is mostly: what is J? ML is largely the craft of choosing J so that minimizing it produces useful behavior.

Training, validation, test (briefly) #

To estimate generalization, we split data:

The key is to avoid letting the test set influence choices, otherwise your “estimate” of generalization becomes biased.

Application/Connection: Supervised vs Unsupervised Learning #

Why this distinction matters #

“Supervised vs unsupervised” is not just a label—it determines what objective you can write down.

This changes evaluation, failure modes, and what “success” even means.

Supervised learning #

Data: D = { (xᵢ, yᵢ) }

Goal: learn a mapping x → y.

Typical objective (ERM):

R̂(θ) = (1/n) ∑ᵢ ℓ(f(xᵢ; θ), yᵢ)

Common tasks:

Evaluation is comparatively straightforward:

Unsupervised learning #

Data: D = { xᵢ }

Goal: discover structure in x.

Because there is no y, the objective can take different forms:

  1. 1)Clustering: assign cluster IDs cᵢ to minimize within-cluster dispersion.

min_{μ₁,…,μ_k, c₁,…,cₙ} ∑ᵢ ‖xᵢ − μ_{cᵢ}‖²

  1. 2)Density estimation / generative modeling: fit p(x; θ).

θ̂ = argmax_θ ∏ᵢ p(xᵢ; θ)

  1. 3)Dimensionality reduction: find a low-dimensional representation z = g(x).

Evaluation is trickier because “ground truth” labels may not exist. You often evaluate indirectly:

Semi-supervised and self-supervised (bridge concepts) #

Many modern systems mix ideas:

These still follow the same template: define f and define an objective J.

A single table to anchor the big picture #

Learning settingData availableTypical objectiveWhat success looks like
Supervised(x, y) pairsminimize prediction losslow test error on y
Unsupervisedx onlyclustering / likelihood / reconstructionmeaningful structure; useful representations
Semi-supervisedfew (x, y) + many xcombined supervised + unsupervisedbetter than supervised-only with few labels
Self-supervisedx only (pseudo y)predict withheld parts / contrast viewstransferable representations

Connections to what you’ll learn next #

This intro node enables specific, concrete models and deeper theory:

The through-line: every one of these is still “choose f(x; θ), choose J(θ), optimize.”

Worked Examples (3) #

From MLE to a supervised training loss (negative log-likelihood) #

You have labeled data D = { (xᵢ, yᵢ) }ᵢ₌₁ⁿ. Your model specifies a conditional probability p(y | x; θ). Show how maximizing likelihood becomes minimizing an empirical risk with a per-example loss.

  1. Start with the likelihood of the dataset (conditional independence assumption):

    L(θ) = ∏ᵢ₌₁ⁿ p(yᵢ | xᵢ; θ)

  2. Take logs to make products into sums:

    log L(θ) = log(∏ᵢ p(yᵢ | xᵢ; θ))

    = ∑ᵢ log p(yᵢ | xᵢ; θ)

  3. MLE chooses θ that maximizes log-likelihood:

    θ̂ = argmax_θ ∑ᵢ log p(yᵢ | xᵢ; θ)

  4. Convert to a minimization by multiplying by −1:

    θ̂ = argmin_θ −∑ᵢ log p(yᵢ | xᵢ; θ)

  5. Recognize this as a sum of per-example losses. Define:

    ℓᵢ(θ) = −log p(yᵢ | xᵢ; θ)

    Then the objective is:

    J(θ) = ∑ᵢ ℓᵢ(θ)

  6. Optionally average over n (does not change the argmin):

    R̂(θ) = (1/n) ∑ᵢ₌₁ⁿ ℓᵢ(θ)

    This matches ERM with loss ℓ(f(xᵢ; θ), yᵢ) when f induces p(y|x;θ).

Insight: A large fraction of supervised ML is just MLE with a convenient parameterization. The “loss function” is often simply the negative log-likelihood of your probabilistic model.

A tiny supervised ERM example with squared loss #

Suppose f(x; θ) = θx (a 1D linear model with one parameter θ). Data: (x₁, y₁) = (1, 2), (x₂, y₂) = (2, 3). Use squared loss ℓ(ŷ, y) = (ŷ − y)². Write the empirical risk and find the minimizing θ by calculus.

  1. Write predictions:

    ŷ₁ = f(1; θ) = θ

    ŷ₂ = f(2; θ) = 2θ

  2. Write empirical risk (average squared loss):

    R̂(θ) = (1/2)[(θ − 2)² + (2θ − 3)²]

  3. Expand the squares:

    (θ − 2)² = θ² − 4θ + 4

    (2θ − 3)² = 4θ² − 12θ + 9

  4. Sum and scale:

    R̂(θ) = (1/2)[(θ² − 4θ + 4) + (4θ² − 12θ + 9)]

    = (1/2)[5θ² − 16θ + 13]

    = (5/2)θ² − 8θ + 13/2

  5. Differentiate and set to zero:

    ∂R̂/∂θ = 5θ − 8

    Set 5θ − 8 = 0 ⇒ θ = 8/5 = 1.6

  6. Check (optional) that it is a minimum:

    ∂²R̂/∂θ² = 5 > 0 ⇒ minimum.

Insight: Even this toy problem follows the full ML pattern: define f(x; θ), choose a loss, average it into R̂(θ), then optimize. In higher dimensions you typically use vector calculus and gradient descent rather than solving in closed form.

Unsupervised objective example: k-means as empirical risk without labels #

You have unlabeled points x₁,…,xₙ in ℝᵈ. You want to group them into k clusters. Write the k-means objective and interpret it as empirical risk minimization over latent assignments.

  1. Introduce cluster centers μ₁,…,μ_k where each μⱼ ∈ ℝᵈ.

  2. Introduce an assignment cᵢ ∈ {1,…,k} for each data point xᵢ (a latent/hidden variable).

  3. Define the distortion (squared distance) for a point to its assigned center:

    lossᵢ = ‖xᵢ − μ_{cᵢ}‖²

  4. Sum over the dataset to get an empirical objective:

    J(μ₁,…,μ_k, c₁,…,cₙ) = ∑ᵢ₌₁ⁿ ‖xᵢ − μ_{cᵢ}‖²

  5. Interpretation:

    • •The “parameters” are the centers μⱼ.
    • •The “labels” cᵢ are not given; they are chosen to minimize J.
    • •The objective encourages points in the same cluster to be close to their center.
  6. Training alternates (informally):

    • •Fix centers, choose assignments cᵢ to nearest μⱼ.
    • •Fix assignments, update each μⱼ to the mean of its assigned points.

    This decreases J each step.

Insight: Unsupervised learning still optimizes a scalar objective, but the objective is about structure (compact clusters) rather than matching given targets y. The absence of labels changes both the objective and how you validate results.

Key Takeaways #

Common Mistakes #

Practice #

easy

You have supervised data D = { (xᵢ, yᵢ) }ᵢ₌₁ⁿ and a model f(x; θ). Write the empirical risk for absolute error loss ℓ(ŷ, y) = |ŷ − y|. Is the objective differentiable everywhere?

Hint: Write R̂(θ) = (1/n)∑|f(xᵢ;θ) − yᵢ|. Consider what happens when f(xᵢ;θ) = yᵢ.

Show solution

The empirical risk is:

R̂(θ) = (1/n) ∑ᵢ₌₁ⁿ |f(xᵢ; θ) − yᵢ|.

It is not differentiable at points where f(xᵢ; θ) − yᵢ = 0 for some i (the absolute value has a kink at 0). It is subdifferentiable there, and differentiable elsewhere.

medium

Show that minimizing the average loss (1/n)∑ᵢ ℓᵢ(θ) has the same minimizer as minimizing the sum ∑ᵢ ℓᵢ(θ).

Hint: Multiplying an objective by a positive constant does not change the argmin.

Show solution

Let J(θ) = ∑ᵢ₌₁ⁿ ℓᵢ(θ) and R̂(θ) = (1/n)J(θ).

For any θ₁, θ₂:

J(θ₁) ≤ J(θ₂) ⇔ (1/n)J(θ₁) ≤ (1/n)J(θ₂)

because 1/n > 0.

Therefore θ that minimizes J also minimizes R̂, so:

argmin_θ J(θ) = argmin_θ R̂(θ).

medium

Unsupervised vs supervised identification: For each dataset type below, say whether it is supervised or unsupervised and propose a plausible objective.

(A) 10,000 images each tagged with {cat, dog}

(B) 1,000,000 unlabeled user click sequences

(C) 50,000 house listings with final sale price

Hint: Look for whether y is present. If not, think clustering, likelihood, reconstruction, or contrastive objectives.

Show solution

(A) Supervised (classification). Objective: minimize cross-entropy / NLL for p(y|x;θ).

(B) Unsupervised (no labels). Objective examples: maximize likelihood of sequences p(x;θ), or self-supervised next-item prediction / masked prediction loss.

(C) Supervised (regression). Objective: minimize squared loss (ŷ − y)² (possibly with regularization).

Connections #

Next nodes you can study:

Quality: A (4.3/5)

← back to treebrowse all →