Neural Networks


Neural Networks #

Machine Learning · Difficulty: ★★★★☆ · Depth: 10 · Unlocks: 11

Layers of nonlinear transformations. Universal approximators.


Core Concepts #

Key Symbols & Notation #

W (weight matrix), b (bias vector)

Essential Relationships #

Prerequisites (2) #

  • Logistic Regression (6 atoms)
  • Matrix Calculus (6 atoms)

Unlocks (6) #

  • Backpropagation (lvl 4)
  • Regularization (lvl 4)
  • Layer Normalization (lvl 4)
  • Variational Autoencoders (lvl 5)
  • Dimensionality Reduction (lvl 4)
  • Generative Adversarial Networks (lvl 5)

Advanced Learning Details

Graph Position #

  • Depth Cost: 136
  • Fan-Out (ROI): 11
  • Bottleneck Score: 5
  • Chain Length: 10

Cognitive Load #

  • Atomic Elements: 6
  • Total Elements: 52
  • Percentile Level: L4
  • Atomic Level: L4

All Concepts (20) #

Teaching Strategy #

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

A neural network is what you get when you stop asking a model to be “one good formula” and instead let it be “many simple formulas composed together.” Each layer is easy to understand (an affine map plus a nonlinearity), but stacking layers creates a surprisingly rich family of functions—rich enough to approximate almost any smooth pattern you can express with data.

TL;DR:

Neural networks are parametric functions built by alternating affine transformations (x ↦ Wx + b) and elementwise nonlinearities. The nonlinearity is essential: without it, many layers collapse into one linear map. With it, depth creates expressive, flexible models that generalize logistic regression to multi-layer feature learning.

What Is a Neural Network? #

Why this concept exists #

Logistic regression is powerful because it turns a linear score (Wx + b) into a nonlinear probability using σ(·). But it still fundamentally draws a linear decision boundary in the original feature space: it can only separate classes that are linearly separable (or close to it).

Neural networks extend this idea by repeatedly doing two steps:

  1. Mix features linearly (affine transformation)
  2. Warp them nonlinearly (activation)

By composing these steps many times, a network can learn intermediate representations—new “features” that make a difficult task easy for the final layer.

The core object: a parametric function #

A (feedforward) neural network defines a function:

f( x; θ ) → y

where θ is the set of all parameters, typically weights and biases across layers:

θ = { (W¹, b¹), (W², b²), …, (Wᴸ, bᴸ) }

A standard L-layer multilayer perceptron (MLP) has hidden states hˡ computed as:

h⁰ = x

hˡ = φ( Wˡ hˡ⁻¹ + bˡ ) for l = 1, …, L

The final layer is sometimes left “linear” (no activation) depending on the task.
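The layer-by-layer recipe can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `mlp_forward`, the choice of ReLU as φ, and the layer sizes are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(0, z)."""
    return np.maximum(0.0, z)

def mlp_forward(x, params, activation=relu):
    """Forward pass of an MLP (hypothetical helper).

    params is a list of (W, b) pairs, one per layer. Every layer except
    the last applies the activation; the last layer is left linear.
    """
    h = x
    for W, b in params[:-1]:
        h = activation(W @ h + b)   # affine map, then nonlinearity
    W_out, b_out = params[-1]
    return W_out @ h + b_out        # linear output layer

# Assumed sizes for illustration: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(4, 3)), np.zeros(4)),
    (rng.normal(size=(2, 4)), np.zeros(2)),
]
y = mlp_forward(np.array([1.0, -2.0, 0.5]), params)
print(y.shape)  # (2,)
```

For classification you would typically pass the linear output through a sigmoid or softmax afterwards.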

Shapes (so nothing feels mysterious) #

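If the input has dimension d and layer l has width nˡ (with n⁰ = d):

  • Wˡ is an nˡ × nˡ⁻¹ matrix
  • bˡ is a vector of length nˡ
  • hˡ is a vector of length nˡ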

This is just matrix-vector multiplication plus a bias.
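One quick way to make the shapes concrete is to build the parameters and print them. The helper name `init_params`, the small random initialization, and the sizes (10 → 64 → 5) are assumptions chosen for illustration.

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Create one (W, b) pair per layer (hypothetical helper).

    layer_sizes = [d, n1, ..., nL]; each W has shape (n_out, n_in).
    """
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(scale=0.1, size=(n_out, n_in))  # maps R^n_in -> R^n_out
        b = np.zeros(n_out)
        params.append((W, b))
    return params

shapes = [(W.shape, b.shape) for W, b in init_params([10, 64, 5])]
print(shapes)  # [((64, 10), (64,)), ((5, 64), (5,))]
```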

What does “universal approximator” mean (informally)? #

A key motivation: neural networks can approximate very complicated functions. Roughly:

Important nuance:

Neural networks as “learned features” #

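A helpful mental model is: the hidden layers compute learned features of x, and the last layer is a simple linear (or logistic) model on top of those features.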

This is logistic regression’s idea—applied repeatedly.

Connection to logistic regression #

Binary logistic regression can be written as a 1-layer network:

p(y = 1 | x) = σ( Wx + b )

A deeper network replaces the direct linear score with a learned feature map h = g(x) and then uses:

p(y = 1 | x) = σ( wᵀh + b )

where g(·) is itself a composition of affine maps and nonlinearities.
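A small sketch of the two models side by side may help. The function names are hypothetical, and using a single ReLU layer for g is an assumption made to keep the example short.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression(x, w, b):
    """p(y=1|x) = sigmoid(w^T x + b): a linear score on the raw input."""
    return sigmoid(w @ x + b)

def one_hidden_layer_classifier(x, W1, b1, w, b):
    """Same output rule, but applied to learned features h = g(x).

    Here g is assumed to be one ReLU layer.
    """
    h = np.maximum(0.0, W1 @ x + b1)
    return sigmoid(w @ h + b)

x = np.array([1.0, 2.0])
w, b = np.array([0.5, -0.25]), 0.0

p_lr = logistic_regression(x, w, b)
# With W1 = I and b1 = 0, a nonnegative input passes through ReLU unchanged,
# so the network reproduces logistic regression on this x.
p_nn = one_hidden_layer_classifier(x, np.eye(2), np.zeros(2), w, b)
print(p_lr, p_nn)  # 0.5 0.5
```

The interesting case is when W1 and b1 are learned, so that h becomes a feature vector better suited to the task than x itself.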

Core Mechanic 1: Affine Transformations as Learnable Feature Mixing #

Why affine maps are the workhorse #

An affine transformation

x ↦ Wx + b

is the simplest learnable operation that can:

If you have matrix calculus background, you already know how gradients behave for affine maps, which is one reason they are so central.

Geometry intuition #

Think of x as a point in ℝᵈ.

A linear layer can:

But a single affine layer cannot “bend” space. It cannot create curved decision boundaries by itself.

Affine layers and feature creation #

Suppose the first layer is:

z¹ = W¹ x + b¹

Each coordinate z¹ᵢ is:

z¹ᵢ = w¹ᵢᵀ x + b¹ᵢ

So every neuron computes a linear score of the input—very similar to logistic regression’s logit. The difference comes next: we don’t directly interpret this as a probability; we feed it forward as a feature.

Why the bias matters #

Without b, every hyperplane zᵢ = 0 must pass through the origin. Adding bias lets each neuron choose its own threshold.

This matters especially when combined with piecewise-linear activations (like ReLU), where the bias controls where the “kink” happens.

Layer widths: undercomplete vs overcomplete #

The hidden dimension nˡ affects what the network can represent.

| Choice | What it enables | What it risks |
|---|---|---|
| nˡ < nˡ⁻¹ (compression) | bottleneck features, dimensionality reduction | information loss |
| nˡ = nˡ⁻¹ | stable capacity | may need depth for expressiveness |
| nˡ > nˡ⁻¹ (expansion) | rich feature mixing, sparse or disentangled features | overfitting, optimization cost |

Composition of affine maps (a crucial warning) #

If you stack affine maps without nonlinearities:

h¹ = W¹ x + b¹

h² = W² h¹ + b²

then:

h² = W²(W¹ x + b¹) + b²

= (W²W¹)x + (W²b¹ + b²)

This is still just one affine map.

So if we only used affine layers, depth would be pointless. The entire expressive leap of neural networks comes from the nonlinearity.
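The collapse is easy to confirm numerically. The sketch below uses arbitrary random matrices (the sizes and seed are assumptions) and compares the two-layer computation with the merged single affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two affine layers applied in sequence, with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single affine map: W = W2 W1, b = W2 b1 + b2.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

print(np.allclose(two_layers, one_layer))  # True
```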

Core Mechanic 2: Elementwise Nonlinearities (Activations) Create Expressiveness #

Why nonlinearities are the “magic ingredient” #

A nonlinearity φ applied elementwise:

h = φ(z) meaning hᵢ = φ(zᵢ)

is what prevents the network from collapsing into a single affine transformation.

Nonlinearities let networks represent:

Common activation functions #

You’ll see a small set of activations repeatedly:

| Activation | Formula | Range | Pros | Cons |
|---|---|---|---|---|
| Sigmoid | σ(t) = 1/(1+e⁻ᵗ) | (0, 1) | probabilistic interpretation | saturates, vanishing gradients |
| Tanh | tanh(t) | (−1, 1) | zero-centered | saturates |
| ReLU | max(0, t) | [0, ∞) | simple, sparse, strong gradients for t>0 | “dead” units for t≤0 |
| Leaky ReLU | max(αt, t), 0 < α < 1 | (−∞, ∞) | reduces dead units | extra hyperparameter α |
| GELU | t·Φ(t), Φ = standard normal CDF | ≈ [−0.17, ∞) | smooth, strong in transformers | more compute |

For many modern MLPs, ReLU/GELU variants dominate.
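The activations in the table are each one line of NumPy. This is a sketch for experimentation; the default α = 0.01 for Leaky ReLU is a common but arbitrary choice, and GELU is computed from its exact definition via the error function rather than a fast approximation.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise error function without SciPy

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def tanh(t):
    return np.tanh(t)

def relu(t):
    return np.maximum(0.0, t)

def leaky_relu(t, alpha=0.01):
    # max(alpha * t, t); alpha = 0.01 is an assumed default
    return np.maximum(alpha * t, t)

def gelu(t):
    # t * Phi(t), where Phi(t) = 0.5 * (1 + erf(t / sqrt(2)))
    return 0.5 * t * (1.0 + _erf(t / math.sqrt(2.0)))

t = np.array([-2.0, 0.0, 2.0])
for fn in (sigmoid, tanh, relu, leaky_relu, gelu):
    print(fn.__name__, fn(t))
```

Note that GELU dips slightly below zero for negative inputs before returning toward zero, which is why its range has a small negative lower bound.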

How ReLU creates piecewise linearity #

Consider a 1D input x and a single neuron:

h = ReLU(wx + b)

This is:

h = wx + b if wx + b > 0

h = 0 otherwise

So it is a “hinge” function with a kink at x = −b/w (for w ≠ 0).

If you sum many such hinges, you can approximate complex shapes. In higher dimensions, each ReLU neuron corresponds to a half-space boundary (wᵀx + b = 0), and the network becomes a partition of input space into regions where the overall function is linear.

Depth vs width (intuition) #

This is part of why deep networks can represent complex functions more efficiently than shallow ones.

Output activations depend on the task #

The last layer is often chosen to match the meaning of outputs:

| Task | Output | Typical final layer |
|---|---|---|
| Binary classification | p(y=1) | sigmoid |
| Multi-class classification | p(y=k) | softmax |
| Regression | real value | linear (identity) |

For multi-class, with logits s ∈ ℝᴷ:

softmax(s)ₖ = exp(sₖ) / ∑ⱼ exp(sⱼ)
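In code, softmax is usually computed after subtracting the largest logit. This does not change the result (the shift cancels in the ratio) but avoids overflow in exp. A minimal sketch for a single logit vector:

```python
import numpy as np

def softmax(s):
    """Softmax of a 1-D logit vector, shifted by max(s) for numerical stability."""
    shifted = s - np.max(s)   # largest entry becomes 0, so exp never overflows
    e = np.exp(shifted)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())             # probabilities that sum to 1 (up to rounding)

# Large logits would overflow a naive exp(s); the shifted version is fine.
print(softmax(np.array([1000.0, 1000.0])))  # [0.5 0.5]
```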

Why “elementwise” is a design choice #

Elementwise nonlinearities are simple and efficient. But note:

Many advanced architectures introduce nonlinearities that depend on multiple coordinates (attention, normalization, gating). But elementwise activations are the core starting point.

Layered Composition: From Simple Parts to a Single Powerful Function #

Why composition is the right mental model #

A neural network is best understood as a composition of functions:

f(x) = fᴸ ∘ fᴸ⁻¹ ∘ … ∘ f¹ (x)

where each layer function is typically:

fˡ(h) = φ( Wˡ h + bˡ )

Composition matters because:

Writing an MLP explicitly #

For a 2-hidden-layer network:

h¹ = φ( W¹ x + b¹ )

h² = φ( W² h¹ + b² )

y = W³ h² + b³

Even though each step is simple, the final mapping x ↦ y can be highly nonlinear.

Interpreting hidden units as detectors #

Each neuron computes:

hᵢ = φ( wᵢᵀ h_prev + bᵢ )

This can be seen as:

In early layers, features might correspond to simple patterns.

In deeper layers, features become combinations of combinations.

A practical view: network as a feature map + linear head #

Many networks can be decomposed conceptually:

h = g(x; θ_g)

y = Ah + c

where g is the deep feature extractor and (A, c) is a linear “head.”

This view is helpful because:

Loss functions (what training tries to minimize) #

A neural network becomes useful when paired with a loss.

Given dataset {( x⁽ⁱ⁾, t⁽ⁱ⁾ )}ᵢ:

Minimize: (1/N) ∑ᵢ ℓ( f(x⁽ⁱ⁾), t⁽ⁱ⁾ )

Examples:

Because you know matrix calculus, you can view training as gradient-based optimization in a high-dimensional parameter space.
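As an illustration of the averaged objective, here are two widely used choices for ℓ: squared error (typical for regression) and cross-entropy (typical for classification). The function names and the tiny arrays are assumptions for the example.

```python
import numpy as np

def mse_loss(preds, targets):
    """Mean squared error, averaged over all entries."""
    return np.mean((preds - targets) ** 2)

def cross_entropy_loss(probs, labels):
    """Average negative log-probability of the correct class.

    probs:  (N, K) array, each row a probability vector (e.g. softmax output)
    labels: (N,) array of integer class indices
    """
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

print(mse_loss(np.array([1.0, 2.0]), np.array([1.0, 4.0])))  # 2.0

probs = np.array([[0.5, 0.5], [0.25, 0.75]])
labels = np.array([0, 1])
print(cross_entropy_loss(probs, labels))  # -(log 0.5 + log 0.75) / 2
```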

Parameter counting (capacity intuition) #

If layer l has nˡ units and the previous layer has nˡ⁻¹:

  • Wˡ contributes nˡ·nˡ⁻¹ weights
  • bˡ contributes nˡ biases

Total parameters ≈ ∑ˡ (nˡ·nˡ⁻¹ + nˡ)

Large capacity can fit complex functions—but increases overfitting risk, motivating regularization (an unlock node).
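The count for a plain fully connected network is a one-line sum. The helper name `count_parameters` is hypothetical, and the example sizes (10 → 64 → 5) are chosen only for illustration.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes = [d, n1, ..., nL]
    """
    return sum(n_out * n_in + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 10 -> 64 -> 5: (64*10 + 64) + (5*64 + 5)
print(count_parameters([10, 64, 5]))  # 1029
```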

A note on “universal approximation” and practice #

Universal approximation results say “there exist parameters.” In practice you must also consider:

This is why deep learning is both a theory of function classes and an engineering discipline.

Application / Connection: Where Neural Networks Fit in Machine Learning #

Neural networks as the default nonlinear model #

When feature engineering is hard, neural networks shine because they can learn representations.

They appear in many forms:

The common thread is still layers of affine-like transforms plus nonlinearities, often with architectural constraints.

Decision boundaries: from linear to complex #

Logistic regression gives a hyperplane boundary.

An MLP can build boundaries that are unions and compositions of many half-spaces.

A helpful picture (conceptual):

Why training needs backpropagation #

To train, you need gradients ∂ℓ/∂Wˡ and ∂ℓ/∂bˡ for every layer.

Naively computing derivatives separately for each parameter would be expensive.

Backpropagation is the efficient application of the chain rule through the layered composition. This is exactly the next node you unlock.
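To see why the naive route is expensive, consider estimating each partial derivative by finite differences: every parameter needs its own pair of forward passes. The sketch below does this for a tiny network (the packing of parameters into `theta`, the squared-error loss, and the numbers are all assumptions for illustration). Backpropagation instead obtains all the gradients for a cost comparable to a small constant number of forward passes.

```python
import numpy as np

def loss(theta, x, t):
    """0.5 * squared error of a tiny ReLU net.

    theta packs W1 (2x2), b1 (2), w2 (2), b2 (1) into one vector of length 9.
    """
    W1 = theta[:4].reshape(2, 2)
    b1 = theta[4:6]
    w2 = theta[6:8]
    b2 = theta[8]
    h = np.maximum(0.0, W1 @ x + b1)
    y = w2 @ h + b2
    return 0.5 * (y - t) ** 2

def finite_difference_grad(theta, x, t, eps=1e-6):
    """Central differences: two forward passes per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (loss(theta + step, x, t) - loss(theta - step, x, t)) / (2 * eps)
    return grad

theta = np.array([1.0, -2.0, 0.0, 3.0, -1.0, 2.0, 2.0, -1.0, 0.5])
x, t = np.array([2.0, 1.0]), 0.0
grad = finite_difference_grad(theta, x, t)   # 18 forward passes for 9 parameters
print(grad.round(3))
```

With millions of parameters, this per-parameter loop becomes impractical, which is the gap backpropagation fills.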

Regularization and normalization are not optional in deep nets #

High-capacity models can memorize. Regularization techniques (L2, dropout, early stopping) and normalization (batch norm, layer norm) help:

These are also unlocked nodes—and they become much easier to appreciate once the basic network mapping is clear.

Neural nets as building blocks for generative models #

Variational autoencoders (VAEs) and many other generative models use neural networks to parameterize distributions:

In this sense, “neural network” is not the whole algorithm; it’s the function approximator inside the algorithm.

Connection to dimensionality reduction #

Autoencoders are neural networks trained to reconstruct inputs through a bottleneck:

x → encoder → z (low-dim) → decoder → x̂

This creates a learned, nonlinear dimensionality reduction—connected to the dimensionality reduction node.
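The data flow through a bottleneck can be sketched as follows. The weights here are random and untrained, so the reconstruction is poor; the point is only the shapes and the quantity a training loop would minimize. The sizes, the tanh encoder, and the linear decoder are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2   # assumed input dimension and bottleneck dimension

W_enc, b_enc = rng.normal(scale=0.1, size=(k, d)), np.zeros(k)
W_dec, b_dec = rng.normal(scale=0.1, size=(d, k)), np.zeros(d)

def encode(x):
    """Map the input to a low-dimensional code z."""
    return np.tanh(W_enc @ x + b_enc)

def decode(z):
    """Map the code back to input space (linear decoder)."""
    return W_dec @ z + b_dec

x = rng.normal(size=d)
z = encode(x)
x_hat = decode(z)

# Training would adjust the weights to make this reconstruction error small.
reconstruction_error = np.mean((x - x_hat) ** 2)
print(z.shape, x_hat.shape)  # (2,) (8,)
```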

Worked Examples (3) #

Forward pass through a small MLP (with shapes and numbers) #

Compute the output of a 2-layer network with ReLU hidden layer and a linear output. Let x ∈ ℝ².

Given:

W¹ = [[1, −2], [0, 3]] , b¹ = [−1, 2]

W² = [[2, −1]] , b² = [0.5]

Activation φ = ReLU.

Input x = [2, 1].

  1. Step 1: Compute pre-activation z¹ = W¹x + b¹.

    W¹x = [[1, −2], [0, 3]] [2, 1]

    = [1·2 + (−2)·1, 0·2 + 3·1]

    = [0, 3]

    z¹ = [0, 3] + [−1, 2] = [−1, 5]

  2. Step 2: Apply ReLU: h¹ = ReLU(z¹).

    ReLU([−1, 5]) = [0, 5]

  3. Step 3: Compute output pre-activation z² = W²h¹ + b².

    W²h¹ = [2, −1] [0, 5] = 2·0 + (−1)·5 = −5

    z² = −5 + 0.5 = −4.5

  4. Step 4: Since output is linear, y = z².

    Final output y = −4.5

Insight: Even with tiny matrices, you can see the pattern: affine → ReLU → affine. ReLU zeroed out the first hidden unit, so only the second feature contributed to the final score. This “selective routing” is a core behavior of ReLU networks.
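The same computation can be checked in NumPy, using exactly the numbers given above:

```python
import numpy as np

W1 = np.array([[1.0, -2.0], [0.0, 3.0]])
b1 = np.array([-1.0, 2.0])
W2 = np.array([[2.0, -1.0]])
b2 = np.array([0.5])
x = np.array([2.0, 1.0])

z1 = W1 @ x + b1          # pre-activation: [-1, 5]
h1 = np.maximum(0.0, z1)  # ReLU:           [0, 5]
y = W2 @ h1 + b2          # linear output:  [-4.5]

print(z1, h1, y)
```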

Why nonlinearities are necessary: collapsing two affine layers into one #

Show that stacking affine layers without an activation is still just an affine map.

Let h¹ = W¹x + b¹ and h² = W²h¹ + b².

  1. Start with the definition of the second layer:

    h² = W²h¹ + b²

  2. Substitute h¹ = W¹x + b¹:

    h² = W²(W¹x + b¹) + b²

  3. Distribute W²:

    h² = W²W¹x + W²b¹ + b²

  4. Group terms into a single affine form:

    Let W̃ = W²W¹ and b̃ = W²b¹ + b².

    Then h² = W̃x + b̃

Insight: Depth without nonlinearity gives no extra expressive power. Activations prevent this collapse, making layered composition meaningful.

A tiny universal-approximation intuition in 1D with ReLU “hinges” #

Approximate a simple piecewise-linear function on x ∈ [0, 2] using a sum of shifted ReLUs.

Target function:

f(x) = { x for 0 ≤ x ≤ 1

{ 2 − x for 1 < x ≤ 2

This is a triangle peak at x=1. Show it can be written using ReLU(·).

  1. Recall ReLU(t) = max(0, t). Consider these three hinge functions:

    h₁(x) = ReLU(x)

    h₂(x) = ReLU(x − 1)

    h₃(x) = ReLU(x − 2)

  2. Construct a piecewise-linear function by combining them:

    g(x) = 1·h₁(x) − 2·h₂(x) + 1·h₃(x)

  3. Check intervals.

    For 0 ≤ x ≤ 1:

    • h₁(x) = x
    • h₂(x) = 0
    • h₃(x) = 0

    So g(x)=x (matches f).

  4. For 1 ≤ x ≤ 2:

    • h₁(x) = x
    • h₂(x) = x − 1
    • h₃(x) = 0

    So g(x)= x − 2(x−1) = x − 2x + 2 = 2 − x (matches f).

  5. For x ≥ 2:

    • h₁(x) = x
    • h₂(x) = x − 1
    • h₃(x) = x − 2

    So g(x)= x − 2(x−1) + (x−2) = 0 (triangle returns to 0).

Insight: A sum of a few shifted ReLUs can build a nontrivial shape. In higher dimensions and deeper networks, this idea scales: many hinges compose into extremely rich functions.
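A quick numerical check of the construction, comparing g with the target f on a grid over [0, 2] (the grid resolution is an arbitrary choice):

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def g(x):
    """Sum of three shifted ReLU hinges."""
    return relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)

def f(x):
    """Target triangle on [0, 2]."""
    return np.where(x <= 1.0, x, 2.0 - x)

xs = np.linspace(0.0, 2.0, 201)
print(np.allclose(g(xs), f(xs)))  # True
print(g(np.array([3.0, 10.0])))   # [0. 0.]: flat again past x = 2
```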

Key Takeaways #

Common Mistakes #

Practice #

easy

You are given a network h¹ = ReLU(W¹x + b¹), y = σ(wᵀh¹ + b). If ReLU were removed (replaced with identity), show that the model reduces to logistic regression in the original input x.

Hint: Substitute h¹ = W¹x + b¹ into the output and regroup terms into a single weight vector and bias.

Solution:

Without ReLU, h¹ = W¹x + b¹.

Then the logit is:

s = wᵀh¹ + b

= wᵀ(W¹x + b¹) + b

= (wᵀW¹)x + (wᵀb¹ + b)

Define w̃ᵀ = wᵀW¹ and b̃ = wᵀb¹ + b.

So p(y=1|x) = σ(s) = σ(w̃ᵀx + b̃), which is logistic regression.

medium

Consider an MLP with input dimension d = 10, one hidden layer of width n¹ = 64, and output dimension K = 5 (multi-class). The hidden layer uses ReLU and the output uses softmax. How many parameters are there total (including biases)?

Hint: Count parameters per layer: W¹, b¹, W², b².

Solution:

Layer 1: W¹ ∈ ℝ⁶⁴×¹⁰ has 64·10 = 640 parameters. b¹ ∈ ℝ⁶⁴ has 64 parameters.

Layer 2: W² ∈ ℝ⁵×⁶⁴ has 5·64 = 320 parameters. b² ∈ ℝ⁵ has 5 parameters.

Total = 640 + 64 + 320 + 5 = 1029 parameters.

hard

Let φ be ReLU. For a 1-hidden-layer network f(x) = ∑ᵢ aᵢ ReLU(wᵢ x + bᵢ) + c in 1D, explain why f(x) is piecewise linear, and where its slope can change.

Hint: Each ReLU term changes from 0 to linear at the point where wᵢ x + bᵢ = 0.

Solution:

Each term ReLU(wᵢ x + bᵢ) is either 0 (when wᵢ x + bᵢ ≤ 0) or a linear function (wᵢ x + bᵢ) (when wᵢ x + bᵢ > 0). Therefore, on any interval where the sign of every (wᵢ x + bᵢ) is fixed, every term is linear (either constant 0 or linear), and the sum is linear.

A slope change can only occur when at least one neuron switches regime, i.e., at a breakpoint x where wᵢ x + bᵢ = 0 ⇒ x = −bᵢ / wᵢ (for wᵢ ≠ 0). Thus f(x) is piecewise linear with possible kinks at those breakpoint locations.
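The breakpoint locations can be computed directly from the weights and biases. The helper name `breakpoints` is hypothetical; the example parameters are those of the triangle built from ReLU(x), ReLU(x − 1), and ReLU(x − 2).

```python
import numpy as np

def breakpoints(w, b):
    """Sorted kink locations x = -b_i / w_i, skipping units with w_i = 0."""
    w = np.asarray(w, dtype=float)
    b = np.asarray(b, dtype=float)
    active = w != 0            # a unit with w_i = 0 is constant in x: no kink
    return np.sort(-b[active] / w[active])

# Units ReLU(x), ReLU(x - 1), ReLU(x - 2): w = 1 for each, b = 0, -1, -2.
kinks = breakpoints([1.0, 1.0, 1.0], [0.0, -1.0, -2.0])
print(kinks)  # kinks at x = 0, 1, 2
```

Whether the slope actually changes at a given breakpoint also depends on the output weight aᵢ: if aᵢ = 0, or contributions from coinciding breakpoints cancel, the kink disappears.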

Connections #

Next nodes you’re set up for:

Related refreshers:

