In each region, the network is linear:
- If zⱼ<0 ⇒ ReLU(zⱼ)=0 (unit off)
- If zⱼ>0 ⇒ ReLU(zⱼ)=zⱼ (unit on)
So different regions activate different subsets of linear pieces, yielding a “bent” boundary when you solve y(x)=0.
This picture explains why ReLU networks are powerful: with many units, you get many regions, and therefore many linear pieces. Depth increases the number of regions dramatically.
Smooth activations (sigmoid/tanh) give smooth warps #
Sigmoid:
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
Tanh:
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
These functions don’t create sharp “kinks” like ReLU; they create smooth transitions. That can be beneficial (smooth gradients) but can also cause saturation for large |z|.
Output range and centering #
A practical (often overlooked) design detail is the range and mean of activations:
| Activation | Range | Zero-centered? | Typical note |
|---|---|---|---|
| Sigmoid | (0, 1) | No | Good for probabilities; saturates |
| tanh | (-1, 1) | Yes | Often better than sigmoid in hidden layers |
| ReLU | [0, ∞) | No | Sparse activations; simple; risk of dead units |
Zero-centering matters because if activations are mostly positive, the next layer’s gradients can become biased (all weights pushed similarly), sometimes slowing optimization.
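To make the zero-centering column concrete, here is a minimal sketch in plain Python that averages each activation over a symmetric grid of pre-activations. The grid `zs` and the helper `mean` are illustrative assumptions, not part of the original text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def mean(values):
    return sum(values) / len(values)

# Assumed inputs: pre-activations spread symmetrically around 0, from -3 to 3.
zs = [i / 10.0 for i in range(-30, 31)]

means = {
    "sigmoid": mean([sigmoid(z) for z in zs]),  # sits near 0.5: not zero-centered
    "tanh": mean([math.tanh(z) for z in zs]),   # sits near 0: zero-centered
    "relu": mean([relu(z) for z in zs]),        # strictly positive: not zero-centered
}

for name, value in means.items():
    print(f"{name:8s} mean output = {value:+.3f}")
```

Even with inputs centered at 0, sigmoid and ReLU hand the next layer outputs with a positive mean, while tanh does not.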
Core Mechanic 2: Gradients, Saturation, and Training Dynamics #
The derivative is the “gate” for backprop #
Let one neuron be:
- $z = \mathbf{w}^\top \mathbf{x} + b$
- $a = f(z)$
During backprop, the gradient that flows into $z$ is:
$$\delta_z \equiv \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\, f'(z) = \delta_a\, f'(z)$$
So $f'(z)$ is literally a multiplier on the error signal.
If you stack many layers, you multiply many such terms (plus weight matrices). A simplified 1D intuition:
$$\frac{\partial L}{\partial z^{(1)}} \approx \frac{\partial L}{\partial z^{(L)}} \prod_{\ell=1}^{L-1} f'\big(z^{(\ell)}\big)\, w^{(\ell)}$$
If the product shrinks toward 0, you get vanishing gradients. If it grows huge, exploding gradients.
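The product above can be evaluated directly for a toy 1D chain. This is a rough sketch under stated assumptions: the depth of 10, the pre-activation values, and the weights are made up for illustration, and `chain_factor` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chain_factor(fprime, zs, ws):
    """Product of f'(z_l) * w_l over the layers of a 1D chain."""
    prod = 1.0
    for z, w in zip(zs, ws):
        prod *= fprime(z) * w
    return prod

depth = 10  # assumed number of layers

# Sigmoid at its best point (z = 0) with unit weights: 0.25 ** 10, about 9.5e-7.
factor_sigmoid = chain_factor(sigmoid_grad, [0.0] * depth, [1.0] * depth)

# Active ReLU units with unit weights: the signal passes unchanged.
factor_relu = chain_factor(relu_grad, [1.0] * depth, [1.0] * depth)

# Active ReLU units with weights of 2: the product grows as 2 ** 10.
factor_explode = chain_factor(relu_grad, [1.0] * depth, [2.0] * depth)

print(factor_sigmoid, factor_relu, factor_explode)
```

The three cases correspond to a vanishing, a preserved, and an exploding gradient signal, respectively.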
Saturation: when learning stalls #
Sigmoid derivative:
First compute it cleanly:
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
Differentiate:
$$\sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2}$$
A more useful identity:
$$\sigma'(z) = \sigma(z)\,(1-\sigma(z))$$
This peaks at $\sigma(z)=0.5$ (i.e., $z=0$):
$$\max \sigma'(z) = 0.25$$
So even in the best case, each sigmoid layer multiplies gradients by at most 0.25 (before considering weights). In the saturated tails (large $|z|$), $\sigma'(z)\approx 0$.
Tanh derivative:
$$\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$$
This peaks at 1 when $z=0$, but still goes to 0 as $|z|$ grows.
Interpretation:
- Sigmoid/tanh are smooth but can “turn off” gradients when $z$ drifts into saturation.
- This is why modern hidden layers rarely use sigmoid (except special cases like gates in LSTMs).
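The closed-form derivatives above can be sanity-checked against a central finite difference. This is a small sketch; the step size `h` and the sample points are arbitrary choices made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # identity: sigma'(z) = sigma(z) * (1 - sigma(z))

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def numeric_grad(f, z, h=1e-6):
    # Central difference approximation of f'(z); h is an assumed step size.
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(
        f"z={z:+5.1f}  sigmoid'={sigmoid_grad(z):.2e} "
        f"(numeric {numeric_grad(sigmoid, z):.2e})  tanh'={tanh_grad(z):.2e}"
    )
```

Both derivatives are largest at z = 0 and collapse toward 0 at z = ±10, which is the saturation described above.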
ReLU: non-saturating (half the time) #
ReLU derivative is:
$$\operatorname{ReLU}'(z) = \begin{cases} 0 & z<0 \\ 1 & z>0 \end{cases}$$
(Undefined at 0, but implementations pick 0 or 1; it rarely matters in practice.)
So for active units (z>0), gradients pass through unchanged (multiplied by 1). That’s a big reason ReLU enabled very deep networks to train effectively.
But the zero-derivative region causes the dying ReLU problem: if a neuron’s inputs make $z<0$ for most data, it outputs 0 and gets no gradient to recover.
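One way to picture the dying ReLU risk is to measure the fraction of inputs on which a unit is active. The sketch below assumes a scalar-input neuron; the dataset `xs`, the parameter values, and the helper `active_fraction` are hypothetical.

```python
def relu_grad(z, at_zero=0.0):
    # The derivative at exactly 0 is a convention; `at_zero` makes that choice explicit.
    if z > 0:
        return 1.0
    if z < 0:
        return 0.0
    return at_zero

def active_fraction(w, b, xs):
    """Fraction of inputs for which z = w*x + b is strictly positive."""
    return sum(1 for x in xs if w * x + b > 0) / len(xs)

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]            # assumed toy dataset

healthy = active_fraction(1.0, 0.0, xs)      # active on the positive inputs
dead = active_fraction(0.1, -5.0, xs)        # z < 0 everywhere: no gradient at all

print(f"healthy unit active on {healthy:.0%} of inputs")
print(f"dead unit active on {dead:.0%} of inputs")
```

A unit whose active fraction is 0 receives a zero gradient on every example, so gradient descent alone cannot move it back.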
Common variants and why they exist #
Most activation variants are attempts to tune one or more properties:
- keep gradients alive
- keep outputs well-scaled
- avoid numerical issues
- preserve some sparsity
| Activation | Definition | Derivative behavior | Typical use |
|---|---|---|---|
| Leaky ReLU | $\max(\alpha z, z)$ | small slope $\alpha$ for $z<0$ | reduces dead ReLUs |
| ELU | $z$ if $z>0$, else $\alpha(e^z-1)$ | smooth negative side | sometimes improves convergence |
| GELU | $z\,\Phi(z)$, with $\Phi$ the standard normal CDF (often approximated) | smooth, nonzero slope | Transformers/modern NLP |
| Softplus | $\log(1+e^z)$ | smooth ReLU; never 0 | when you need smoothness |
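The ELU, GELU, and Softplus rows of the table translate into a few lines of plain Python. This is a sketch for scalar inputs only: the default `alpha=1.0` is an assumption, GELU is written with the exact normal CDF via `math.erf` rather than an approximation, and the softplus here is the direct formula, which is only safe for moderate z.

```python
import math

def elu(z, alpha=1.0):
    # z on the positive side, alpha * (e^z - 1) on the negative side.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def gelu(z):
    # z * Phi(z), where Phi is the standard normal CDF expressed through erf.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def softplus(z):
    # Direct formula log(1 + e^z); overflows for very large z.
    return math.log1p(math.exp(z))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z={z:+.1f}  elu={elu(z):+.4f}  gelu={gelu(z):+.4f}  softplus={softplus(z):.4f}")
```

All three track z for large positive inputs and flatten on the negative side, but none of them is exactly zero for negative z, unlike ReLU.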
Leaky ReLU derivative:
$$f(z)=\begin{cases} \alpha z & z<0 \\ z & z\ge 0 \end{cases} \quad\Rightarrow\quad f'(z)=\begin{cases} \alpha & z<0 \\ 1 & z>0 \end{cases}$$
So even when “off,” some gradient passes.
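A minimal sketch of Leaky ReLU and its derivative follows. The slope `alpha=0.01` is a commonly used value assumed here for illustration, and the derivative at exactly z = 0 is set to 1 by convention.

```python
def leaky_relu(z, alpha=0.01):
    # Matches max(alpha * z, z) for 0 < alpha < 1.
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    # Slope alpha on the negative side, 1 otherwise (z = 0 treated as active).
    return alpha if z < 0 else 1.0

upstream = -1.0  # assumed incoming gradient dL/da
for z in (-10.0, -0.1, 0.1, 10.0):
    print(f"z={z:+5.1f}  a={leaky_relu(z):+.3f}  dL/dz={upstream * leaky_relu_grad(z):+.3f}")
```

The gradient for negative z is scaled down by `alpha` rather than erased, so an inactive unit can still drift back toward the active region.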
Sparsity: why zeros can be a feature #
ReLU-like activations produce many exact zeros. That implies:
- sparse activations → fewer active paths → sometimes easier optimization
- implicit regularization: fewer co-adaptations
- computational benefits in some systems
But sparsity is not free: too many zeros can reduce effective capacity and can stall learning for dead units.
Numerical stability considerations #
Activation functions interact with floating-point behavior.
- Sigmoid overflow/underflow: $e^{-z}$ can overflow for large negative $z$ (since $-z$ becomes large positive). Stable implementations branch:
  - if $z \ge 0$, use $\sigma(z)=1/(1+e^{-z})$
  - else use $\sigma(z)=e^z/(1+e^z)$
- Softplus stability: $\log(1+e^z)$ is stabilized via $\operatorname{softplus}(z) = \log(1+e^z) = \max(0,z) + \log(1+e^{-|z|})$, which avoids overflow when $z$ is large.
- Saturation as “numerical” issue: even if you avoid overflow, saturation still produces gradients near machine precision (effectively 0), which is an optimization issue.
These concerns connect directly to careful initialization and normalization methods you’ll meet later.
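The branching sigmoid and the rewritten softplus described above can be sketched in plain Python as follows. The function names are illustrative, and the naive version is included only to show the failure mode; in Python, `math.exp` raises `OverflowError` when its result is too large to represent.

```python
import math

def naive_sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stable_sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))   # exp(-z) <= 1, cannot overflow
    ez = math.exp(z)                          # z < 0, so exp(z) < 1
    return ez / (1.0 + ez)

def stable_softplus(z):
    # max(0, z) + log(1 + exp(-|z|)); the exponent is never positive.
    return max(0.0, z) + math.log1p(math.exp(-abs(z)))

try:
    naive_sigmoid(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print("naive sigmoid overflowed:", naive_overflowed)
print("stable sigmoid(-1000):", stable_sigmoid(-1000.0))
print("stable softplus(1000):", stable_softplus(1000.0))
```

Array libraries typically ship their own stable versions of these functions, so a hand-written branch like this is mainly useful for understanding what they do.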
Application/Connection: Choosing Activations in Real Networks #
A practical decision process #
You rarely choose an activation in isolation. You choose it given:
- task (classification vs regression)
- depth
- normalization (BatchNorm/LayerNorm)
- expected input scale
- whether you need probabilities or bounded outputs
A useful rule of thumb:
1. Hidden layers (general deep nets): ReLU / GELU are common defaults.
2. Output layers: choose based on the output’s meaning.
Output activations (match the target) #
| Task | Output activation | Output meaning |
|---|---|---|
| Binary classification | sigmoid | $p(y=1\mid \mathbf{x})$ |
| Multi-class (single label) | softmax | categorical distribution |
| Regression (unbounded) | identity | any real value |
| Regression (positive) | softplus or exp | positive real |
| Regression (bounded) | tanh/sigmoid | constrained range |
Softmax is not elementwise (it mixes logits), but it’s commonly discussed alongside activations. Elementwise activations typically happen in hidden layers; softmax is a special output nonlinearity.
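To see why softmax is not elementwise, here is a small sketch. Subtracting the maximum logit before exponentiating is a standard stability trick added here as an assumption; it is not discussed in the text above, and it does not change the result.

```python
import math

def softmax(logits):
    m = max(logits)                               # shift so the largest exponent is 0
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])
q = softmax([1.0, 2.0, 5.0])   # only the last logit changed

print("p =", [round(v, 4) for v in p])
print("q =", [round(v, 4) for v in q])
```

Changing a single logit changes every output probability, because all entries share the same normalizing sum.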
Hidden layers: why ReLU/GELU are popular #
ReLU:
- simple and fast
- non-saturating on positive side
- encourages sparsity
GELU (common in Transformers):
- smooth version of “keep positive, damp negative”
- can yield slightly better optimization in some settings
When sigmoid/tanh are still useful #
Even though sigmoid/tanh can saturate, they remain important:
- Sigmoid is ideal for probabilities and gating.
- tanh is useful when you want bounded, zero-centered hidden state (classic RNNs).
A note on initialization and activation pairing #
Activations influence how variance propagates forward/backward. While you’ll study initialization formally later, the intuition is:
- If $f$ squashes too hard (sigmoid), signals and gradients shrink.
- If $f$ passes with slope ~1 for many inputs (ReLU on positive side), signals survive better.
This is why “He initialization” is often paired with ReLU-like activations, and “Xavier/Glorot” is often paired with tanh.
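As a rough illustration of the pairing, the sketch below computes the weight standard deviations in the commonly cited normal-distribution forms of the two schemes: variance 2/fan_in for He and 2/(fan_in + fan_out) for Xavier/Glorot. The function names `he_std` and `xavier_std` and the layer sizes are hypothetical, and real libraries offer several variants of each scheme.

```python
import math

def he_std(fan_in):
    # He (normal variant): Var(w) = 2 / fan_in, usually paired with ReLU-like units.
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot (normal variant): Var(w) = 2 / (fan_in + fan_out), usually paired with tanh.
    return math.sqrt(2.0 / (fan_in + fan_out))

fan_in, fan_out = 512, 128   # assumed layer shape
print("He std:    ", he_std(fan_in))
print("Xavier std:", xavier_std(fan_in, fan_out))
```

The extra factor of 2 in the He variance is often motivated by ReLU zeroing roughly half of its inputs.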
Summary: what properties you’re trading off #
| Property | Helps with | Often hurts |
|---|---|---|
| Bounded outputs (sigmoid/tanh) | stability, interpretability | saturation → vanishing gradients |
| Unbounded positive (ReLU) | gradient flow, simplicity | dead units, nonzero mean |
| Smoothness (tanh, softplus, GELU) | stable optimization, differentiability | can reduce sparsity; may saturate |
| Sparsity (ReLU) | regularization-like effect | too many inactive units |
Activation functions are not just “a nonlinearity.” They are a design lever that shapes geometry (expressiveness) and learning (gradient flow).
Worked Examples (3) #
Show that stacking linear layers without activations collapses to one linear layer #
Consider a 2-layer network with no activation functions:
Layer 1: $\mathbf{h} = \mathbf{W}_1\mathbf{x} + \mathbf{b}_1$
Layer 2: $\mathbf{y} = \mathbf{W}_2\mathbf{h} + \mathbf{b}_2$.
Show that $\mathbf{y}$ is an affine function of $\mathbf{x}$ and write the equivalent single-layer parameters.
Substitute Layer 1 into Layer 2:
$$\mathbf{y} = \mathbf{W}_2(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$
Distribute $\mathbf{W}_2$:
$$\mathbf{y} = \mathbf{W}_2\mathbf{W}_1\mathbf{x} + \mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2$$
Group terms as a single affine map $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$:
$$\mathbf{W} = \mathbf{W}_2\mathbf{W}_1, \qquad \mathbf{b} = \mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2$$
Conclude: any number of stacked linear layers (no nonlinear activation between them) equals one linear layer, so depth adds no representational power in that case.
Insight: This is the core “why before how” for activations: $f$ prevents this collapse, allowing each layer to change the function class rather than merely re-parameterize a linear map.
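The collapse can also be checked numerically. The sketch below uses small hand-picked matrices (all values are made up for illustration) and plain-Python helpers in place of a linear algebra library.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def vadd(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

# Assumed toy parameters: a 2 -> 3 -> 2 network without activations.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [0.5, -1.0, 2.0]
W2 = [[1.0, 0.0, -2.0], [2.0, 1.0, 1.0]]
b2 = [1.0, 0.0]

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = vadd(matvec(W2, b1), b2)

x = [2.0, -3.0]
two_layer = vadd(matvec(W2, vadd(matvec(W1, x), b1)), b2)
one_layer = vadd(matvec(W, x), b)

print("two layers:", two_layer)
print("one layer: ", one_layer)
```

Both computations agree for this input, and by the derivation above they agree for every input.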
Compute backprop through a single activation and see saturation numerically (sigmoid vs ReLU) #
Let a scalar neuron be $z = wx + b$, $a=f(z)$, and loss $L = \tfrac{1}{2}(a - y)^2$.
Fix $x=1$ and $y=1$, and treat $z$ as set by the choice of $w$ and $b$ (for example, $b=0$ with $w=0$ gives $z=0$, and $b=0$ with $w=-10$ gives $z=-10$).
Compare gradients $\partial L/\partial w$ for:
- sigmoid $f(z)=\sigma(z)$ at $z=0$ and at $z=-10$
- ReLU $f(z)=\max(0,z)$ at the same $z$ values.
Compute generic derivatives.
Since $L=\tfrac{1}{2}(a-y)^2$, we have:
$$\frac{\partial L}{\partial a} = a - y$$
And by the chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a}\cdot \frac{\partial a}{\partial z}\cdot \frac{\partial z}{\partial w} = (a-y)\,f'(z)\,x$$
With $x=1$: $\partial L/\partial w = (a-y)\,f'(z)$.
Case A (sigmoid) at $z=0$.
$a=\sigma(0)=0.5$.
$f'(0)=\sigma(0)(1-\sigma(0))=0.5\cdot 0.5=0.25$.
So $\partial L/\partial w = (0.5-1)\cdot 0.25 = (-0.5)\cdot 0.25 = -0.125$.
Case B (sigmoid) at $z=-10$.
$a=\sigma(-10)\approx 0.0000454$.
$f'(-10)=a(1-a)\approx 0.0000454\cdot 0.9999546\approx 0.0000454$.
So $\partial L/\partial w \approx (0.0000454-1)\cdot 0.0000454 \approx (-0.9999546)\cdot 0.0000454 \approx -0.0000454$.
The gradient magnitude shrank from $0.125$ to about $4.5\times 10^{-5}$ due to saturation.
Case C (ReLU) at $z=0$.
$a=\max(0,0)=0$.
At exactly 0, the ReLU derivative is undefined; implementations pick 0 or 1. Consider a tiny positive $z$ (e.g., $z=+\varepsilon$) to represent “active” behavior: then $f'(z)=1$.
If $z\approx 0^+$: $a\approx 0$, so $\partial L/\partial w = (0-1)\cdot 1 = -1$.
Case D (ReLU) at $z=-10$.
$a=0$ and $f'(z)=0$ (inactive).
So $\partial L/\partial w = (0-1)\cdot 0 = 0$.
No gradient flows: this is the dying-ReLU risk if many datapoints keep $z<0$.
Insight: Sigmoid gives you some gradient when saturated, but it can be extremely small; ReLU gives you strong gradients when active, but exactly zero when inactive. Training dynamics are dominated by these local derivative regimes.
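The four cases can be reproduced in a few lines. This sketch follows the worked example's setup (x = 1, y = 1, squared-error loss); the helper `grad_w` is a hypothetical name, and a tiny positive z stands in for the "just active" ReLU case.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def grad_w(f, fprime, z, x=1.0, y=1.0):
    # dL/dw = (a - y) * f'(z) * x for L = 0.5 * (a - y)^2
    return (f(z) - y) * fprime(z) * x

g_sig_0 = grad_w(sigmoid, sigmoid_grad, 0.0)      # Case A
g_sig_m10 = grad_w(sigmoid, sigmoid_grad, -10.0)  # Case B
g_relu_eps = grad_w(relu, relu_grad, 1e-9)        # Case C, just on the active side
g_relu_m10 = grad_w(relu, relu_grad, -10.0)       # Case D

print(f"sigmoid z=0:   {g_sig_0:.6f}")
print(f"sigmoid z=-10: {g_sig_m10:.3e}")
print(f"ReLU z=0+:     {g_relu_eps:.6f}")
print(f"ReLU z=-10:    {g_relu_m10:.6f}")
```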
Piecewise linearity in 1D: a tiny ReLU network becomes a ‘hinge sum’ #
Let $x \in \mathbb{R}$ and define a 2-unit ReLU network:
$$y(x) = v_1\,\operatorname{ReLU}(x - 1) + v_2\,\operatorname{ReLU}(x + 1) + c$$
Show explicitly that $y(x)$ is linear on each interval cut by the hinge points $x=-1$ and $x=1$.
Identify hinge points where each ReLU switches.
$\operatorname{ReLU}(x-1)$ switches at $x=1$.
$\operatorname{ReLU}(x+1)$ switches at $x=-1$.
So consider intervals: $(-\infty,-1)$, $[-1,1)$, $[1,\infty)$.
Interval 1: $x<-1$.
Then $x-1<0$ and $x+1<0$.
So both ReLUs are 0:
$y(x)=c$ (constant, hence linear).
Interval 2: $-1 \le x < 1$.
Then $x-1<0$ but $x+1\ge 0$.
So:
$\operatorname{ReLU}(x-1)=0$,
$\operatorname{ReLU}(x+1)=x+1$.
Thus $y(x)=v_2(x+1)+c = v_2 x + (v_2+c)$ (linear).
Interval 3: $x\ge 1$.
Then both $x-1\ge 0$ and $x+1\ge 0$.
So:
$y(x)=v_1(x-1)+v_2(x+1)+c = (v_1+v_2)x + (-v_1+v_2+c)$ (linear).
Conclude: the network is piecewise linear with ‘kinks’ at x=-1 and x=1; adding more ReLU units adds more hinge points, increasing shape flexibility.
Insight: This is the 1D version of the 2D partition picture: ReLUs introduce regions where different linear formulas apply, letting you build complex shapes by stitching simple pieces.
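The three slopes derived above (0, then $v_2$, then $v_1+v_2$) can be confirmed numerically. In this sketch the coefficients `v1`, `v2`, and `c` are arbitrary values chosen for illustration.

```python
def relu(z):
    return max(0.0, z)

v1, v2, c = 2.0, -1.0, 0.5   # assumed coefficients

def y(x):
    return v1 * relu(x - 1.0) + v2 * relu(x + 1.0) + c

def slope(x_a, x_b):
    # Slope of the chord between two points; exact when both lie in one linear piece.
    return (y(x_b) - y(x_a)) / (x_b - x_a)

slope_left = slope(-3.0, -2.0)    # x < -1: both units off
slope_mid = slope(-0.5, 0.5)      # -1 <= x < 1: only ReLU(x + 1) on
slope_right = slope(2.0, 3.0)     # x >= 1: both units on

print(slope_left, slope_mid, slope_right)
```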
Key Takeaways #
✓ Activation functions apply an elementwise mapping $a=f(z)$ to each neuron’s pre-activation $z$.
✓ Without nonlinear activations, stacked linear layers collapse into a single linear (affine) transformation, so depth gives no extra expressiveness.
✓ The derivative $f'(z)$ directly gates backpropagation: $\partial L/\partial z = (\partial L/\partial a)\,f'(z)$.
✓ Sigmoid and tanh saturate for large $|z|$, causing vanishing gradients in their tails; sigmoid’s maximum slope is only 0.25.
✓ ReLU is non-saturating for $z>0$ (good gradient flow) but has zero gradient for $z<0$ (risk of dead neurons).
✓ ReLU networks represent piecewise-linear functions; in higher dimensions, ReLU hyperplanes partition space into regions with different linear behaviors.
✓ Activation choice affects output range, zero-centering, sparsity, and numerical stability, so it influences both modeling and optimization.
Common Mistakes #
✗ Using sigmoid (or tanh) in many deep hidden layers without understanding saturation, then wondering why gradients vanish and training stalls.
✗ Assuming activations are “just a detail” and ignoring their derivatives, when $f'(z)$ is the main determinant of gradient flow locally.
✗ Forgetting that ReLU can die: if pre-activations stay negative, the unit outputs 0 and receives zero gradient (especially with large learning rates or biased initialization).
✗ Mismatching output activation to the task (e.g., using ReLU for probabilities, or sigmoid for unbounded regression).
Practice #
easy
Compute derivatives: (a) $\sigma'(z)$ for sigmoid $\sigma(z)=1/(1+e^{-z})$, and (b) $\tanh'(z)$. Then evaluate each derivative at $z=0$ and describe what it implies for gradient flow near the origin.
Hint: For sigmoid, try rewriting in terms of $\sigma(z)$ after differentiating. For tanh, use $\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$ or the identity $\tanh'(z)=1-\tanh^2(z)$.
Solution:
Sigmoid:
$$\sigma(z)=\frac{1}{1+e^{-z}}$$
Differentiate:
$$\sigma'(z)=\frac{e^{-z}}{(1+e^{-z})^2}$$
Using $\sigma(z)=1/(1+e^{-z})$, and therefore $1-\sigma(z)=e^{-z}/(1+e^{-z})$, this factors as:
$$\sigma'(z)=\sigma(z)\,(1-\sigma(z))$$
At $z=0$: $\sigma(0)=0.5$ so $\sigma'(0)=0.25$.
Tanh:
$$\tanh'(z)=1-\tanh^2(z)$$
At $z=0$: $\tanh(0)=0$ so $\tanh'(0)=1$.
Implication: near $z=0$, tanh passes gradient more strongly than sigmoid; sigmoid’s slope is capped at 0.25 even in its best region.
medium
Consider a 2-layer network with ReLU between layers: $\mathbf{h}=\operatorname{ReLU}(\mathbf{W}_1\mathbf{x}+\mathbf{b}_1)$ and $y=\mathbf{w}_2^\top\mathbf{h}+b_2$. Explain (in words) why this network can represent non-linear decision boundaries in x-space, unlike the same architecture without ReLU.
Hint: Focus on what happens in regions where each component of $\mathbf{W}_1\mathbf{x}+\mathbf{b}_1$ is positive vs negative.
Solution:
With ReLU, each hidden unit outputs either 0 (if its pre-activation is negative) or a linear function of x (if positive). The set of signs of the hidden pre-activations partitions input space into regions; within each region, the network behaves like a linear model, but different regions have different linear formulas because different subsets of hidden units are active. The boundary y=0 can therefore bend across regions, forming a non-linear decision boundary. Without ReLU, the whole network is just one affine map in x, so y=0 is a single hyperplane (linear boundary).
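A concrete instance of this argument is the XOR pattern, which no single affine map reproduces. The sketch below hand-picks weights for a 2-unit ReLU network of the form given in the exercise; the specific values of `W1`, `b1`, `w2`, and `b2` are an illustrative choice, not a trained solution.

```python
def relu(z):
    return max(0.0, z)

# Hand-picked parameters for h = ReLU(W1 x + b1), y = w2 . h + b2.
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
w2 = [1.0, -2.0]
b2 = 0.0

def net(x):
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return sum(wi * hi for wi, hi in zip(w2, h)) + b2

outputs = {x: net(x) for x in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]}
for x, out in outputs.items():
    print(x, "->", out)

# Any affine map f satisfies f(0,0) + f(1,1) == f(0,1) + f(1,0); this network does not.
diagonal_sum = outputs[(0.0, 0.0)] + outputs[(1.0, 1.0)]
off_diagonal_sum = outputs[(0.0, 1.0)] + outputs[(1.0, 0.0)]
```

The second hidden unit only switches on once $x_1 + x_2 > 1$, which is what bends the output back down at the corner (1, 1).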
hard
Dying ReLU thought experiment: Suppose a neuron uses ReLU and its pre-activation is $z = wx + b$. If your dataset has $x$ mostly around 0 and you initialize $b = -5$ with small $w$, what happens to this neuron during early training? Propose one fix.
Hint: Evaluate the sign of $z$ initially and connect it to $\operatorname{ReLU}'(z)$.
Solution:
If $x\approx 0$ and $w$ is small, then initially $z\approx b=-5$, so $z<0$ for most examples. The neuron outputs $a=\operatorname{ReLU}(z)=0$ and its derivative is $\operatorname{ReLU}'(z)=0$ in that region. In backprop, $\partial L/\partial z$ gets multiplied by 0, so $\partial L/\partial w$ and $\partial L/\partial b$ for that neuron are 0 on those examples; it may not recover and becomes a dead unit. Fixes include: (1) initialize biases closer to 0 (or slightly positive) so some examples activate the unit, (2) use Leaky ReLU so the negative side has slope $\alpha>0$ and gradients can update $w$ and $b$, or (3) use normalization (e.g., BatchNorm) to keep the distribution of $z$ near 0.
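The scenario in this solution can be simulated with a few steps of gradient descent. This is a toy sketch under stated assumptions: the dataset, the targets of 1, the learning rate, the step count, and the Leaky ReLU slope of 0.01 are all made up for illustration, and `train` is a hypothetical helper.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    return alpha if z < 0 else 1.0

def train(act, act_grad, steps=50, lr=0.1):
    """Full-batch gradient descent on L = 0.5 * (a - y)^2 for one scalar neuron."""
    w, b = 0.01, -5.0                                        # small w, bias at -5
    data = [(-0.2, 1.0), (0.0, 1.0), (0.1, 1.0), (0.3, 1.0)]  # x near 0, assumed targets
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:
            z = w * x + b
            delta = (act(z) - y) * act_grad(z)               # dL/dz
            gw += delta * x
            gb += delta
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

w_relu, b_relu = train(relu, relu_grad)
w_leaky, b_leaky = train(leaky_relu, leaky_relu_grad)

print("ReLU after training:      ", w_relu, b_relu)
print("Leaky ReLU after training:", w_leaky, b_leaky)
```

With ReLU the parameters never move, because every example gives a zero gradient. With Leaky ReLU the bias creeps upward, although slowly, since each update is scaled by the small negative-side slope.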
Connections #
Next nodes you can explore:
- Deep Learning: activations become even more crucial as depth increases (gradient flow, expressiveness, architectural defaults like ReLU/GELU).
- Numerical Stability and Conditioning: stable implementations of sigmoid/softplus, saturation as an optimization/stability issue, and why scaling/normalization matter.