Activation Functions
What is an activation function?
An activation function is a small mathematical function applied to the output of each neuron (node) in a neural network.
It decides: “Given the weighted sum of inputs this neuron received, what should it finally output?”
Without an activation function → neuron output = just a linear combination (weighted sum + bias)
With an activation function → neuron output = something more expressive (usually non-linear)
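To make this concrete, here is a minimal sketch (plain NumPy, with made-up inputs and weights) of a single neuron computed both ways:

```python
import numpy as np

def neuron(x, w, b, activation=None):
    """Weighted sum + bias, optionally passed through an activation."""
    z = np.dot(w, x) + b          # linear combination
    if activation is None:
        return z                  # raw linear output
    return activation(z)          # non-linear output

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])    # example inputs (arbitrary)
w = np.array([0.8, 0.1, -0.4])    # example weights (arbitrary)
b = 0.2

print(neuron(x, w, b))            # just the weighted sum + bias
print(neuron(x, w, b, sigmoid))   # same sum squashed into (0, 1)
```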
Why Do We Need Activation Functions? (The Real Reason)
There are two big reasons:
Reason 1: To Introduce Non-Linearity (The Most Important Reason)
Real-world data is not linear.
Examples:
- Is this a cat or dog in the photo? → Decision boundary is curved/complex
- Will the stock price go up or down? → Highly non-linear
- Will a customer click the ad? → Non-linear patterns
Fact: If you use only linear functions (or no activation), then even a 1000-layer neural network collapses into a single linear equation.
→ Your deep network becomes exactly the same as simple linear regression → extremely weak!
Non-linear activation functions (ReLU, GELU, Sigmoid, etc.) allow the network to learn curves, corners, and complex patterns → that’s what makes deep learning powerful.
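This collapse is easy to check numerically. Below is a small sketch (plain NumPy, random weights chosen only for illustration) showing that two stacked linear layers are exactly one linear layer, and that inserting a ReLU between them breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))             # a batch of 4 inputs with 3 features

# Two "layers" with no activation: y = (x @ W1) @ W2
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))
two_linear_layers = (x @ W1) @ W2

# They collapse into ONE linear layer with weight matrix W1 @ W2
one_linear_layer = x @ (W1 @ W2)
print(np.allclose(two_linear_layers, one_linear_layer))   # True → no extra power

# Insert a ReLU between the layers and the equivalence disappears
relu = lambda z: np.maximum(0.0, z)
with_relu = relu(x @ W1) @ W2
print(np.allclose(with_relu, one_linear_layer))            # False → genuinely deeper
```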
Reason 2: To Control and Stabilize Neuron Behavior
Different activations help in different ways:
| Activation | What It Controls |
| --- | --- |
| Sigmoid | Squashes output to 0–1 → perfect for probabilities |
| ReLU | Turns off negative values → speeds up training |
| Tanh | Centers data around 0 → helps gradients flow |
| Softmax | Turns numbers into probabilities that sum to 1 |
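As a quick illustration (a NumPy sketch; the input values are arbitrary), here is how each of these functions reshapes the same raw scores:

```python
import numpy as np

z = np.array([-3.0, -0.5, 0.0, 2.0, 5.0])   # raw neuron outputs ("logits")

sigmoid = 1.0 / (1.0 + np.exp(-z))           # each value squashed into (0, 1)
relu    = np.maximum(0.0, z)                 # negatives turned off, positives unchanged
tanh    = np.tanh(z)                         # squashed into (-1, 1), centered at 0
softmax = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # whole vector → probabilities

print(sigmoid)          # roughly [0.047 0.378 0.5 0.881 0.993]
print(relu)             # [0. 0. 0. 2. 5.]
print(tanh)             # roughly [-0.995 -0.462 0. 0.964 1.0]
print(softmax.sum())    # 1.0
```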
Simple Analogy
Think of a neuron as a tiny decision-maker in the brain.
- Without an activation function → it just passes on the raw sum: “I got 5 votes, I say 5”
- With an activation function → it thinks: “I got 5 votes → that’s a lot → I’m very confident → output almost 1” or “I got -3 votes → negative → I stay quiet → output 0”
That “thinking” step is the activation function.
A practical guide to the most important activation functions used in neural networks today (2025).
I’ll explain what they do, why they exist, when to use them, and their pros and cons. No fluff.
| Activation Function | Formula | Output Range | Where It's Used | Pros | Cons / Problems |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | σ(x) = 1 / (1 + e⁻ˣ) | (0, 1) | Old binary classifiers (output layer) | Gives probability-like output | Vanishing gradients; outputs saturate → kills learning in deep nets. Almost never used in hidden layers now. |
| Tanh (Hyperbolic Tangent) | tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) | (-1, 1) | Hidden layers in older RNNs/LSTMs | Zero-centered → faster convergence than sigmoid | Still has vanishing gradients (just milder than sigmoid). Mostly replaced. |
| ReLU (Rectified Linear Unit) | f(x) = max(0, x) | [0, ∞) | Default choice for hidden layers in 2025 | Very fast, no vanishing gradient for x > 0, biologically plausible | Dying ReLU problem: neurons can permanently output 0 and stop learning if their input is always negative |
| Leaky ReLU | f(x) = x if x > 0 else αx (α ≈ 0.01) | (-∞, ∞) | Hidden layers when ReLU dies too much | Fixes dying ReLU by allowing a small negative slope | Slight extra computation; α is a hyperparameter |
| PReLU (Parametric ReLU) | Same as Leaky ReLU, but α is learned | (-∞, ∞) | Used in some top models (ResNet, etc.) | α adapts per channel → often better than fixed Leaky ReLU | More parameters |
| ELU (Exponential Linear Unit) | f(x) = x if x > 0 else α(eˣ − 1) | (-α, ∞) | Hidden layers when you want smoother negatives | Zero-centered mean, no dying neurons, faster learning in some cases | Slower than ReLU (exponential is expensive) |
| GELU (Gaussian Error Linear Unit) | x × Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))) | ≈ (−0.17, ∞) | Transformers (BERT, GPT, Llama, Grok, etc.) | Smooth, probabilistic interpretation, state-of-the-art performance | Slightly slower than ReLU |
| Swish / SiLU | x × sigmoid(x) | ≈ (−0.28, ∞) | Modern CNNs, some Transformers | Often beats ReLU/GELU on many tasks; smooth, non-monotonic | Slightly slower |
| Mish | x × tanh(ln(1 + eˣ)) | ≈ (−0.31, ∞) | Some vision models | Claims to outperform Swish in some cases | Rarely used in practice |
| Softmax | softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ) | (0, 1), sums to 1 | Output layer for multi-class classification | Turns logits into probabilities | Can be numerically unstable (use logits directly in the loss instead) |
| Linear / Identity | f(x) = x | (-∞, ∞) | Regression output layers, last layer of GAN generators | No squashing → can predict any real number | — |
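To connect the formulas to code, here is a short NumPy sketch (my own reference implementations, not taken from any library) of a few of the functions above, using the tanh approximation of GELU from the table:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # tanh approximation shown in the table above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
for name, fn in [("ReLU", relu), ("LeakyReLU", leaky_relu),
                 ("ELU", elu), ("GELU", gelu), ("SiLU", silu)]:
    print(f"{name:10s}", np.round(fn(x), 3))
```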
What You Should Actually Use
| Task | Hidden Layers Activation | Output Layer Activation |
| --- | --- | --- |
| Image classification (CNNs) | ReLU or GELU | Softmax (multi-class) |
| Transformers (GPT, BERT, etc.) | GELU (original) or Swish/SiLU | Softmax (or none for next-token) |
| Binary classification | ReLU / GELU | Sigmoid |
| Regression (predict numbers) | ReLU / GELU | Linear (no activation) |
| Deep networks (>20 layers) | GELU, Swish, or Leaky ReLU/PReLU | — |
| When training is unstable | Try Swish or GELU | — |
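As a concrete illustration of these defaults, here is a minimal sketch (assuming PyTorch; the layer sizes are arbitrary) of a classifier that uses GELU in its hidden layers and outputs raw logits, since nn.CrossEntropyLoss applies softmax internally:

```python
import torch
import torch.nn as nn

class SmallClassifier(nn.Module):
    def __init__(self, in_features=784, hidden=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.GELU(),                       # non-linearity in the hidden layers
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_classes),  # raw logits, no softmax here
        )

    def forward(self, x):
        return self.net(x)

model = SmallClassifier()
x = torch.randn(32, 784)                     # a dummy batch
logits = model(x)
targets = torch.randint(0, 10, (32,))        # dummy class labels
loss = nn.CrossEntropyLoss()(logits, targets)  # softmax happens inside the loss
```

For binary classification you would swap the output for a single unit with sigmoid (or use BCEWithLogitsLoss on raw logits), and for regression you would leave the last Linear layer with no activation at all, matching the table above.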