Activation Functions


What is an activation function?

An activation function is a small mathematical function applied to the output of each neuron (node) in a neural network.

It decides: “Given the weighted sum of inputs this neuron received, what should it finally output?”

Without an activation function → the neuron's output is just a linear combination (weighted sum + bias)

With an activation function → the neuron's output becomes something more expressive (usually non-linear)
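
As a minimal sketch (plain NumPy, with made-up numbers), here is that difference for a single neuron:

```python
import numpy as np

# Made-up inputs, weights, and bias for a single neuron (purely illustrative).
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum + bias: the purely linear part

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(z)            # raw linear output: any real number
print(sigmoid(z))   # after a non-linear activation: squashed into (0, 1)
```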

Why Do We Need Activation Functions? (The Real Reason)

There are two big reasons:

Reason 1: To Introduce Non-Linearity (The Most Important Reason)

Real-world data is not linear.

Examples:

  • Is this a cat or dog in the photo? → Decision boundary is curved/complex
  • Will the stock price go up or down? → Highly non-linear
  • Will a customer click the ad? → Non-linear patterns

Fact: If you use only linear functions (or no activation), then even a 1000-layer neural network collapses into one single linear equation.

→ Your deep network becomes exactly the same as simple linear regression → extremely weak!

Non-linear activation functions (ReLU, GELU, Sigmoid, etc.) allow the network to learn curves, corners, complex patterns → that’s what makes deep learning powerful.
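
To see the collapse concretely, here is a tiny NumPy sketch (the weights are arbitrary and made up): two linear layers with no activation in between compute exactly the same thing as one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with arbitrary weights and biases, no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Layer-by-layer computation.
h = W1 @ x + b1
y = W2 @ h + b2

# The same result from a single collapsed linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
y_collapsed = W @ x + b

print(np.allclose(y, y_collapsed))  # True: the extra depth added no expressive power
```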

Reason 2: To Control and Stabilize Neuron Behavior

Different activations help in different ways:

  • Sigmoid: squashes output to 0–1 → perfect for probability
  • ReLU: turns off negative values → speeds up training
  • Tanh: centers data around 0 → helps gradients flow
  • Softmax: turns numbers into probabilities that sum to 1
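
Each of those behaviors can be checked in a few lines; a small NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # negatives become 0

def softmax(x):
    e = np.exp(x - np.max(x))         # subtract the max for numerical stability
    return e / e.sum()                # probabilities that sum to 1

x = np.array([-3.0, -0.5, 0.0, 2.0, 5.0])
print(sigmoid(x))        # all values squashed into (0, 1)
print(relu(x))           # [0. 0. 0. 2. 5.]
print(np.tanh(x))        # values in (-1, 1), centered around 0
print(softmax(x).sum())  # 1.0
```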

Simple Analogy

Think of a neuron as a tiny decision-maker in the brain.

  • Without an activation function → it just passes on the raw sum: “I got 5 votes, I say 5.”
  • With an activation function → it thinks: “I got 5 votes → that’s a lot → I’m very confident → output almost 1,” or “I got −3 votes → negative → I stay quiet → output close to 0.”

That “thinking” step is the activation function.

Below is a practical guide to the most important activation functions used in neural networks today (2025): what they do, why they exist, when to use them, and their pros and cons, with no fluff.

Sigmoid
  Formula: σ(x) = 1 / (1 + e⁻ˣ)
  Output range: (0, 1)
  Where it's used: old binary classifiers (output layer)
  Pros: gives probability-like output
  Cons: vanishing gradients, outputs saturate → kills learning in deep nets; almost never used in hidden layers now

Tanh (Hyperbolic Tangent)
  Formula: tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
  Output range: (−1, 1)
  Where it's used: hidden layers in older RNNs/LSTMs
  Pros: zero-centered → faster convergence than sigmoid
  Cons: still suffers from vanishing gradients (just less severely than sigmoid); mostly replaced

ReLU (Rectified Linear Unit)
  Formula: f(x) = max(0, x)
  Output range: [0, ∞)
  Where it's used: default choice for hidden layers in 2025
  Pros: very fast, no vanishing gradient for x > 0, biologically plausible
  Cons: dying ReLU problem: a neuron can permanently output 0 and stop learning if its input is always negative

Leaky ReLU
  Formula: f(x) = x if x > 0, else αx (α ≈ 0.01)
  Output range: (−∞, ∞)
  Where it's used: hidden layers when ReLU dies too much
  Pros: fixes dying ReLU by allowing a small negative slope
  Cons: slight extra computation; α is a hyperparameter

PReLU (Parametric ReLU)
  Formula: same as Leaky ReLU, but α is learned
  Output range: (−∞, ∞)
  Where it's used: some top models (ResNet, etc.)
  Pros: α adapts per channel → often better than fixed Leaky ReLU
  Cons: more parameters

ELU (Exponential Linear Unit)
  Formula: f(x) = x if x > 0, else α(eˣ − 1)
  Output range: (−α, ∞)
  Where it's used: hidden layers when you want smoother negative values
  Pros: near-zero-centered mean, no dying neurons, faster learning in some cases
  Cons: slower than ReLU (the exponential is expensive)

GELU (Gaussian Error Linear Unit)
  Formula: f(x) = x × Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
  Output range: ≈ (−0.17, ∞)
  Where it's used: Transformers (BERT, GPT, Llama, Grok, etc.)
  Pros: smooth, probabilistic interpretation, state-of-the-art performance
  Cons: slightly slower than ReLU

Swish / SiLU
  Formula: f(x) = x × sigmoid(x)
  Output range: ≈ (−0.28, ∞)
  Where it's used: modern CNNs, some Transformers
  Pros: often beats ReLU/GELU on many tasks; smooth, non-monotonic
  Cons: slightly slower

Mish
  Formula: f(x) = x × tanh(ln(1 + eˣ))
  Output range: ≈ (−0.31, ∞)
  Where it's used: some vision models
  Pros: claims to outperform Swish in some cases
  Cons: rarely used in practice

Softmax
  Formula: softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
  Output range: (0, 1), and the outputs sum to 1
  Where it's used: output layer for multi-class classification
  Pros: turns logits into probabilities
  Cons: can be numerically unstable (use logits directly in the loss instead)

Linear / Identity
  Formula: f(x) = x
  Output range: (−∞, ∞)
  Where it's used: regression output layers, last layer of GAN generators
  Pros: no squashing → can predict any real number
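
For reference, here is a NumPy sketch of the formulas above (the GELU uses the tanh approximation from its entry, and the softmax subtracts the max, which is the standard fix for the instability noted in its entry):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # tanh approximation of x * Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    return x / (1.0 + np.exp(-x))            # x * sigmoid(x)

def mish(x):
    return x * np.tanh(np.log1p(np.exp(x)))  # x * tanh(softplus(x))

def softmax(x):
    e = np.exp(x - np.max(x))                # keeps exp() from overflowing
    return e / e.sum()

x = np.linspace(-3, 3, 7)
print(gelu(x))
print(softmax(x))
```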

What You Should Actually Use

  • Image classification (CNNs): ReLU or GELU in hidden layers; Softmax at the output (multi-class)
  • Transformers (GPT, BERT, etc.): GELU (original) or Swish/SiLU in hidden layers; Softmax at the output (or none for next-token prediction)
  • Binary classification: ReLU / GELU in hidden layers; Sigmoid at the output
  • Regression (predict numbers): ReLU / GELU in hidden layers; Linear (no activation) at the output
  • Deep networks (>20 layers): GELU, Swish, or Leaky ReLU/PReLU in hidden layers
  • When training is unstable: try Swish or GELU
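
As one way to wire these defaults together, here is a minimal Keras sketch (assuming TensorFlow 2.x; the layer sizes and the 28×28, 10-class input are made up for illustration). It keeps the output as raw logits and applies softmax inside the loss, which is the numerically stable option noted earlier:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical multi-class image classifier: GELU/ReLU in hidden layers,
# softmax handled inside the loss.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(256, activation="gelu"),  # hidden layer: GELU (ReLU also fine)
    layers.Dense(128, activation="relu"),  # hidden layer: ReLU
    layers.Dense(10),                      # raw logits, no activation here
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.summary()
```

For regression the last Dense layer would simply have no activation (one unit per target), and for binary classification it would be Dense(1, activation="sigmoid"), matching the recommendations above.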
