Activation Functions


What is an activation function?

An activation function is a small mathematical function applied to the output of each neuron (node) in a neural network.

It decides: “Given the weighted sum of inputs this neuron received, what should it finally output?”

Without an activation function → the neuron's output is just a linear combination (weighted sum + bias)

With an activation function → the neuron's output becomes something more expressive (usually non-linear)
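
As a minimal sketch (plain NumPy, with made-up numbers), here is that difference for a single neuron:

```python
import numpy as np

# Made-up inputs, weights, and bias for a single neuron (purely illustrative).
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum + bias: the purely linear part

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(z)            # raw linear output: any real number
print(sigmoid(z))   # after a non-linear activation: squashed into (0, 1)
```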

Why Do We Need Activation Functions? (The Real Reason)

There are two big reasons:

Reason 1: To Introduce Non-Linearity (The Most Important Reason)

Real-world data is not linear.

Examples:

  • Is this a cat or dog in the photo? → Decision boundary is curved/complex
  • Will the stock price go up or down? → Highly non-linear
  • Will a customer click the ad? → Non-linear patterns

Fact: If you use only linear functions (or no activation), then even a 1000-layer neural network collapses into one single linear equation.

→ Your deep network becomes exactly the same as simple linear regression → extremely weak!

Non-linear activation functions (ReLU, GELU, Sigmoid, etc.) allow the network to learn curves, corners, complex patterns → that’s what makes deep learning powerful.
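
To see the collapse concretely, here is a tiny NumPy sketch (the weights are arbitrary and made up): two linear layers with no activation in between compute exactly the same thing as one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with arbitrary weights and biases, no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Layer-by-layer computation.
h = W1 @ x + b1
y = W2 @ h + b2

# The same result from a single collapsed linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
y_collapsed = W @ x + b

print(np.allclose(y, y_collapsed))  # True: the extra depth added no expressive power
```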

Reason 2: To Control and Stabilize Neuron Behavior

Different activations help in different ways:

  • Sigmoid: squashes output to 0–1 → perfect for probability
  • ReLU: turns off negative values → speeds up training
  • Tanh: centers data around 0 → helps gradients flow
  • Softmax: turns numbers into probabilities that sum to 1
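
Each of those behaviors can be checked in a few lines; a small NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # negatives become 0

def softmax(x):
    e = np.exp(x - np.max(x))         # subtract the max for numerical stability
    return e / e.sum()                # probabilities that sum to 1

x = np.array([-3.0, -0.5, 0.0, 2.0, 5.0])
print(sigmoid(x))        # all values squashed into (0, 1)
print(relu(x))           # [0. 0. 0. 2. 5.]
print(np.tanh(x))        # values in (-1, 1), centered around 0
print(softmax(x).sum())  # 1.0
```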

Simple Analogy

Think of a neuron as a tiny decision-maker in the brain.

  • Without an activation function → it just passes on the raw sum: “I got 5 votes, I say 5.”
  • With an activation function → it thinks: “I got 5 votes → that’s a lot → I’m very confident → output almost 1,” or “I got −3 votes → negative → I stay quiet → output close to 0.”

That “thinking” step is the activation function.

Below is a practical guide to the most important activation functions used in neural networks today (2025): what they do, why they exist, when to use them, and their pros and cons, with no fluff.

Sigmoid
  Formula: σ(x) = 1 / (1 + e⁻ˣ)
  Output range: (0, 1)
  Where it's used: old binary classifiers (output layer)
  Pros: gives probability-like output
  Cons: vanishing gradients, outputs saturate → kills learning in deep nets; almost never used in hidden layers now

Tanh (Hyperbolic Tangent)
  Formula: tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
  Output range: (−1, 1)
  Where it's used: hidden layers in older RNNs/LSTMs
  Pros: zero-centered → faster convergence than sigmoid
  Cons: still suffers from vanishing gradients (just less severely than sigmoid); mostly replaced

ReLU (Rectified Linear Unit)
  Formula: f(x) = max(0, x)
  Output range: [0, ∞)
  Where it's used: default choice for hidden layers in 2025
  Pros: very fast, no vanishing gradient for x > 0, biologically plausible
  Cons: dying ReLU problem: a neuron can permanently output 0 and stop learning if its input is always negative

Leaky ReLU
  Formula: f(x) = x if x > 0, else αx (α ≈ 0.01)
  Output range: (−∞, ∞)
  Where it's used: hidden layers when ReLU dies too much
  Pros: fixes dying ReLU by allowing a small negative slope
  Cons: slight extra computation; α is a hyperparameter

PReLU (Parametric ReLU)
  Formula: same as Leaky ReLU, but α is learned
  Output range: (−∞, ∞)
  Where it's used: some top models (ResNet, etc.)
  Pros: α adapts per channel → often better than fixed Leaky ReLU
  Cons: more parameters

ELU (Exponential Linear Unit)
  Formula: f(x) = x if x > 0, else α(eˣ − 1)
  Output range: (−α, ∞)
  Where it's used: hidden layers when you want smoother negative values
  Pros: near-zero-centered mean, no dying neurons, faster learning in some cases
  Cons: slower than ReLU (the exponential is expensive)

GELU (Gaussian Error Linear Unit)
  Formula: f(x) = x × Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
  Output range: ≈ (−0.17, ∞)
  Where it's used: Transformers (BERT, GPT, Llama, Grok, etc.)
  Pros: smooth, probabilistic interpretation, state-of-the-art performance
  Cons: slightly slower than ReLU

Swish / SiLU
  Formula: f(x) = x × sigmoid(x)
  Output range: ≈ (−0.28, ∞)
  Where it's used: modern CNNs, some Transformers
  Pros: often beats ReLU/GELU on many tasks; smooth, non-monotonic
  Cons: slightly slower

Mish
  Formula: f(x) = x × tanh(ln(1 + eˣ))
  Output range: ≈ (−0.31, ∞)
  Where it's used: some vision models
  Pros: claims to outperform Swish in some cases
  Cons: rarely used in practice

Softmax
  Formula: softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
  Output range: (0, 1), and the outputs sum to 1
  Where it's used: output layer for multi-class classification
  Pros: turns logits into probabilities
  Cons: can be numerically unstable (use logits directly in the loss instead)

Linear / Identity
  Formula: f(x) = x
  Output range: (−∞, ∞)
  Where it's used: regression output layers, last layer of GAN generators
  Pros: no squashing → can predict any real number
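
For reference, here is a NumPy sketch of the formulas above (the GELU uses the tanh approximation from its entry, and the softmax subtracts the max, which is the standard fix for the instability noted in its entry):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # tanh approximation of x * Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    return x / (1.0 + np.exp(-x))            # x * sigmoid(x)

def mish(x):
    return x * np.tanh(np.log1p(np.exp(x)))  # x * tanh(softplus(x))

def softmax(x):
    e = np.exp(x - np.max(x))                # keeps exp() from overflowing
    return e / e.sum()

x = np.linspace(-3, 3, 7)
print(gelu(x))
print(softmax(x))
```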

What You Should Actually Use

  • Image classification (CNNs): ReLU or GELU in hidden layers; Softmax at the output (multi-class)
  • Transformers (GPT, BERT, etc.): GELU (original) or Swish/SiLU in hidden layers; Softmax at the output (or none for next-token prediction)
  • Binary classification: ReLU / GELU in hidden layers; Sigmoid at the output
  • Regression (predict numbers): ReLU / GELU in hidden layers; Linear (no activation) at the output
  • Deep networks (>20 layers): GELU, Swish, or Leaky ReLU/PReLU in hidden layers
  • When training is unstable: try Swish or GELU
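
As one way to wire these defaults together, here is a minimal Keras sketch (assuming TensorFlow 2.x; the layer sizes and the 28×28, 10-class input are made up for illustration). It keeps the output as raw logits and applies softmax inside the loss, which is the numerically stable option noted earlier:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical multi-class image classifier: GELU/ReLU in hidden layers,
# softmax handled inside the loss.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(256, activation="gelu"),  # hidden layer: GELU (ReLU also fine)
    layers.Dense(128, activation="relu"),  # hidden layer: ReLU
    layers.Dense(10),                      # raw logits, no activation here
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.summary()
```

For regression the last Dense layer would simply have no activation (one unit per target), and for binary classification it would be Dense(1, activation="sigmoid"), matching the recommendations above.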
