Activation Functions

Activation functions are a fundamental component of neural networks. They introduce non-linearity into the model, enabling the network to learn complex patterns and relationships in data.

Why Do We Need Activation Functions?

Non-Linearity is Essential

If activation functions were linear (or absent), stacking multiple layers would still result in a linear transformation:

$$f(W_2 \cdot f(W_1 \cdot x)) = f(W_2 W_1 x) = W_{\text{combined}}\, x$$

This would make deep networks no more powerful than a single-layer network. Non-linearity allows neural networks to learn complex patterns, decision boundaries, and hierarchical representations.
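
As a quick sanity check of this claim, the sketch below (plain NumPy, with layer sizes chosen only for illustration) composes two weight matrices without an activation and confirms the result equals a single combined linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between.
W1 = rng.normal(size=(4, 3))   # first layer: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))   # second layer: 4 hidden units -> 2 outputs
x = rng.normal(size=3)

deep = W2 @ (W1 @ x)           # stacked linear layers...

W_combined = W2 @ W1           # ...collapse to a single linear map
shallow = W_combined @ x

print(np.allclose(deep, shallow))  # True: depth added no expressive power
```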

Differentiability for Gradient-Based Optimization

For gradient-based optimization (like gradient descent), activation functions must be differentiable. The gradient of the loss with respect to weights requires taking derivatives through the activation functions (via backpropagation):

$$\frac{\partial J}{\partial \theta} = \frac{\partial J}{\partial h} \cdot \frac{\partial h}{\partial \theta}$$

If the activation function is not differentiable, we cannot compute these gradients for backpropagation.
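
As a minimal illustration of this chain rule, the sketch below differentiates a one-parameter "network" $h = \sigma(\theta x)$ with a squared-error loss and checks the analytic gradient against a finite-difference estimate. The variable names and values are illustrative, not part of any particular framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One-parameter "network": h = sigmoid(theta * x), loss J = (h - y)^2
x, y, theta = 1.5, 0.0, 0.8

def loss(t):
    return (sigmoid(t * x) - y) ** 2

# Chain rule: dJ/dtheta = dJ/dh * dh/dz * dz/dtheta
h = sigmoid(theta * x)
dJ_dh = 2.0 * (h - y)
dh_dz = h * (1.0 - h)        # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dz_dtheta = x
analytic = dJ_dh * dh_dz * dz_dtheta

# Finite-difference check of the same derivative
eps = 1e-6
numeric = (loss(theta + eps) - loss(theta - eps)) / (2.0 * eps)

print(analytic, numeric)     # the two estimates agree to several decimal places
```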

Common Activation Functions

| Activation | Formula | Derivative | Range | Properties |
|---|---|---|---|---|
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ | $(0, 1)$ | Output interpretable as a probability, prone to vanishing gradients |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $\tanh'(x) = 1 - \tanh^2(x)$ | $(-1, 1)$ | Zero-centered, still suffers from vanishing gradients |
| ReLU | $\text{ReLU}(x) = \max(0, x)$ | $\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$ | $[0, \infty)$ | Most popular, computationally efficient, can suffer from "dying ReLU" |
| Leaky ReLU | $\text{LeakyReLU}(x) = \max(0.01x, x)$ | $\text{LeakyReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0.01 & x \leq 0 \end{cases}$ | $(-\infty, \infty)$ | Prevents the dying ReLU problem |
| ELU | $\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$ | $\text{ELU}'(x) = \begin{cases} 1 & x > 0 \\ \text{ELU}(x) + \alpha & x \leq 0 \end{cases}$ | $(-\alpha, \infty)$ | Smooth, self-normalizing, can be slow to compute |
| Softmax | $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ | $\frac{\partial\, \text{softmax}(x_i)}{\partial x_j} = \text{softmax}(x_i)(\delta_{ij} - \text{softmax}(x_j))$ | $(0, 1)$ | Used for multi-class classification (output layer) |
note

ReLU is technically not differentiable at $x = 0$, but in practice we define $\text{ReLU}'(0) = 0$ (or 1), which works well.

Detailed Analysis

Sigmoid Function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Pros:

  • Output range (0,1) can be interpreted as probabilities
  • Smooth gradient
  • Historically important

Cons:

  • Vanishing gradient problem: For inputs of large magnitude (very positive or very negative), the gradient becomes very small (≈ 0), slowing learning
  • Not zero-centered: Can cause zig-zagging dynamics during gradient descent
  • Computationally expensive (exponential)

Use cases: Binary classification (output layer), gates in LSTM/GRU
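
To see the vanishing-gradient problem concretely, the snippet below evaluates the sigmoid derivative at a few representative inputs (the specific values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid'(x) = {sigmoid_grad(x):.2e}")
# The gradient peaks at 0.25 at x = 0 and is ~4.5e-05 by x = 10,
# so layers behind a saturated sigmoid receive almost no learning signal.
```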

Tanh (Hyperbolic Tangent)

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$

Pros:

  • Zero-centered output (better than sigmoid)
  • Smooth gradient

Cons:

  • Still suffers from vanishing gradient problem
  • Computationally expensive

Use cases: Hidden layers in RNNs, gates in LSTM/GRU
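
A short numerical check of the identity above and of the zero-centered property (input values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True

# Unlike sigmoid, tanh outputs are centered around zero
z = np.random.default_rng(0).normal(size=100_000)
print(np.tanh(z).mean())   # close to 0.0
print(sigmoid(z).mean())   # close to 0.5
```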

ReLU (Rectified Linear Unit)

$$\text{ReLU}(x) = \max(0, x)$$

Pros:

  • Very computationally efficient: Simple max operation
  • Helps mitigate vanishing gradient problem
  • Encourages sparse activations (many neurons output 0)
  • Empirically shown to accelerate convergence

Cons:

  • Dying ReLU problem: A neuron can permanently "die" (always output 0) if its weights shift so that the pre-activation is negative for every input; its gradient is then always 0 and it stops updating
  • Not zero-centered
  • Unbounded output

Use cases: Default choice for hidden layers in feedforward and convolutional neural networks
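
A minimal sketch of ReLU and its (sub)gradient, showing the sparsity effect on random pre-activations (the input distribution is only illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Convention: gradient taken as 0 at x = 0 (see the note above)
    return (x > 0).astype(float)

z = np.random.default_rng(0).normal(size=10_000)   # random pre-activations
a = relu(z)

print(f"fraction of exactly-zero activations: {(a == 0).mean():.2f}")   # roughly 0.5 here
print(f"fraction of zero gradients:           {(relu_grad(z) == 0).mean():.2f}")
```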

Leaky ReLU

$$\text{LeakyReLU}(x) = \max(0.01x, x) = \begin{cases} x & x > 0 \\ 0.01x & x \leq 0 \end{cases}$$

Pros:

  • Fixes the dying ReLU problem (negative inputs have small non-zero gradient)
  • Computationally efficient

Cons:

  • The slope for negative values (0.01) is a hyperparameter
  • Not always better than ReLU in practice

Variants:

  • Parametric ReLU (PReLU): Learns the slope parameter during training
  • Randomized Leaky ReLU (RReLU): Randomizes the slope during training
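
A minimal Leaky ReLU sketch with the negative slope exposed as a parameter (0.01 by default, matching the formula above); making this slope learnable is essentially what PReLU does.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    return np.where(x > 0, x, negative_slope * x)

def leaky_relu_grad(x, negative_slope=0.01):
    return np.where(x > 0, 1.0, negative_slope)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))        # negative inputs are scaled by 0.01 instead of clipped to 0
print(leaky_relu_grad(x))   # the gradient on the negative side is 0.01, never exactly 0
```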

ELU (Exponential Linear Unit)

$$\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$

Pros:

  • Smooth everywhere (including at 0)
  • Negative saturation helps make the network more robust to noise
  • Closer to zero-mean outputs
  • Can lead to faster learning

Cons:

  • Computationally more expensive due to exponential
  • Requires tuning of the $\alpha$ hyperparameter
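
A sketch of ELU and its derivative with $\alpha = 1.0$ (a common default, assumed here), illustrating the smooth saturation toward $-\alpha$ on the negative side:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # For x <= 0: d/dx [alpha * (e^x - 1)] = alpha * e^x = ELU(x) + alpha
    return np.where(x > 0, 1.0, elu(x, alpha) + alpha)

x = np.array([-10.0, -1.0, 0.0, 3.0])
print(elu(x))        # negative side saturates smoothly near -alpha
print(elu_grad(x))   # gradient shrinks toward 0 for very negative inputs, is 1 for x > 0
```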

Softmax

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

Properties:

  • Outputs sum to 1 (can be interpreted as probability distribution)
  • Differentiable
  • Sensitive to outliers (large values dominate)

Use cases: Multi-class classification (output layer), attention mechanisms
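
A minimal softmax sketch. Subtracting the maximum before exponentiating is a standard numerical-stability trick; it does not change the result because softmax is invariant to adding a constant to every input.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)          # shift for stability: softmax(x) == softmax(x - c)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)                        # roughly [0.66, 0.24, 0.10]
print(p.sum())                  # 1.0 (up to floating-point error)
print(softmax(logits + 1000.0)) # same probabilities, no overflow
```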

Choosing Activation Functions

For Hidden Layers

  1. Start with ReLU - It's the default choice and works well in most cases
  2. Try Leaky ReLU / ELU - If you encounter the dying ReLU problem
  3. Use Tanh - For RNNs and when you need zero-centered outputs
  4. Avoid Sigmoid - In hidden layers, due to the vanishing gradient problem

For Output Layer

The choice depends on your task:

| Task | Activation | Reason |
|---|---|---|
| Binary Classification | Sigmoid | Outputs a probability between 0 and 1 |
| Multi-class Classification | Softmax | Outputs a probability distribution over classes |
| Regression | Linear (none) | Unrestricted output range |
| Positive Regression | ReLU / Softplus | Ensures positive outputs |
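
As a small illustration of this table, the dictionary below maps each task to a candidate output-layer activation; the task names are ad hoc and only meant to mirror the rows above.

```python
import numpy as np

def stable_softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Illustrative task -> output activation mapping (names are ad hoc)
OUTPUT_ACTIVATIONS = {
    "binary_classification": lambda z: 1.0 / (1.0 + np.exp(-z)),   # sigmoid
    "multiclass_classification": stable_softmax,                   # softmax
    "regression": lambda z: z,                                     # linear / identity
    "positive_regression": lambda z: np.maximum(0.0, z),           # ReLU (softplus is the smooth alternative)
}

logits = np.array([1.2, -0.3, 0.7])
print(OUTPUT_ACTIVATIONS["multiclass_classification"](logits))     # sums to 1
```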

GELU (Gaussian Error Linear Unit)

Used in transformers (BERT, GPT):

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
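
A sketch of GELU using the widely used tanh-based approximation to $x \cdot \Phi(x)$ (exact evaluation would instead use the Gaussian CDF, e.g. via an error-function routine); the sample inputs are arbitrary.

```python
import numpy as np

def gelu(x):
    # Tanh approximation to x * Phi(x), commonly used in transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))
# Unlike ReLU, small negative inputs still pass through a small negative value,
# and the function is smooth everywhere.
```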

Swish (SiLU)

Self-gated activation:

$$\text{Swish}(x) = x \cdot \sigma(x)$$

Empirically shown to match or outperform ReLU in many deep models.
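
A minimal Swish/SiLU sketch; comparing it with ReLU shows that Swish is smooth at zero and lets small negative values through instead of clipping them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)   # also known as SiLU

x = np.linspace(-4.0, 4.0, 9)
print(np.round(swish(x), 3))   # smooth, slightly negative for small negative x
print(np.maximum(0.0, x))      # ReLU for comparison: hard kink at 0, exactly 0 for x < 0
```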

Summary

  • Activation functions introduce non-linearity, essential for neural networks to learn complex patterns
  • ReLU is the default choice for hidden layers in most architectures
  • Sigmoid and Softmax are used in output layers for classification
  • Modern alternatives (GELU, Swish) show promise in specific architectures
  • Choice depends on architecture, task, and empirical performance