Activation Functions

Activation functions are a fundamental component of neural networks. They introduce non-linearity into the model, enabling the network to learn complex patterns and relationships in data.

Why Do We Need Activation Functions?

Non-Linearity is Essential

If activation functions were linear (or absent), stacking multiple layers would still result in a linear transformation:

$$f(W_2 \cdot f(W_1 \cdot x)) = f(W_2 W_1 x) = W_{\text{combined}}\, x$$

This would make deep networks no more powerful than a single-layer network. Non-linearity allows neural networks to learn complex patterns, decision boundaries, and hierarchical representations.
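
As a quick sanity check of this claim, the sketch below (plain NumPy, with layer sizes chosen only for illustration) composes two weight matrices without an activation and confirms the result equals a single combined linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between.
W1 = rng.normal(size=(4, 3))   # first layer: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))   # second layer: 4 hidden units -> 2 outputs
x = rng.normal(size=3)

deep = W2 @ (W1 @ x)           # stacked linear layers...

W_combined = W2 @ W1           # ...collapse to a single linear map
shallow = W_combined @ x

print(np.allclose(deep, shallow))  # True: depth added no expressive power
```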

Differentiability for Gradient-Based Optimization

For gradient-based optimization (like gradient descent), activation functions must be differentiable. The gradient of the loss with respect to weights requires taking derivatives through the activation functions (via backpropagation):

$$\frac{\partial J}{\partial \theta} = \frac{\partial J}{\partial h} \cdot \frac{\partial h}{\partial \theta}$$

If the activation function is not differentiable, we cannot compute these gradients for backpropagation.
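
As a minimal illustration of this chain rule, the sketch below differentiates a one-parameter "network" $h = \sigma(\theta x)$ with a squared-error loss and checks the analytic gradient against a finite-difference estimate. The variable names and values are illustrative, not part of any particular framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One-parameter "network": h = sigmoid(theta * x), loss J = (h - y)^2
x, y, theta = 1.5, 0.0, 0.8

def loss(t):
    return (sigmoid(t * x) - y) ** 2

# Chain rule: dJ/dtheta = dJ/dh * dh/dz * dz/dtheta
h = sigmoid(theta * x)
dJ_dh = 2.0 * (h - y)
dh_dz = h * (1.0 - h)        # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dz_dtheta = x
analytic = dJ_dh * dh_dz * dz_dtheta

# Finite-difference check of the same derivative
eps = 1e-6
numeric = (loss(theta + eps) - loss(theta - eps)) / (2.0 * eps)

print(analytic, numeric)     # the two estimates agree to several decimal places
```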

Common Activation Functions

| Activation | Formula | Derivative | Range | Properties |
|---|---|---|---|---|
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ | $(0, 1)$ | Output interpretable as a probability, prone to vanishing gradients |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $\tanh'(x) = 1 - \tanh^2(x)$ | $(-1, 1)$ | Zero-centered, still suffers from vanishing gradients |
| ReLU | $\text{ReLU}(x) = \max(0, x)$ | $\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$ | $[0, \infty)$ | Most popular, computationally efficient, can suffer from "dying ReLU" |
| Leaky ReLU | $\text{LeakyReLU}(x) = \max(0.01x, x)$ | $\text{LeakyReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0.01 & x \leq 0 \end{cases}$ | $(-\infty, \infty)$ | Prevents the dying ReLU problem |
| ELU | $\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$ | $\text{ELU}'(x) = \begin{cases} 1 & x > 0 \\ \text{ELU}(x) + \alpha & x \leq 0 \end{cases}$ | $(-\alpha, \infty)$ | Smooth, self-normalizing, can be slow to compute |
| Softmax | $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ | $\frac{\partial\, \text{softmax}(x_i)}{\partial x_j} = \text{softmax}(x_i)(\delta_{ij} - \text{softmax}(x_j))$ | $(0, 1)$ | Used for multi-class classification (output layer) |
note

ReLU is technically not differentiable at $x = 0$, but in practice we define $\text{ReLU}'(0) = 0$ (or 1), which works well.

Detailed Analysis

Sigmoid Function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Pros:

  • Output range (0,1) can be interpreted as probabilities
  • Smooth gradient
  • Historically important

Cons:

  • Vanishing gradient problem: For inputs of large magnitude (very positive or very negative), the gradient becomes very small (≈ 0), slowing learning
  • Not zero-centered: Can cause zig-zagging dynamics during gradient descent
  • Computationally expensive (exponential)

Use cases: Binary classification (output layer), gates in LSTM/GRU
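
To see the vanishing-gradient problem concretely, the snippet below evaluates the sigmoid derivative at a few representative inputs (the specific values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid'(x) = {sigmoid_grad(x):.2e}")
# The gradient peaks at 0.25 at x = 0 and is ~4.5e-05 by x = 10,
# so layers behind a saturated sigmoid receive almost no learning signal.
```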

Tanh (Hyperbolic Tangent)

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$

Pros:

  • Zero-centered output (better than sigmoid)
  • Smooth gradient

Cons:

  • Still suffers from vanishing gradient problem
  • Computationally expensive

Use cases: Hidden layers in RNNs, gates in LSTM/GRU
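
A short numerical check of the identity above and of the zero-centered property (input values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True

# Unlike sigmoid, tanh outputs are centered around zero
z = np.random.default_rng(0).normal(size=100_000)
print(np.tanh(z).mean())   # close to 0.0
print(sigmoid(z).mean())   # close to 0.5
```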

ReLU (Rectified Linear Unit)

$$\text{ReLU}(x) = \max(0, x)$$

Pros:

  • Very computationally efficient: Simple max operation
  • Helps mitigate vanishing gradient problem
  • Encourages sparse activations (many neurons output 0)
  • Empirically shown to accelerate convergence

Cons:

  • Dying ReLU problem: A neuron can permanently "die" (always output 0) if its weights shift so that the pre-activation is negative for every input; its gradient is then always 0 and it stops updating
  • Not zero-centered
  • Unbounded output

Use cases: Default choice for hidden layers in feedforward and convolutional neural networks
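
A minimal sketch of ReLU and its (sub)gradient, showing the sparsity effect on random pre-activations (the input distribution is only illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Convention: gradient taken as 0 at x = 0 (see the note above)
    return (x > 0).astype(float)

z = np.random.default_rng(0).normal(size=10_000)   # random pre-activations
a = relu(z)

print(f"fraction of exactly-zero activations: {(a == 0).mean():.2f}")   # roughly 0.5 here
print(f"fraction of zero gradients:           {(relu_grad(z) == 0).mean():.2f}")
```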

Leaky ReLU

$$\text{LeakyReLU}(x) = \max(0.01x, x) = \begin{cases} x & x > 0 \\ 0.01x & x \leq 0 \end{cases}$$

Pros:

  • Fixes the dying ReLU problem (negative inputs have small non-zero gradient)
  • Computationally efficient

Cons:

  • The slope for negative values (0.01) is a hyperparameter
  • Not always better than ReLU in practice

Variants:

  • Parametric ReLU (PReLU): Learns the slope parameter during training
  • Randomized Leaky ReLU (RReLU): Randomizes the slope during training
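
A minimal Leaky ReLU sketch with the negative slope exposed as a parameter (0.01 by default, matching the formula above); making this slope learnable is essentially what PReLU does.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    return np.where(x > 0, x, negative_slope * x)

def leaky_relu_grad(x, negative_slope=0.01):
    return np.where(x > 0, 1.0, negative_slope)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))        # negative inputs are scaled by 0.01 instead of clipped to 0
print(leaky_relu_grad(x))   # the gradient on the negative side is 0.01, never exactly 0
```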

ELU (Exponential Linear Unit)

$$\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$

Pros:

  • Smooth everywhere (including at 0)
  • Negative saturation helps make the network more robust to noise
  • Closer to zero-mean outputs
  • Can lead to faster learning

Cons:

  • Computationally more expensive due to exponential
  • Requires tuning of the $\alpha$ hyperparameter
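
A sketch of ELU and its derivative with $\alpha = 1.0$ (a common default, assumed here), illustrating the smooth saturation toward $-\alpha$ on the negative side:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # For x <= 0: d/dx [alpha * (e^x - 1)] = alpha * e^x = ELU(x) + alpha
    return np.where(x > 0, 1.0, elu(x, alpha) + alpha)

x = np.array([-10.0, -1.0, 0.0, 3.0])
print(elu(x))        # negative side saturates smoothly near -alpha
print(elu_grad(x))   # gradient shrinks toward 0 for very negative inputs, is 1 for x > 0
```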

Softmax

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

Properties:

  • Outputs sum to 1 (can be interpreted as probability distribution)
  • Differentiable
  • Sensitive to outliers (large values dominate)

Use cases: Multi-class classification (output layer), attention mechanisms
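
A minimal softmax sketch. Subtracting the maximum before exponentiating is a standard numerical-stability trick; it does not change the result because softmax is invariant to adding a constant to every input.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)          # shift for stability: softmax(x) == softmax(x - c)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)                        # roughly [0.66, 0.24, 0.10]
print(p.sum())                  # 1.0 (up to floating-point error)
print(softmax(logits + 1000.0)) # same probabilities, no overflow
```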

Choosing Activation Functions

For Hidden Layers

  1. Start with ReLU - It's the default choice and works well in most cases
  2. Try Leaky ReLU / ELU - If you encounter the dying ReLU problem
  3. Use Tanh - For RNNs and when you need zero-centered outputs
  4. Avoid Sigmoid - In hidden layers, due to the vanishing gradient problem

For Output Layer

The choice depends on your task:

| Task | Activation | Reason |
|---|---|---|
| Binary Classification | Sigmoid | Outputs a probability between 0 and 1 |
| Multi-class Classification | Softmax | Outputs a probability distribution over classes |
| Regression | Linear (none) | Unrestricted output range |
| Positive Regression | ReLU / Softplus | Ensures positive outputs |
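
As a small illustration of this table, the dictionary below maps each task to a candidate output-layer activation; the task names are ad hoc and only meant to mirror the rows above.

```python
import numpy as np

def stable_softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Illustrative task -> output activation mapping (names are ad hoc)
OUTPUT_ACTIVATIONS = {
    "binary_classification": lambda z: 1.0 / (1.0 + np.exp(-z)),   # sigmoid
    "multiclass_classification": stable_softmax,                   # softmax
    "regression": lambda z: z,                                     # linear / identity
    "positive_regression": lambda z: np.maximum(0.0, z),           # ReLU (softplus is the smooth alternative)
}

logits = np.array([1.2, -0.3, 0.7])
print(OUTPUT_ACTIVATIONS["multiclass_classification"](logits))     # sums to 1
```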

GELU (Gaussian Error Linear Unit)

Used in transformers (BERT, GPT):

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
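
A sketch of GELU using the widely used tanh-based approximation to $x \cdot \Phi(x)$ (exact evaluation would instead use the Gaussian CDF, e.g. via an error-function routine); the sample inputs are arbitrary.

```python
import numpy as np

def gelu(x):
    # Tanh approximation to x * Phi(x), commonly used in transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))
# Unlike ReLU, small negative inputs still pass through a small negative value,
# and the function is smooth everywhere.
```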

Swish (SiLU)

Self-gated activation:

$$\text{Swish}(x) = x \cdot \sigma(x)$$

Empirically shown to match or outperform ReLU in many deep models.
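
A minimal Swish/SiLU sketch; comparing it with ReLU shows that Swish is smooth at zero and lets small negative values through instead of clipping them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)   # also known as SiLU

x = np.linspace(-4.0, 4.0, 9)
print(np.round(swish(x), 3))   # smooth, slightly negative for small negative x
print(np.maximum(0.0, x))      # ReLU for comparison: hard kink at 0, exactly 0 for x < 0
```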

Summary

  • Activation functions introduce non-linearity, essential for neural networks to learn complex patterns
  • ReLU is the default choice for hidden layers in most architectures
  • Sigmoid and Softmax are used in output layers for classification
  • Modern alternatives (GELU, Swish) show promise in specific architectures
  • Choice depends on architecture, task, and empirical performance