Activation Functions
Activation functions are a fundamental component of neural networks that introduce non-linearity into the model, enabling neural networks to learn complex patterns and relationships in data.
Why Do We Need Activation Functions?
Non-Linearity is Essential
If activation functions were linear (or absent), stacking multiple layers would still result in a single linear transformation:

$$W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2) = Wx + b$$
This would make deep networks no more powerful than a single-layer network. Non-linearity allows neural networks to learn complex patterns, decision boundaries, and hierarchical representations.
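This collapse is easy to verify numerically; below is a minimal NumPy sketch (shapes and random values chosen arbitrarily):

```python
import numpy as np

# Minimal check: two stacked linear layers collapse into a single linear map.
rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # arbitrary input vector
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2          # no activation between the layers
W, b = W2 @ W1, W2 @ b1 + b2                  # the equivalent single layer
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))     # True
```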
Differentiability for Gradient-Based Optimization
For gradient-based optimization (like gradient descent), activation functions must be differentiable. The gradient of the loss with respect to the weights requires taking derivatives through the activation functions (via backpropagation):

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}, \quad \text{where } a = f(z) \text{ and } z = Wx + b$$
If the activation function is not differentiable, we cannot compute these gradients for backpropagation.
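A minimal worked example of this chain rule for a single sigmoid unit with a squared-error loss (values chosen arbitrarily):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One unit: z = w*x + b, a = sigmoid(z), L = (a - y)^2
x, y = 2.0, 1.0      # input and target
w, b = 0.5, -0.3     # parameters

z = w * x + b
a = sigmoid(z)

dL_da = 2 * (a - y)      # derivative of the squared-error loss
da_dz = a * (1 - a)      # derivative of the sigmoid activation
dz_dw = x                # derivative of the pre-activation w.r.t. w

dL_dw = dL_da * da_dz * dz_dw   # chain rule, as in the formula above
print(dL_dw)
```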
Common Activation Functions
| Activation | Formula | Derivative | Range | Properties |
|---|---|---|---|---|
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $\sigma(x)(1 - \sigma(x))$ | (0,1) | Output interpretable as probability, prone to vanishing gradients |
| Tanh | $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | $1 - \tanh^2(x)$ | (-1,1) | Zero-centered, still suffers from vanishing gradients |
| ReLU | $\max(0, x)$ | $1$ if $x > 0$, else $0$ | [0,∞) | Most popular, computationally efficient, can suffer from "dying ReLU" |
| Leaky ReLU | $x$ if $x > 0$, else $0.01x$ | $1$ if $x > 0$, else $0.01$ | (-∞,∞) | Prevents dying ReLU problem |
| ELU | $x$ if $x > 0$, else $\alpha(e^{x} - 1)$ | $1$ if $x > 0$, else $\alpha e^{x}$ | (-α,∞) | Smooth, self-normalizing, can be slow to compute |
| Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | $s_i(\delta_{ij} - s_j)$ | (0,1) | Used for multi-class classification (output layer) |
ReLU is technically not differentiable at $x = 0$, but in practice we define $f'(0) = 0$ (or 1), which works well.
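For reference, a minimal NumPy sketch of the activations and derivatives in the table (function names are my own; inputs are assumed to be NumPy arrays; softmax is sketched in its own section below):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)           # derivative at 0 taken as 0 by convention

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

def elu(x, alpha=1.0):
    neg = np.minimum(x, 0.0)               # clamp so exp() never overflows
    return np.where(x > 0, x, alpha * (np.exp(neg) - 1.0))

def d_elu(x, alpha=1.0):
    neg = np.minimum(x, 0.0)
    return np.where(x > 0, 1.0, alpha * np.exp(neg))
```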
Detailed Analysis
Sigmoid Function
Pros:
- Output range (0,1) can be interpreted as probabilities
- Smooth gradient
- Historically important
Cons:
- Vanishing gradient problem: For inputs of large magnitude, the gradient $\sigma(x)(1 - \sigma(x))$ becomes very small (≈ 0), which slows learning (see the sketch below)
- Not zero-centered: Can cause zig-zagging dynamics during gradient descent
- Computationally expensive (exponential)
Use cases: Binary classification (output layer), gates in LSTM/GRU
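A quick numeric illustration of the vanishing gradient issue mentioned above: the sigmoid derivative peaks at 0.25 and decays rapidly away from zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid derivative is sigmoid(x) * (1 - sigmoid(x)); it peaks at 0.25
# for x = 0 and is nearly zero for inputs of large magnitude.
for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x = {x:5.1f}   sigmoid'(x) = {s * (1 - s):.2e}")
# x =   0.0   sigmoid'(x) = 2.50e-01
# x =  10.0   sigmoid'(x) = 4.54e-05
```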
Tanh (Hyperbolic Tangent)
Pros:
- Zero-centered output (better than sigmoid)
- Smooth gradient
Cons:
- Still suffers from vanishing gradient problem
- Computationally expensive
Use cases: Hidden layers in RNNs, gates in LSTM/GRU
ReLU (Rectified Linear Unit)
Pros:
- Very computationally efficient: Simple max operation
- Helps mitigate vanishing gradient problem
- Encourages sparse activations (many neurons output 0)
- Empirically shown to accelerate convergence
Cons:
- Dying ReLU problem: Neurons can permanently "die" (always output 0) if their weights shift so that the pre-activation is negative for every input, because the gradient there is zero and the weights stop updating (see the sketch below)
- Not zero-centered
- Unbounded output
Use cases: Default choice for hidden layers in feedforward and convolutional neural networks
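A minimal sketch of the dying ReLU failure mode described above (toy data, arbitrary shapes): once the pre-activation is negative for every input, both the output and the gradient are zero, so the unit cannot recover.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))             # toy inputs
w, b = rng.normal(size=4), -50.0           # a large negative bias "kills" the unit

z = X @ w + b                              # pre-activations: all strongly negative
a = relu(z)                                # outputs: all zero
grad_mask = (z > 0).astype(float)          # ReLU gradient mask: also all zero

print(a.max(), grad_mask.sum())            # 0.0 0.0 -> no gradient ever flows back
```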
Leaky ReLU
Pros:
- Fixes the dying ReLU problem (negative inputs have small non-zero gradient)
- Computationally efficient
Cons:
- The slope for negative inputs (typically 0.01) is an additional hyperparameter
- Not always better than ReLU in practice
Variants:
- Parametric ReLU (PReLU): Learns the slope parameter during training
- Randomized Leaky ReLU (RReLU): Randomizes the slope during training
ELU (Exponential Linear Unit)
Pros:
- Smooth everywhere (including at 0)
- Negative saturation helps make the network more robust to noise
- Closer to zero-mean outputs
- Can lead to faster learning
Cons:
- Computationally more expensive due to exponential
- Requires tuning of the hyperparameter $\alpha$
Softmax
Properties:
- Outputs sum to 1 (can be interpreted as probability distribution)
- Differentiable
- Sensitive to outliers (large values dominate)
Use cases: Multi-class classification (output layer), attention mechanisms
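Because large logits dominate (and naive exponentiation overflows), softmax is usually computed with the max-subtraction trick, which leaves the result unchanged; a minimal sketch:

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)   # softmax is shift-invariant; largest term becomes exp(0)
    e = np.exp(shifted)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])   # naive exp() would overflow here
probs = softmax(logits)
print(probs)          # ~[0.090, 0.245, 0.665]
print(probs.sum())    # 1.0
```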
Choosing Activation Functions
For Hidden Layers
- Start with ReLU - It's the default choice and works well in most cases
- Try Leaky ReLU / ELU - If you encounter dying ReLU problem
- Use Tanh - For RNNs and when you need zero-centered outputs
- Avoid Sigmoid - In hidden layers due to vanishing gradient problem
For Output Layer
The choice depends on your task:
| Task | Activation | Reason |
|---|---|---|
| Binary Classification | Sigmoid | Outputs probability between 0 and 1 |
| Multi-class Classification | Softmax | Outputs probability distribution over classes |
| Regression | Linear (None) | Unrestricted output range |
| Positive Regression | ReLU / Softplus | Ensures positive outputs |
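As a small illustrative sketch of the table above (the helper name and task labels are my own, not from any library):

```python
import numpy as np

def output_activation(task, z):
    """Apply the output-layer activation suggested for each task type."""
    if task == "binary_classification":
        return 1.0 / (1.0 + np.exp(-z))            # sigmoid -> probability in (0, 1)
    if task == "multiclass_classification":
        e = np.exp(z - np.max(z))
        return e / e.sum()                         # softmax -> distribution over classes
    if task == "positive_regression":
        return np.maximum(0.0, z)                  # ReLU -> non-negative outputs
    return z                                       # plain regression: linear / identity
```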
Modern Trends
GELU (Gaussian Error Linear Unit)
Used in transformers (BERT, GPT):

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
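A minimal sketch of both the exact form and the tanh approximation commonly used in transformer implementations (assumes SciPy is available for `erf`):

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF written via erf.
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation used in many transformer implementations.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu(x) - gelu_tanh(x))))   # approximation error is small (~1e-3 or less)
```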
Swish (SiLU)
Self-gated activation:

$$\text{Swish}(x) = x \cdot \sigma(\beta x)$$

With $\beta = 1$ this is the SiLU.
Empirically shown to match or outperform ReLU in many deep networks.
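A minimal sketch (with $\beta = 1$, i.e. SiLU, as the default):

```python
import numpy as np

def swish(x, beta=1.0):
    # swish(x) = x * sigmoid(beta * x); beta = 1 gives SiLU.
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(swish(x))   # negative inputs are damped but not zeroed out, unlike ReLU
```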
Summary
- Activation functions introduce non-linearity, essential for neural networks to learn complex patterns
- ReLU is the default choice for hidden layers in most architectures
- Sigmoid and Softmax are used in output layers for classification
- Modern alternatives (GELU, Swish) show promise in specific architectures
- Choice depends on architecture, task, and empirical performance