Gradient Descent
Introduction
Gradient descent is an iterative optimization algorithm used to find the minimum of a cost function. The key idea is simple: to minimize a function, we move in the direction opposite to its gradient (the direction of steepest ascent).
Core Principle
At each iteration, we update the parameters:

$$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$$

Where:
- $\theta$ are the model parameters we want to optimize
- $\alpha$ is the learning rate (step size)
  - Too small → slow convergence (many iterations needed)
  - Too large → may overshoot and diverge
- $\nabla_\theta J(\theta)$ is the gradient (direction of steepest ascent), so we subtract it to descend
The algorithm continues until it reaches a minimum (where the gradient is zero or very small).
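The update rule can be sketched in a few lines of Python. The quadratic $f(\theta) = (\theta - 3)^2$ used here is a hypothetical example (not from the text), chosen so the known minimum at $\theta = 3$ makes convergence easy to see:

```python
# Minimal gradient descent on f(theta) = (theta - 3)^2, minimized at theta = 3.
# Its gradient is f'(theta) = 2 * (theta - 3).

def gradient(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0           # initial guess
alpha = 0.1           # learning rate
for _ in range(200):  # iterate; the gradient shrinks as we approach the minimum
    theta -= alpha * gradient(theta)

print(round(theta, 4))  # converges to ~3.0
```

Each step multiplies the distance to the minimum by $(1 - 2\alpha)$ here, so the iterates close in on 3 geometrically.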
Batch Gradient Descent
Batch Gradient Descent (often called "vanilla" gradient descent) computes the gradient using the entire training dataset in each iteration.
Algorithm
Initialize: $\theta$ randomly (or to zeros), set learning rate $\alpha$

Repeat until the stopping criterion is met:

$$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$$

Where the gradient is computed over all $m$ training examples:

$$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta J\big(\theta; x^{(i)}, y^{(i)}\big)$$
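As an illustration, here is a minimal Batch GD sketch for linear regression with MSE loss; the synthetic dataset, learning rate, and iteration count are assumptions made for the example:

```python
import numpy as np

# Batch gradient descent for linear regression with MSE loss.
# Synthetic, noiseless data: y = 1 + 2x, so the true weights are [1.0, 2.0].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])  # bias column + feature
y = X @ np.array([1.0, 2.0])

theta = np.zeros(2)
alpha = 0.5
m = len(y)
for _ in range(2000):
    grad = X.T @ (X @ theta - y) / m  # gradient over ALL m examples
    theta -= alpha * grad

print(theta)  # ~ [1.0, 2.0]
```

Note that every update touches the full dataset, which is exactly the cost that SGD and mini-batch GD address below.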
Stochastic Gradient Descent (SGD)
To address the computational bottleneck of Batch GD, Stochastic Gradient Descent uses one randomly selected training example at each iteration instead of the entire dataset.
Algorithm
Initialize: $\theta$ randomly (or to zeros), set learning rate $\alpha$

Repeat until the stopping criterion is met:
1. Randomly sample an example $(x^{(i)}, y^{(i)})$
2. Compute the gradient for this single example: $\nabla_\theta J\big(\theta; x^{(i)}, y^{(i)}\big)$
3. Update parameters: $\theta \leftarrow \theta - \alpha \nabla_\theta J\big(\theta; x^{(i)}, y^{(i)}\big)$
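The same linear-regression setup can illustrate SGD; one randomly sampled example drives each update. The data and hyperparameters are again assumptions for the sketch:

```python
import numpy as np

# Stochastic gradient descent: one randomly sampled example per update.
# Synthetic, noiseless data: y = 1 + 2x, true weights [1.0, 2.0].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.uniform(0, 1, 500)])
y = X @ np.array([1.0, 2.0])

theta = np.zeros(2)
alpha = 0.1
for step in range(20000):
    i = rng.integers(len(y))             # randomly sample ONE example
    grad = (X[i] @ theta - y[i]) * X[i]  # gradient for that single example
    theta -= alpha * grad

print(theta)  # noisy path, but ends close to [1.0, 2.0]
```

Each update is cheap, but the trajectory is noisy; on real (noisy) data the iterates hover around the minimum rather than settling exactly.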
Mini-batch Gradient Descent
Mini-batch Gradient Descent is the practical compromise between Batch GD and SGD, using a batch size $b$ (typically 32-256 examples).
Algorithm
Initialize: $\theta$ randomly, set learning rate $\alpha$ and batch size $b$

Repeat for each epoch (until stopping):
1. Shuffle all data points randomly
2. For each mini-batch $B$ of size $b$:
   - Compute the gradient over the batch: $\nabla_\theta J(\theta) = \frac{1}{b} \sum_{i \in B} \nabla_\theta J\big(\theta; x^{(i)}, y^{(i)}\big)$
   - Update parameters: $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$
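The shuffle-then-batch loop looks like this in a minimal sketch; dataset size, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

# Mini-batch gradient descent: shuffle each epoch, then update on batches of size b.
# Synthetic, noiseless data: y = 1 + 2x, true weights [1.0, 2.0].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(256), rng.uniform(0, 1, 256)])
y = X @ np.array([1.0, 2.0])

theta = np.zeros(2)
alpha, b = 0.5, 32
for epoch in range(500):
    perm = rng.permutation(len(y))       # shuffle all data points
    for start in range(0, len(y), b):
        idx = perm[start:start + b]      # one mini-batch
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / len(idx)
        theta -= alpha * grad

print(theta)  # ~ [1.0, 2.0]
```

In practice the batched matrix products (`Xb @ theta`) are what make this variant so GPU-friendly.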
Comparison of Methods
All three methods follow similar patterns but differ in how they compute gradients. Here's a comprehensive comparison:
| Method | Examples per Update | Speed | Convergence | Memory | Pros | Cons | Best Use Case |
|---|---|---|---|---|---|---|---|
| Batch GD | All examples | Slow | Smooth, stable, guaranteed for convex | High | ✅ Exact gradient ✅ Guaranteed convergence (convex) ✅ Smooth loss curve | ❌ Slow per iteration ❌ High memory ❌ Impractical for large data | Small datasets, convex problems |
| SGD | 1 example | Very fast | Noisy, explores more | Low | ✅ Fast per iteration ✅ Memory efficient ✅ Escapes local minima ✅ Online learning | ❌ Noisy updates ❌ Many iterations needed ❌ Requires learning rate tuning | Online learning, huge datasets |
| Mini-batch GD | $b$ examples (32-256) | Fast | Balanced | Medium | ✅ Speed/stability balance ✅ GPU vectorization ✅ Moderate memory ✅ Industry standard | ❌ Requires batch size tuning ❌ Slightly more complex | Deep learning (most common) |
Start with mini-batch GD (batch size 32-256) as the default choice. It offers the best balance of speed, stability, and memory efficiency for most applications.
Stopping Criteria
Gradient descent algorithms need to know when to stop iterating. Here are the main criteria:
1. Gradient is Zero (Theoretical Optimum)
Most intuitive for convex problems like linear regression:

$$\nabla_\theta J(\theta) = 0$$

At the minimum of a convex function, the gradient is exactly zero. This gives us the analytical solution (e.g., the Normal Equation for linear regression).

In practice, we check whether the gradient is close to zero:

$$\|\nabla_\theta J(\theta)\| < \epsilon$$

Where $\epsilon$ is a small threshold (e.g., $10^{-6}$). This indicates the model is near a minimum.
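The gradient-norm check can be wired into the loop directly. The quadratic objective here is a hypothetical stand-in for a real cost function:

```python
# Gradient-norm stopping check: stop once |grad J(theta)| < epsilon.
# Illustrated on J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, alpha, epsilon = 0.0, 0.1, 1e-6
steps = 0
while True:
    grad = 2.0 * (theta - 3.0)
    if abs(grad) < epsilon:  # gradient ~ 0 -> near a minimum, stop
        break
    theta -= alpha * grad
    steps += 1

print(steps, round(theta, 6))  # stops after a finite number of steps with theta ~ 3.0
```

For multi-dimensional $\theta$ the same check uses a vector norm, e.g. `np.linalg.norm(grad) < epsilon`.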
For convex functions (like linear regression with MSE loss), this guarantees we've found the global minimum!
2. Loss Convergence
Stop when the loss function stops decreasing significantly:

$$|J(\theta_t) - J(\theta_{t-1})| < \epsilon$$
This means the optimization has plateaued and further iterations won't improve much.
3. Validation Performance Plateau (Early Stopping)
Stop when performance on a held-out validation set stops improving:
```
if val_loss_current > val_loss_best for k consecutive epochs:
    stop training  # Early stopping
```
This is called early stopping and helps prevent overfitting. It's especially useful when training neural networks.
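The pseudocode above can be made concrete as a small helper. The function name `early_stop_epoch`, the patience value, and the loss curve are all hypothetical choices for the sketch:

```python
# Early-stopping sketch: stop after `patience` epochs with no validation improvement.
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops, or None if stopping never triggers."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss         # new best -> reset the patience counter
            bad_epochs = 0
        else:
            bad_epochs += 1     # no improvement this epoch
            if bad_epochs >= patience:
                return epoch    # stop training here
    return None

# Loss improves, then plateaus: stopping triggers 3 epochs after the best value.
print(early_stop_epoch([1.0, 0.6, 0.4, 0.45, 0.44, 0.46], patience=3))  # → 5
```

In a real training loop you would also keep a checkpoint of the parameters from the best epoch and restore it when stopping.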
4. Maximum Iterations Reached
Set a fixed number of iterations (epochs) beforehand:
```python
max_epochs = 100
for epoch in range(max_epochs):
    ...  # gradient descent updates
```
This is a practical constraint to avoid infinite loops.
5. Time Limit
Stop after a fixed amount of wall-clock time (practical constraint for production systems).
Practical Tips
- Start with mini-batch GD (batch size 32-256) as the default choice
- Shuffle the training data at the start of each epoch for better convergence
- Use learning rate decay to get smoother convergence near the minimum
- Monitor validation loss for early stopping to prevent overfitting
- Use modern optimizers (Adam, RMSprop) for adaptive learning rates
Advanced Variants
Modern optimizers build on basic gradient descent with improvements for faster and more stable convergence.
SGD with Momentum
Accelerates convergence by accumulating a velocity vector (like a ball rolling downhill):

$$v \leftarrow \beta v + \nabla_\theta J(\theta)$$
$$\theta \leftarrow \theta - \alpha v$$

Where $\beta$ (typically 0.9) controls how much past gradients influence the current update.
Benefits: Smooths out oscillations, accelerates in consistent directions, helps escape plateaus.
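A minimal momentum sketch, again on a hypothetical 1-D quadratic; the update form (velocity accumulates the raw gradient, the step scales it by $\alpha$) is one common formulation:

```python
# SGD with momentum on f(theta) = (theta - 3)^2: accumulate a velocity vector.
theta, v = 0.0, 0.0
alpha, beta = 0.05, 0.9
for _ in range(300):
    grad = 2.0 * (theta - 3.0)
    v = beta * v + grad    # past gradients keep influencing the update
    theta -= alpha * v

print(round(theta, 4))  # ~ 3.0
```

With $\beta = 0$ this reduces exactly to plain gradient descent; larger $\beta$ lets consistent gradient directions compound into bigger effective steps.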
Adam (Adaptive Moment Estimation)
The most popular modern optimizer. Combines momentum with adaptive learning rates for each parameter:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \big(\nabla_\theta J(\theta)\big)^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta \leftarrow \theta - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Where $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the standard defaults.
Benefits: Adapts learning rate per parameter, works well with default hyperparameters, widely used in deep learning.
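The Adam update can be sketched in scalar form on the same hypothetical quadratic, using the standard defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$:

```python
import math

# Adam sketch on f(theta) = (theta - 3)^2.
theta, m, v = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):  # note: t starts at 1 for the bias correction
    grad = 2.0 * (theta - 3.0)
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (adaptive scale)
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # close to 3.0
```

Because the step divides by $\sqrt{\hat{v}_t}$, the effective step size per parameter shrinks where gradients have been large and grows where they have been small.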
When using gradient descent to train neural networks, you'll also need to choose appropriate activation functions that are non-linear and differentiable for backpropagation.
Summary
Gradient Descent is a family of fundamental optimization algorithms in machine learning:
- Batch GD: Uses all examples, stable but slow — best for small datasets and convex problems
- SGD: Uses one example at a time, fast but noisy — best for online learning and huge datasets
- Mini-batch GD: Uses small batches (32-256), industry standard — best for deep learning
Key Concepts:
- Stop when gradient ≈ 0 (for convex problems), or use early stopping on validation set
- Modern variants (Momentum, Adam) provide faster and more stable convergence
- For neural networks, see Activation Functions for non-linear transformations