Gradient Descent
Introduction
Gradient descent is an iterative optimization algorithm used to find the minimum of a cost function. The key idea is simple: to minimize a function, we move in the direction opposite to its gradient (the direction of steepest ascent).
Core Principle
At each iteration, we update the parameters:

$$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$$

Where:
- $\theta$ are the model parameters we want to optimize
- $\alpha$ is the learning rate (step size)
  - Too small → slow convergence (many iterations needed)
  - Too large → may overshoot and diverge
- $\nabla_\theta J(\theta)$ is the gradient (direction of steepest ascent), so we subtract it to descend
The algorithm continues until it reaches a minimum (where the gradient is zero or very small).
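The update rule can be sketched in a few lines of Python. The quadratic $f(\theta) = (\theta - 3)^2$ used here is a hypothetical example (not from the text), chosen so the known minimum at $\theta = 3$ makes convergence easy to see:

```python
# Minimal gradient descent on f(theta) = (theta - 3)^2, minimized at theta = 3.
# Its gradient is f'(theta) = 2 * (theta - 3).

def gradient(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0           # initial guess
alpha = 0.1           # learning rate
for _ in range(200):  # iterate; the gradient shrinks as we approach the minimum
    theta -= alpha * gradient(theta)

print(round(theta, 4))  # converges to ~3.0
```

Each step multiplies the distance to the minimum by $(1 - 2\alpha)$ here, so the iterates close in on 3 geometrically.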
Batch Gradient Descent
Batch Gradient Descent (often called "vanilla" gradient descent) computes the gradient using the entire training dataset in each iteration.
Algorithm
Initialize: $\theta$ randomly (or to zeros), set learning rate $\alpha$

Repeat until the stopping criterion is met:

$$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$$

Where the gradient is computed over all $m$ training examples:

$$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta J\big(\theta; x^{(i)}, y^{(i)}\big)$$
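As an illustration, here is a minimal Batch GD sketch for linear regression with MSE loss; the synthetic dataset, learning rate, and iteration count are assumptions made for the example:

```python
import numpy as np

# Batch gradient descent for linear regression with MSE loss.
# Synthetic, noiseless data: y = 1 + 2x, so the true weights are [1.0, 2.0].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])  # bias column + feature
y = X @ np.array([1.0, 2.0])

theta = np.zeros(2)
alpha = 0.5
m = len(y)
for _ in range(2000):
    grad = X.T @ (X @ theta - y) / m  # gradient over ALL m examples
    theta -= alpha * grad

print(theta)  # ~ [1.0, 2.0]
```

Note that every update touches the full dataset, which is exactly the cost that SGD and mini-batch GD address below.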
Stochastic Gradient Descent (SGD)
To address the computational bottleneck of Batch GD, Stochastic Gradient Descent uses one randomly selected training example at each iteration instead of the entire dataset.
Algorithm
Initialize: $\theta$ randomly (or to zeros), set learning rate $\alpha$

Repeat until the stopping criterion is met:
1. Randomly sample an example $(x^{(i)}, y^{(i)})$
2. Compute the gradient for this single example: $\nabla_\theta J\big(\theta; x^{(i)}, y^{(i)}\big)$
3. Update parameters: $\theta \leftarrow \theta - \alpha \nabla_\theta J\big(\theta; x^{(i)}, y^{(i)}\big)$
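The same linear-regression setup can illustrate SGD; one randomly sampled example drives each update. The data and hyperparameters are again assumptions for the sketch:

```python
import numpy as np

# Stochastic gradient descent: one randomly sampled example per update.
# Synthetic, noiseless data: y = 1 + 2x, true weights [1.0, 2.0].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.uniform(0, 1, 500)])
y = X @ np.array([1.0, 2.0])

theta = np.zeros(2)
alpha = 0.1
for step in range(20000):
    i = rng.integers(len(y))             # randomly sample ONE example
    grad = (X[i] @ theta - y[i]) * X[i]  # gradient for that single example
    theta -= alpha * grad

print(theta)  # noisy path, but ends close to [1.0, 2.0]
```

Each update is cheap, but the trajectory is noisy; on real (noisy) data the iterates hover around the minimum rather than settling exactly.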
Mini-batch Gradient Descent
Mini-batch Gradient Descent is the practical compromise between Batch GD and SGD, using a batch size $b$ (typically 32-256 examples).
Algorithm
Initialize: $\theta$ randomly, set learning rate $\alpha$ and batch size $b$

Repeat for each epoch (until stopping):
1. Shuffle all data points randomly
2. For each mini-batch $B$ of size $b$:
   - Compute the gradient over the batch: $\nabla_\theta J(\theta) = \frac{1}{b} \sum_{i \in B} \nabla_\theta J\big(\theta; x^{(i)}, y^{(i)}\big)$
   - Update parameters: $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$
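The shuffle-then-batch loop looks like this in a minimal sketch; dataset size, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

# Mini-batch gradient descent: shuffle each epoch, then update on batches of size b.
# Synthetic, noiseless data: y = 1 + 2x, true weights [1.0, 2.0].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(256), rng.uniform(0, 1, 256)])
y = X @ np.array([1.0, 2.0])

theta = np.zeros(2)
alpha, b = 0.5, 32
for epoch in range(500):
    perm = rng.permutation(len(y))       # shuffle all data points
    for start in range(0, len(y), b):
        idx = perm[start:start + b]      # one mini-batch
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / len(idx)
        theta -= alpha * grad

print(theta)  # ~ [1.0, 2.0]
```

In practice the batched matrix products (`Xb @ theta`) are what make this variant so GPU-friendly.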
Comparison of Methods
All three methods follow similar patterns but differ in how they compute gradients. Here's a comprehensive comparison:
| Method | Examples per Update | Speed | Convergence | Memory | Pros | Cons | Best Use Case |
|---|---|---|---|---|---|---|---|
| Batch GD | All examples | Slow | Smooth, stable, guaranteed for convex | High | ✅ Exact gradient ✅ Guaranteed convergence (convex) ✅ Smooth loss curve | ❌ Slow per iteration ❌ High memory ❌ Impractical for large data | Small datasets, convex problems |
| SGD | 1 example | Very fast | Noisy, explores more | Low | ✅ Fast per iteration ✅ Memory efficient ✅ Escapes local minima ✅ Online learning | ❌ Noisy updates ❌ Many iterations needed ❌ Requires learning rate tuning | Online learning, huge datasets |
| Mini-batch GD | $b$ examples (32-256) | Fast | Balanced | Medium | ✅ Speed/stability balance ✅ GPU vectorization ✅ Moderate memory ✅ Industry standard | ❌ Requires batch size tuning ❌ Slightly more complex | Deep learning (most common) |
Start with mini-batch GD (batch size 32-256) as the default choice. It offers the best balance of speed, stability, and memory efficiency for most applications.
Stopping Criteria
Gradient descent algorithms need to know when to stop iterating. Here are the main criteria:
1. Gradient is Zero (Theoretical Optimum)
Most intuitive for convex problems like linear regression:

$$\nabla_\theta J(\theta) = 0$$

At the minimum of a convex function, the gradient is exactly zero. This gives us the analytical solution (e.g., the Normal Equation for linear regression).

In practice, we check whether the gradient is close to zero:

$$\|\nabla_\theta J(\theta)\| < \epsilon$$

Where $\epsilon$ is a small threshold (e.g., $10^{-6}$). This indicates the model is near a minimum.
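The gradient-norm check can be wired into the loop directly. The quadratic objective here is a hypothetical stand-in for a real cost function:

```python
# Gradient-norm stopping check: stop once |grad J(theta)| < epsilon.
# Illustrated on J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, alpha, epsilon = 0.0, 0.1, 1e-6
steps = 0
while True:
    grad = 2.0 * (theta - 3.0)
    if abs(grad) < epsilon:  # gradient ~ 0 -> near a minimum, stop
        break
    theta -= alpha * grad
    steps += 1

print(steps, round(theta, 6))  # stops after a finite number of steps with theta ~ 3.0
```

For multi-dimensional $\theta$ the same check uses a vector norm, e.g. `np.linalg.norm(grad) < epsilon`.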
For convex functions (like linear regression with MSE loss), this guarantees we've found the global minimum!
2. Loss Convergence
Stop when the loss function stops decreasing significantly:

$$|J(\theta_t) - J(\theta_{t-1})| < \epsilon$$
This means the optimization has plateaued and further iterations won't improve much.
3. Validation Performance Plateau (Early Stopping)
Stop when performance on a held-out validation set stops improving:
```
if val_loss_current > val_loss_best for k consecutive epochs:
    stop training  # Early stopping
```
This is called early stopping and helps prevent overfitting. It's especially useful when training neural networks.
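The pseudocode above can be made concrete as a small helper. The function name `early_stop_epoch`, the patience value, and the loss curve are all hypothetical choices for the sketch:

```python
# Early-stopping sketch: stop after `patience` epochs with no validation improvement.
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops, or None if stopping never triggers."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss         # new best -> reset the patience counter
            bad_epochs = 0
        else:
            bad_epochs += 1     # no improvement this epoch
            if bad_epochs >= patience:
                return epoch    # stop training here
    return None

# Loss improves, then plateaus: stopping triggers 3 epochs after the best value.
print(early_stop_epoch([1.0, 0.6, 0.4, 0.45, 0.44, 0.46], patience=3))  # → 5
```

In a real training loop you would also keep a checkpoint of the parameters from the best epoch and restore it when stopping.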
4. Maximum Iterations Reached
Set a fixed number of iterations (epochs) beforehand:
```python
max_epochs = 100
for epoch in range(max_epochs):
    ...  # gradient descent updates
```
This is a practical constraint to avoid infinite loops.
5. Time Limit
Stop after a fixed amount of wall-clock time (practical constraint for production systems).
Practical Tips
- Start with mini-batch GD (batch size 32-256) as the default choice
- Shuffle the training data at the start of each epoch for better convergence
- Use learning rate decay to get smoother convergence near the minimum
- Monitor validation loss for early stopping to prevent overfitting
- Use modern optimizers (Adam, RMSprop) for adaptive learning rates
Advanced Variants
Modern optimizers build on basic gradient descent with improvements for faster and more stable convergence.
SGD with Momentum
Accelerates convergence by accumulating a velocity vector (like a ball rolling downhill):

$$v \leftarrow \beta v + \nabla_\theta J(\theta)$$
$$\theta \leftarrow \theta - \alpha v$$

Where $\beta$ (typically 0.9) controls how much past gradients influence the current update.
Benefits: Smooths out oscillations, accelerates in consistent directions, helps escape plateaus.
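A minimal momentum sketch, again on a hypothetical 1-D quadratic; the update form (velocity accumulates the raw gradient, the step scales it by $\alpha$) is one common formulation:

```python
# SGD with momentum on f(theta) = (theta - 3)^2: accumulate a velocity vector.
theta, v = 0.0, 0.0
alpha, beta = 0.05, 0.9
for _ in range(300):
    grad = 2.0 * (theta - 3.0)
    v = beta * v + grad    # past gradients keep influencing the update
    theta -= alpha * v

print(round(theta, 4))  # ~ 3.0
```

With $\beta = 0$ this reduces exactly to plain gradient descent; larger $\beta$ lets consistent gradient directions compound into bigger effective steps.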
Adam (Adaptive Moment Estimation)
The most popular modern optimizer. Combines momentum with adaptive learning rates for each parameter:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \big(\nabla_\theta J(\theta)\big)^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta \leftarrow \theta - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Where $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the standard defaults.
Benefits: Adapts learning rate per parameter, works well with default hyperparameters, widely used in deep learning.
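The Adam update can be sketched in scalar form on the same hypothetical quadratic, using the standard defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$:

```python
import math

# Adam sketch on f(theta) = (theta - 3)^2.
theta, m, v = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):  # note: t starts at 1 for the bias correction
    grad = 2.0 * (theta - 3.0)
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (adaptive scale)
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # close to 3.0
```

Because the step divides by $\sqrt{\hat{v}_t}$, the effective step size per parameter shrinks where gradients have been large and grows where they have been small.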
When using gradient descent to train neural networks, you'll also need to choose appropriate activation functions that are non-linear and differentiable for backpropagation.
Summary
Gradient Descent is a family of fundamental optimization algorithms in machine learning:
- Batch GD: Uses all examples, stable but slow — best for small datasets and convex problems
- SGD: Uses one example at a time, fast but noisy — best for online learning and huge datasets
- Mini-batch GD: Uses small batches (32-256), industry standard — best for deep learning
Key Concepts:
- Stop when gradient ≈ 0 (for convex problems), or use early stopping on validation set
- Modern variants (Momentum, Adam) provide faster and more stable convergence
- For neural networks, see Activation Functions for non-linear transformations