
Linear Regression

What is Linear Regression?

[Interactive visualization: Palo Alto Housing Prices — 3D Linear Regression, with sliders for the learning rate and number of iterations, and live readouts for the intercept (θ₀), distance (θ₁), room size (θ₂), and MSE.]

Linear Regression is a supervised learning algorithm that assumes a linear relationship between the features x and the target y. We approximate the target y as a linear function of x, such that:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{i=0}^{d} \theta_i x_i = \theta^T x

with parameters (or weights) \theta_i, using the convention x_0 = 1 so that \theta_0 acts as the intercept (bias) term.
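As a quick illustration, here is a minimal numpy sketch of the hypothesis (the function and variable names are ours, not from the text); prepending x_0 = 1 lets \theta_0 act as the intercept:

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x; assumes x[0] == 1 (bias feature)."""
    return theta @ x

theta = np.array([1.0, 2.0, 0.5])   # theta_0, theta_1, theta_2
x = np.array([1.0, 3.0, 4.0])       # x_0 = 1 prepended to the raw features (3.0, 4.0)
print(h(theta, x))                  # 1.0 + 2.0*3.0 + 0.5*4.0 = 9.0
```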

Then, we minimize the least squares objective:

J(\theta) = \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

Consider y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}, where the noise terms \varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2) are independent and identically distributed (iid).

\mathcal{N}(\mu, \sigma^2) denotes the Normal (Gaussian) distribution with probability density function:

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
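For reference, a direct numpy transcription of this density (the function name and defaults are ours):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(normal_pdf(0.0))  # 1/sqrt(2*pi) ≈ 0.3989
```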

Maximum Likelihood Estimation (MLE)

Under this noise model, the likelihood of \theta is:

L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)

Log-Likelihood (easier to work with):

\ell(\theta) = \log L(\theta)
= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)
= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)
= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2

Since the first term is constant in \theta, maximizing \ell(\theta) is equivalent to minimizing the sum of squared residuals:

\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \ell(\theta) = \arg\min_{\theta} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
tip

\sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2 is also called the Sum of Squared Errors (SSE).
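A small numeric sanity check of this equivalence on assumed synthetic data (all names and constants here are illustrative): between any two candidate \theta's, the one with the higher log-likelihood has the lower SSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 50, 0.25
X = np.c_[np.ones(n), rng.normal(size=(n, 2))]      # leading 1s column for theta_0
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

def sse(theta):
    r = y - X @ theta
    return r @ r

def log_likelihood(theta):
    # ell(theta) = n * log(1 / (sqrt(2*pi) * sigma)) - SSE(theta) / (2 * sigma^2)
    return n * np.log(1 / np.sqrt(2 * np.pi * sigma2)) - sse(theta) / (2 * sigma2)

good, bad = np.array([1.0, 2.0, -0.5]), np.zeros(3)
print(log_likelihood(good) > log_likelihood(bad))   # True
print(sse(good) < sse(bad))                         # True
```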

Cost Function

The Mean Squared Error (MSE) cost function:

J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
note

The factor 1/2n can be added to the cost function because it is a constant with respect to \theta inside \arg\min_\theta. It doesn't affect the location of the minimum, but it makes the derivatives cleaner when optimizing.

The gradient of this cost is:

\nabla J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
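A minimal numpy sketch of the cost and its gradient (the function names are ours; X is assumed to be the n×(d+1) matrix with a leading column of ones):

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(theta) = 1/(2n) * sum_i (h_theta(x_i) - y_i)^2."""
    n = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * n)

def mse_grad(theta, X, y):
    """grad J(theta) = 1/n * sum_i (h_theta(x_i) - y_i) * x_i = X^T (X theta - y) / n."""
    n = len(y)
    return X.T @ (X @ theta - y) / n
```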

Gradient Descent

Gradient Descent is an iterative optimization algorithm to find the minimum of the cost function.

Algorithm:

Initialize \theta^{(0)} randomly (or to zeros)

For t = 0, 1, 2, \dots (until convergence):

\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla J(\theta^{(t)})

Key Parameters:

  • \alpha is the learning rate (step size)
    • Too small → slow convergence (more iterations needed)
    • Too large → may overshoot and diverge
  • t is the iteration number
  • The gradient \nabla J(\theta) points in the direction of steepest ascent, so we subtract it to descend
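Putting the update rule and these parameters together, here is a minimal batch gradient descent sketch in numpy (the defaults alpha=0.1 and iters=200 are illustrative choices, not from the text):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=200):
    """Batch GD on the MSE cost; X is (n, d+1) with a leading column of ones."""
    n, d = X.shape
    theta = np.zeros(d)                       # initialize to zeros
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n      # gradient over all n examples
        theta -= alpha * grad                 # step against the gradient
    return theta

# Hypothetical usage on synthetic data with known parameters:
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=100)
print(gradient_descent(X, y))                 # approaches [1.0, 2.0, -0.5]
```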

Stochastic Gradient Descent

Instead of computing the gradient over all n examples, Stochastic Gradient Descent (SGD) uses one random example at a time:

Algorithm:

Initialize \theta^{(0)} randomly

For t = 0, 1, 2, \dots (until stopping):

    Sample an example i \in \{1, \dots, n\}

    \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla J^{(i)}(\theta^{(t)})

Where \nabla J^{(i)}(\theta) = \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} is the gradient for a single example.
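A matching SGD sketch (the step count and learning rate are illustrative):

```python
import numpy as np

def sgd(X, y, alpha=0.01, steps=5000, seed=0):
    """SGD on the MSE cost: one uniformly sampled example per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)                        # pick one example at random
        grad_i = (X[i] @ theta - y[i]) * X[i]      # single-example gradient
        theta -= alpha * grad_i
    return theta
```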

Advantages:

  • Much faster per iteration (no sum over all data)
  • Can escape local minima due to noise
  • Works well with online learning

Disadvantages:

  • Noisy updates; doesn't converge smoothly
  • May need learning rate decay

Mini-batch Gradient Descent

Mini-batch Gradient Descent is a compromise between batch GD and SGD, using a batch size s and epoch-based shuffling:

Algorithm:

Initialize \theta^{(0)} randomly, set j = 0

For e = 0, 1, 2, \dots (until stopping):    (every epoch gets a new random order)

    Shuffle all n data points randomly

    For t = 1, \dots, \lfloor n/s \rfloor:

        Collect the current batch B = \{x^{(1+(t-1)\cdot s)}, \dots, x^{(t \cdot s)}\}

        \theta^{(j+1)} = \theta^{(j)} - \alpha \nabla J^{B}(\theta^{(j)}), \quad j = j + 1

Where \nabla J^{B}(\theta) = \frac{1}{s} \sum_{i \in B} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} is the gradient over the batch.
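A minimal numpy sketch of this loop (the batch size and epoch count are illustrative; note the batch indices select both x and y):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, s=32, epochs=20, seed=0):
    """Mini-batch GD: reshuffle each epoch, then step through batches of size s."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # new random order every epoch
        for t in range(n // s):
            B = order[t * s:(t + 1) * s]           # indices of the current batch
            grad_B = X[B].T @ (X[B] @ theta - y[B]) / s
            theta -= alpha * grad_B
    return theta
```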

Advantages:

  • Balances speed and stability
  • Leverages vectorization for efficiency
  • Standard choice in deep learning (typical batch sizes: 32, 64, 128, 256)

Normal Equation

Analytical solution for the optimal \theta, where X is the n \times (d+1) design matrix whose rows are the training inputs x^{(i)} and y is the vector of targets:

\theta = (X^T X)^{-1} X^T y

Pros: No need to choose a learning rate, no iterations
Cons: Slow when the number of features d is large (inverting X^T X is O(d^3))
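In code, it is numerically preferable to solve the linear system rather than form the inverse explicitly; a minimal sketch (np.linalg.lstsq is a robust alternative when X^T X is singular):

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y directly instead of computing the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```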

Ridge Regression (L2 Regularization)

Regularization helps prevent overfitting by adding a penalty term to the cost function that discourages complex models.

Add the squared magnitude of coefficients to the cost function:

J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2n} \sum_{j=1}^{d} \theta_j^2
  • \lambda is the regularization parameter
  • Larger \lambda means more regularization (simpler model)
  • The penalty term does not include \theta_0

Effect: Shrinks coefficients towards zero but doesn't set them exactly to zero.
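Setting \nabla J(\theta) = 0 gives a closed form analogous to the normal equation; a minimal sketch, where zeroing the [0, 0] entry of the identity leaves \theta_0 unpenalized (the function name and default \lambda are ours):

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Closed-form ridge: theta = (X^T X + lam * I')^{-1} X^T y,
    with I'[0, 0] = 0 so the intercept theta_0 is not penalized."""
    d = X.shape[1]
    I = np.eye(d)
    I[0, 0] = 0.0                 # don't shrink the bias term
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)
```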