
Linear Regression

[Interactive demo: Palo Alto Housing Prices, a 3D linear regression with adjustable learning rate (0.01 slow to 0.15 fast) and iteration count (0 to 200), showing the fitted intercept (θ₀), distance (θ₁), and room size (θ₂) parameters and the MSE.]

What is Linear Regression

Linear Regression is a supervised learning algorithm that assumes a linear relationship between the features (x) and the target (y). We approximate the target y as a linear function of x:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{i=0}^{d} \theta_i x_i = \theta^T x$$

with parameters (or weights) $\theta_i$, where by convention $x_0 = 1$ so the intercept $\theta_0$ folds into the sum.
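As a quick illustration (not from the original page; the feature values and weights below are made up), the hypothesis is just a dot product once the constant feature $x_0 = 1$ is prepended:

```python
import numpy as np

# Hypothetical example with d = 2 features (e.g. distance and room size),
# plus the constant x_0 = 1 that carries the intercept.
theta = np.array([0.5, -0.1, 0.3])   # [theta_0, theta_1, theta_2]
x = np.array([1.0, 2.5, 4.0])        # [x_0 = 1, x_1, x_2]

h = theta @ x                        # h_theta(x) = theta^T x
print(h)                             # 0.5 - 0.1*2.5 + 0.3*4.0 = 1.45
```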

Then, we minimize the least squares objective:

$$J(\theta) = \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Consider $y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$, where the $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed (i.i.d.).

$\mathcal{N}(\mu, \sigma^2)$ is the Normal (Gaussian) distribution with probability density function:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
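As a small sketch (values chosen arbitrarily), the density can be evaluated directly from this formula:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(gaussian_pdf(0.0))  # ~0.3989, the peak of the standard normal density
```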

Maximum Likelihood Estimation (MLE)

$$L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$

Log-Likelihood (easier to work with):

$$\begin{aligned}
\ell(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
\end{aligned}$$

The first term is constant in $\theta$, so maximizing $\ell(\theta)$ is equivalent to minimizing the sum of squared errors:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \ell(\theta) = \arg\min_\theta \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
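As a sanity check (a sketch with synthetic data; none of the names or numbers come from the page), maximizing the log-likelihood numerically recovers the same $\theta$ as an ordinary least-squares solve:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # x_0 = 1 column
theta_true = np.array([1.0, 2.0, -3.0])
sigma = 0.5
y = X @ theta_true + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(theta):
    # Up to additive constants, -l(theta) = SSE(theta) / (2 * sigma^2)
    r = y - X @ theta
    return r @ r / (2 * sigma**2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
theta_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_mle, theta_lsq, atol=1e-3))  # True: MLE == least squares
```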
tip

$\sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2$ is also called the Sum of Squared Errors (SSE).

Cost Function

The Mean Squared Error (MSE) cost function:

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
note

The factor $1/2n$ can be added to the cost function because it is a constant with respect to $\theta$ in $\arg\min_\theta$. It doesn't affect the location of the minimum, but it makes the derivatives easier (cleaner factors) when optimizing.

The gradient of the cost with respect to $\theta$ is:

$$\nabla J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$
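A vectorized sketch of this cost and its gradient (assuming the design matrix X already contains a leading column of ones):

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(theta) = (1 / 2n) * sum_i (h_theta(x_i) - y_i)^2."""
    n = len(y)
    r = X @ theta - y
    return r @ r / (2 * n)

def mse_gradient(theta, X, y):
    """grad J(theta) = (1 / n) * X^T (X theta - y)."""
    n = len(y)
    return X.T @ (X @ theta - y) / n
```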
Optimization Methods

To minimize this cost function, you can use Gradient Descent algorithms (Batch GD, SGD, Mini-batch GD) or the analytical Normal Equation. See the Gradient Descent page for details.
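A minimal batch gradient descent sketch for this cost; the learning rate and iteration count below are arbitrary defaults, not values from the page:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=200):
    """Batch gradient descent on the MSE cost; X should include a column of ones."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n   # gradient of (1/2n) * ||X theta - y||^2
        theta -= lr * grad
    return theta

# Hypothetical usage with synthetic data:
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)
print(gradient_descent(X, y))  # approaches [1.0, 2.0, -3.0]
```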

Normal Equation

Analytical solution to find optimal θ\theta:

$$\theta = (X^T X)^{-1} X^T y$$

Pros: no need to choose a learning rate, no iterations
Cons: slow when the number of features $d$ is large, since inverting $X^T X$ is $O(d^3)$

Matrix Calculus Derivation

Using matrix notation, the Linear Regression Loss can be written as:

$$L(\mathbf{w}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 = (\mathbf{X}\mathbf{w} - \mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y})$$

Taking the gradient with respect to w\mathbf{w}:

$$\frac{\partial L}{\partial \mathbf{w}} = 2\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y})$$

Setting the gradient to zero to find the minimum:

$$2\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y}) = 0$$
$$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$$

Solving for w\mathbf{w} gives the Normal Equation:

$$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
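A direct NumPy translation of this result; as a practical note, solving the linear system is preferred over forming the inverse explicitly:

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X w = X^T y for w (avoids computing an explicit inverse)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical usage (X includes a column of ones for the intercept):
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + 0.05 * rng.normal(size=50)
print(normal_equation(X, y))  # close to [0.5, 1.5, -2.0]
```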

Ridge Regression (L2 Regularization)

Regularization helps prevent overfitting by adding a penalty term to the cost function that discourages complex models.

Add the squared magnitude of coefficients to the cost function:

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2n} \sum_{j=1}^{d} \theta_j^2$$
  • $\lambda$ is the regularization parameter
  • Larger $\lambda$ means more regularization (a simpler model)
  • The penalty term does not include $\theta_0$

Effect: Shrinks coefficients towards zero but doesn't set them exactly to zero.
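Ridge also admits a closed-form solution (a standard result, not derived on this page): add $\lambda$ to the diagonal of $X^T X$ before solving, leaving the intercept unpenalized. A sketch:

```python
import numpy as np

def ridge_normal_equation(X, y, lam=1.0):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y.
    Assumes the first column of X is the intercept, which is not penalized."""
    d = X.shape[1]
    penalty = lam * np.eye(d)
    penalty[0, 0] = 0.0               # do not regularize theta_0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Hypothetical usage: a larger lam shrinks the non-intercept coefficients toward zero.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + 0.05 * rng.normal(size=50)
print(ridge_normal_equation(X, y, lam=0.0))    # ~ ordinary least squares
print(ridge_normal_equation(X, y, lam=100.0))  # coefficients pulled toward zero
```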