Linear Regression

[Figure: Palo Alto Housing Prices: 3D Linear Regression]
What is Linear Regression?

Linear Regression is a supervised learning algorithm that assumes a linear relationship between the features $x$ and the target $y$. We approximate the target $y$ as a linear function of $x$:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{i=0}^{d} \theta_i x_i = \theta^T x$$

with parameters (or weights) $\theta_i$, where $x_0 = 1$ by convention so that $\theta_0$ acts as the intercept.
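For example, a minimal NumPy sketch of the hypothesis for a single example (the feature values and parameters are made up for illustration):

```python
import numpy as np

# Hypothesis h_theta(x) = theta^T x, with x_0 = 1 prepended for the intercept
theta = np.array([50.0, 0.3, 20.0])   # [theta_0, theta_1, theta_2], illustrative values
x = np.array([1.0, 1500.0, 3.0])      # [x_0 = 1, feature_1, feature_2], made-up example

y_hat = theta @ x                     # theta^T x
print(y_hat)                          # 50 + 0.3 * 1500 + 20 * 3 = 560.0
```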
Then, we minimize the least squares objective:
$$J(\theta) = \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Consider $y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$, where the errors $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed (iid).
$\mathcal{N}(\mu, \sigma^2)$ is the Normal (Gaussian) distribution with probability density function:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Maximum Likelihood Estimation (MLE)
$$L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$

The log-likelihood is easier to work with:
$$\ell(\theta) = \log L(\theta) = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$

$$= \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)\right]$$

$$= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

Since the first term does not depend on $\theta$, maximizing $\ell(\theta)$ is equivalent to minimizing

$$\sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2,$$

which is also called the Sum of Squared Errors (SSE).
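To make the equivalence concrete, a small NumPy sketch on synthetic data: for two candidate values of $\theta$ (made up for illustration), the one with the larger Gaussian log-likelihood is exactly the one with the smaller SSE.

```python
import numpy as np

# Synthetic data generated according to y = theta^T x + eps, eps ~ N(0, sigma^2)
rng = np.random.default_rng(0)
n, sigma = 200, 1.0
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])   # x_0 = 1 plus one feature
theta_true = np.array([2.0, 3.0])
y = X @ theta_true + rng.normal(0, sigma, n)

def log_likelihood(theta):
    resid = y - X @ theta
    return n * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - np.sum(resid ** 2) / (2 * sigma ** 2)

def sse(theta):
    return np.sum((y - X @ theta) ** 2)

for theta in (np.array([2.0, 3.0]), np.array([0.0, 1.0])):
    print(theta, log_likelihood(theta), sse(theta))
# The theta with the larger log-likelihood is the one with the smaller SSE.
```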
Cost Function

The Mean Squared Error (MSE) cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

The factor $\frac{1}{2n}$ can be added to the cost function because it is a constant with respect to $\theta$ in $\arg\min_\theta$.
It doesn't affect the location of the minimum, but makes derivatives easier (cleaner factors) when optimizing.
$$\nabla J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$

To minimize this cost function, you can use Gradient Descent algorithms (Batch GD, SGD, Mini-batch GD) or the analytical Normal Equation. See the Gradient Descent page for details.
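A minimal NumPy sketch of batch gradient descent on this cost (synthetic data; the learning rate and iteration count are arbitrary, untuned choices):

```python
import numpy as np

# Synthetic data: one feature plus intercept column x_0 = 1
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 5, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, n)

theta = np.zeros(X.shape[1])
alpha = 0.05                              # learning rate (arbitrary choice)
for _ in range(2000):                     # fixed iteration budget (arbitrary choice)
    grad = X.T @ (X @ theta - y) / n      # gradient of J(theta) = (1/2n) * SSE
    theta -= alpha * grad

print(theta)                              # should be close to [1.0, 2.0]
```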
Normal Equation

Analytical solution to find the optimal $\theta$:
$$\theta = (X^T X)^{-1} X^T y$$

Pros: no need to choose a learning rate, no iterations.
Cons: slow if the number of features $d$ is large (computing the inverse of the $d \times d$ matrix $X^T X$ is $O(d^3)$).
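A minimal NumPy sketch of the Normal Equation on synthetic data (np.linalg.solve is used rather than forming the explicit inverse, which is the usual numerically safer choice):

```python
import numpy as np

# Synthetic data: intercept column plus two features
rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 5, (n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.5, n)

# theta = (X^T X)^{-1} X^T y, computed by solving the linear system X^T X theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # should be close to [1.0, 2.0, -0.5]
```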
Matrix Calculus Derivation

Using matrix notation, the Linear Regression loss can be written as:
$$L(\mathbf{w}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 = (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y})$$

Taking the gradient with respect to $\mathbf{w}$:
$$\frac{\partial L}{\partial \mathbf{w}} = 2\mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y})$$

Setting the gradient to zero to find the minimum:
$$2\mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y}) = 0 \quad\Rightarrow\quad \mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}$$

Solving for $\mathbf{w}$ gives the Normal Equation:
$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

Ridge Regression (L2 Regularization)

Regularization helps prevent overfitting by adding a penalty term to the cost function that discourages complex models.
Add the squared magnitude of coefficients to the cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2n} \sum_{j=1}^{d} \theta_j^2$$

$\lambda$ is the regularization parameter: larger $\lambda$ means more regularization (a simpler model). The penalty term does not include $\theta_0$.

Effect: shrinks the coefficients towards zero but does not set them exactly to zero.
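A minimal sketch of the ridge closed-form solution $\theta = (X^T X + \lambda D)^{-1} X^T y$, where $D$ is the identity matrix with a zero in the $\theta_0$ position so the intercept is not penalized (the data and the value of $\lambda$ are made up for illustration; the closed form follows by setting the gradient of the regularized cost to zero):

```python
import numpy as np

# Synthetic data: intercept column plus two features
rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 5, (n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.5, n)

lam = 10.0                     # regularization strength lambda (arbitrary choice)
D = np.eye(X.shape[1])
D[0, 0] = 0.0                  # exclude theta_0 from the penalty

# Closed form: theta = (X^T X + lambda * D)^{-1} X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ols)
print(theta_ridge)             # non-intercept coefficients are shrunk toward zero
```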