# Linear Regression

## What is Linear Regression

*Figure: Palo Alto Housing Prices: 3D Linear Regression.*
Linear Regression is a supervised learning algorithm that assumes a linear relationship between the features $X$ and the target $y$. We approximate the target $y$ as a linear function of $x$:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{i=0}^{d} \theta_i x_i = \theta^T x$$

with parameters (or weights) $\theta_i$, where $x_0 = 1$ by convention so that the sum includes the intercept $\theta_0$.
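As a quick sketch (not from the original notes): once we prepend $x_0 = 1$, the hypothesis is just a dot product. The function name `h` and the numbers below are illustrative.

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x, with x_0 = 1 prepended for the intercept."""
    return theta @ np.concatenate(([1.0], x))

theta = np.array([0.5, 2.0, -1.0])  # theta_0 (intercept), theta_1, theta_2
x = np.array([3.0, 4.0])            # one example with d = 2 features
print(h(theta, x))                  # 0.5 + 2.0*3.0 + (-1.0)*4.0 = 2.5
```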
Then, we minimize the least squares objective:
$$J(\theta) = \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

To see why least squares is a natural objective, consider the probabilistic model $y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$, where the noise terms $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed (iid).
$\mathcal{N}(\mu, \sigma^2)$ is the Normal (Gaussian) distribution with probability density function:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

## Maximum Likelihood Estimation (MLE)

Under this model, the likelihood of $\theta$ is:
$$L(\theta) = \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$

The log-likelihood is easier to work with:
$$\begin{aligned}
\ell(\theta) = \log L(\theta) &= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
\end{aligned}$$

The first term is constant in $\theta$, so maximizing $\ell(\theta)$ is equivalent to solving

$$\min_{\theta} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

The quantity $\sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2$ is also called the Sum of Squared Errors (SSE).
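A small numerical sanity check of this equivalence (a sketch with made-up simulation values): for a 1D model, the $\theta$ that maximizes the log-likelihood is the same $\theta$ that minimizes the SSE, because the log-likelihood is a decreasing affine function of the SSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = theta * x + eps with iid Gaussian noise (true theta = 2.0).
n, sigma, theta_true = 200, 0.5, 2.0
x = rng.uniform(-1, 1, size=n)
y = theta_true * x + rng.normal(0, sigma, size=n)

thetas = np.linspace(0, 4, 401)                    # candidate values of theta
resid = y[None, :] - thetas[:, None] * x[None, :]  # residuals for each candidate
sse = (resid ** 2).sum(axis=1)                     # sum of squared errors
# Log-likelihood from the derivation above (the first term is constant in theta).
loglik = -n * np.log(np.sqrt(2 * np.pi) * sigma) - sse / (2 * sigma ** 2)

# The SSE minimizer and the likelihood maximizer coincide.
print(thetas[np.argmin(sse)], thetas[np.argmax(loglik)])  # same value, close to 2.0
```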
## Cost Function

The Mean Squared Error (MSE) cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

The factor $1/2n$ can be added to the cost function because it is a constant with respect to $\theta$ in $\arg\min_\theta$: it does not affect the location of the minimum, but it makes the derivatives cleaner when optimizing. The gradient is:
$$\nabla J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$
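In vectorized NumPy these two formulas are one line each; a minimal sketch, assuming `X` is the $n \times (d+1)$ design matrix with a leading column of ones and `y` is the target vector:

```python
import numpy as np

def mse_cost(X, y, theta):
    """J(theta) = (1 / 2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    err = X @ theta - y          # h_theta(x) - y for all n examples at once
    return err @ err / (2 * len(y))

def mse_gradient(X, y, theta):
    """grad J(theta) = (1 / n) * sum_i (h_theta(x^(i)) - y^(i)) x^(i) = X^T err / n."""
    return X.T @ (X @ theta - y) / len(y)
```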
## Gradient Descent

Gradient Descent is an iterative optimization algorithm for finding a minimum of the cost function.

**Algorithm:**
Initialize $\theta^{(0)}$ randomly (or to zeros).
For $t = 0, 1, 2, \dots$ (until convergence):
$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla J(\theta^{(t)})$$

**Key parameters:**
- $\alpha$ is the learning rate (step size). Too small → slow convergence (more iterations needed); too large → the updates may overshoot and diverge.
- $t$ is the iteration number.
- The gradient $\nabla J(\theta)$ points in the direction of steepest ascent, so we subtract it to descend.
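A minimal batch gradient descent loop under the same design-matrix assumptions (`X` with a bias column of ones); the hyperparameter values are illustrative defaults, not prescriptions:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent on the MSE cost for linear regression."""
    theta = np.zeros(X.shape[1])              # initialize theta to zeros
    n = len(y)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / n      # gradient over the full dataset
        theta = theta - alpha * grad          # step against the gradient
    return theta

# Example: recover y ≈ 1 + 2x from synthetic data.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
X = np.column_stack([np.ones_like(x), x])     # prepend the x_0 = 1 column
y = 1 + 2 * x + rng.normal(0, 0.1, 100)
print(gradient_descent(X, y))                 # approximately [1.0, 2.0]
```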
## Stochastic Gradient Descent

Instead of computing the gradient over all $n$ examples, Stochastic Gradient Descent (SGD) uses one random example at a time:

**Algorithm:**
Initialize $\theta^{(0)}$ randomly.
For $t = 0, 1, 2, \dots$ (until stopping):
Sample an example $i \in \{1, \dots, n\}$ uniformly at random.
$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla J^{(i)}(\theta^{(t)})$$
where $\nabla J^{(i)}(\theta) = \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$ is the gradient for a single example.
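A minimal SGD sketch under the same design-matrix assumptions; sampling with replacement is one common variant (the notes don't pin this down), and the step count and learning rate are illustrative:

```python
import numpy as np

def sgd(X, y, alpha=0.01, num_steps=10_000, seed=0):
    """Stochastic gradient descent: one randomly sampled example per update."""
    rng = np.random.default_rng(seed)
    n = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        i = rng.integers(n)                    # sample one example at random
        grad_i = (X[i] @ theta - y[i]) * X[i]  # single-example gradient
        theta = theta - alpha * grad_i
    return theta
```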
**Advantages:**
- Much faster per iteration (no sum over all the data)
- Can escape local minima due to the noise in the updates
- Works well with online learning

**Disadvantages:**
- Noisy updates, so it does not converge smoothly
- May need learning rate decay

## Mini-batch Gradient Descent

Mini-batch Gradient Descent is a compromise between batch GD and SGD, using a batch size $s$ and epoch-based shuffling:
**Algorithm:**
Initialize $\theta^{(0)}$ randomly and set the update counter $j = 0$.

For $e = 0, 1, 2, \dots$ (until stopping), where every epoch gets a new random order:

1. Shuffle all $n$ data points randomly.
2. For $t = 1, \dots, \lfloor n/s \rfloor$:
   - Collect the current batch $B = \{x^{(1+(t-1) \cdot s)}, \dots, x^{(t \cdot s)}\}$
   - Update $\theta^{(j+1)} = \theta^{(j)} - \alpha \nabla J^B(\theta^{(j)})$ and set $j \leftarrow j + 1$

where $\nabla J^B(\theta) = \frac{1}{s} \sum_{i \in B} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$ is the gradient over the batch.
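A minimal mini-batch sketch following the epoch/shuffle structure above; like the pseudocode, it drops the leftover examples when $n$ is not a multiple of $s$ (hyperparameters are illustrative):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, s=32, num_epochs=50, seed=0):
    """Mini-batch gradient descent with a fresh shuffle every epoch."""
    rng = np.random.default_rng(seed)
    n = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        order = rng.permutation(n)             # new random order each epoch
        for t in range(n // s):                # floor(n / s) full batches
            B = order[t * s:(t + 1) * s]
            grad = X[B].T @ (X[B] @ theta - y[B]) / s  # gradient over the batch
            theta = theta - alpha * grad
    return theta
```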
**Advantages:**
- Balances speed and stability
- Leverages vectorization for efficiency
- The standard choice in deep learning (typical batch sizes: 32, 64, 128, 256)

## Normal Equation

The analytical solution for the optimal $\theta$:
$$\theta = (X^T X)^{-1} X^T y$$

**Pros:** No need to choose a learning rate, and no iterations.
**Cons:** Slow when the number of features $d$ is large, since $X^T X$ is $d \times d$ and computing its inverse is $O(d^3)$.
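A one-line sketch: solving the linear system is preferable to forming the inverse explicitly (cheaper and numerically more stable), and it assumes $X^T X$ is invertible, i.e. $X$ has full column rank.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y rather than computing the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```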
## Ridge Regression (L2 Regularization)

Regularization helps prevent overfitting by adding a penalty term to the cost function that discourages complex models.
We add the squared magnitudes of the coefficients to the cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2n} \sum_{j=1}^{d} \theta_j^2$$

- $\lambda$ is the regularization parameter: larger $\lambda$ means more regularization (a simpler model).
- The penalty term does not include $\theta_0$.

**Effect:** Shrinks the coefficients towards zero but does not set them exactly to zero.
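Ridge also has a closed-form solution, $\theta = (X^T X + \lambda I')^{-1} X^T y$, where $I'$ is the identity with a zero in position $(0, 0)$ so that $\theta_0$ is not penalized. A minimal sketch, assuming the first column of `X` is the all-ones bias column:

```python
import numpy as np

def ridge_closed_form(X, y, lam=1.0):
    """theta = (X^T X + lam * I')^{-1} X^T y, with I'[0, 0] = 0 so the
    intercept theta_0 is excluded from the penalty."""
    d = X.shape[1]
    penalty = lam * np.eye(d)
    penalty[0, 0] = 0.0               # do not regularize the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)
```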