# Linear Algebra

This guide covers the essential linear algebra concepts needed for machine learning, including vectors, matrices, matrix calculus, and their applications.
## Basics

**Vector**: $\mathbf{x} = [x_1, x_2, \dots, x_n]^T$
**Matrix**: $\mathbf{A} \in \mathbb{R}^{m \times n}$
$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$

**Transpose**: $(\mathbf{A}^T)_{ij} = \mathbf{A}_{ji}$
**Matrix Multiplication**: $\mathbf{C} = \mathbf{AB}$ where $C_{ij} = \sum_k A_{ik}B_{kj}$
**Identity Matrix**: $\mathbf{I}_n$ where $I_{ij} = 1$ if $i = j$, else $0$
**Inverse**: $\mathbf{A}^{-1}\mathbf{A} = \mathbf{AA}^{-1} = \mathbf{I}$
**Properties**:

- $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$
- $(\mathbf{A}^T)^T = \mathbf{A}$
- $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$
- $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$
- $(\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T$
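These definitions map directly onto NumPy. Below is a minimal sketch (the matrices and vector are arbitrary examples) that exercises the basic operations and spot-checks two of the properties numerically:

```python
import numpy as np

# Small example matrices and a vector (arbitrary values)
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 5.0]])
x = np.array([1.0, -2.0])

C = A @ B                   # matrix multiplication: C_ij = sum_k A_ik B_kj
Ax = A @ x                  # matrix-vector product
At = A.T                    # transpose
A_inv = np.linalg.inv(A)    # inverse (A must be square and non-singular)
I = np.eye(2)               # identity matrix I_2

# Defining property of the inverse, plus two of the identities above
assert np.allclose(A_inv @ A, I)
assert np.allclose((A @ B).T, B.T @ A.T)                                        # (AB)^T = B^T A^T
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))   # (AB)^-1 = B^-1 A^-1
```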
## Trace and Determinant

**Trace**: Sum of diagonal elements

$$\text{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}$$

**Properties**:
- $\text{tr}(\mathbf{A} + \mathbf{B}) = \text{tr}(\mathbf{A}) + \text{tr}(\mathbf{B})$
- $\text{tr}(\mathbf{AB}) = \text{tr}(\mathbf{BA})$
- $\text{tr}(\mathbf{A}^T) = \text{tr}(\mathbf{A})$

**Determinant**: $\det(\mathbf{A})$ or $|\mathbf{A}|$
For a 2×2 matrix:
$$\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc$$
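In NumPy, `np.trace` and `np.linalg.det` compute these directly. A minimal sketch with arbitrary 2×2 matrices:

```python
import numpy as np

A = np.array([[3.0, 1.0], [2.0, 4.0]])   # arbitrary 2x2 example

print(np.trace(A))        # 3 + 4 = 7.0
print(np.linalg.det(A))   # 3*4 - 1*2 = 10.0 (up to floating-point error)

# tr(AB) = tr(BA) holds even though AB != BA in general
B = np.array([[0.0, 2.0], [1.0, 1.0]])
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```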
## Eigenvalues and Eigenvectors

For a square matrix $\mathbf{A}$:

$$\mathbf{Av} = \lambda \mathbf{v}$$

where $\lambda$ is an eigenvalue and $\mathbf{v}$ is the corresponding eigenvector.
**Properties**:
- $\det(\mathbf{A}) = \prod_i \lambda_i$
- $\text{tr}(\mathbf{A}) = \sum_i \lambda_i$
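A minimal NumPy sketch (the symmetric matrix below is an arbitrary example) that computes an eigendecomposition and checks both properties numerically:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])   # arbitrary symmetric example

eigvals, eigvecs = np.linalg.eig(A)      # columns of eigvecs are the eigenvectors

# Av = lambda * v for each eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# det(A) = product of eigenvalues, tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), np.prod(eigvals))
assert np.isclose(np.trace(A), np.sum(eigvals))
```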
## Matrix Calculus

**Matrix-Vector Product Derivatives**:

$$\frac{\partial \mathbf{Ax}}{\partial \mathbf{A}} = \mathbf{x}^T, \qquad \frac{\partial \mathbf{Ax}}{\partial \mathbf{x}} = \mathbf{A}$$

**Matrix Product Derivatives**:
$$\frac{\partial \mathbf{AB}}{\partial \mathbf{A}} = \mathbf{B}^T, \qquad \frac{\partial \mathbf{AB}}{\partial \mathbf{B}} = \mathbf{A}^T$$
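One concrete way to read this shorthand is through a scalar function of the product, e.g. $f(\mathbf{A}) = \text{tr}(\mathbf{AB})$, whose gradient with respect to $\mathbf{A}$ is $\mathbf{B}^T$. A minimal finite-difference sketch under that reading (random matrices, hypothetical helper `numerical_grad`):

```python
import numpy as np

def numerical_grad(f, A, eps=1e-6):
    """Central-difference gradient of a scalar function f with respect to matrix A."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        grad[idx] = (f(A_plus) - f(A_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 3))

f = lambda A_: np.trace(A_ @ B)                             # scalar function of A
assert np.allclose(numerical_grad(f, A), B.T, atol=1e-5)    # d tr(AB)/dA = B^T
```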
**Dot Product Derivatives**:

For $z = \mathbf{y}^T\mathbf{x}$:
$$\frac{\partial z}{\partial \mathbf{x}} = \mathbf{y}, \quad \frac{\partial z}{\partial \mathbf{y}} = \mathbf{x}$$

**Quadratic Form** (very important!):
For $f(\mathbf{x}) = \mathbf{x}^T\mathbf{Ax}$:
$$\frac{\partial (\mathbf{x}^T\mathbf{Ax})}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$$

If $\mathbf{A}$ is symmetric:
$$\frac{\partial (\mathbf{x}^T\mathbf{Ax})}{\partial \mathbf{x}} = 2\mathbf{Ax}$$
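A minimal numerical check of the quadratic-form gradient, assuming SciPy is available for its forward-difference helper `approx_fprime` (the matrix and vector are arbitrary random examples; $\mathbf{A} + \mathbf{A}^T$ serves as the symmetric case since it is symmetric by construction):

```python
import numpy as np
from scipy.optimize import approx_fprime   # forward-difference numerical gradient

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))            # general (non-symmetric) matrix
x = rng.standard_normal(4)
eps = 1e-7

f = lambda x_: x_ @ A @ x_                 # f(x) = x^T A x

# General case: gradient is (A + A^T) x
assert np.allclose(approx_fprime(x, f, eps), (A + A.T) @ x, atol=1e-4)

# Symmetric case: gradient reduces to 2 S x
S = A + A.T
g = lambda x_: x_ @ S @ x_
assert np.allclose(approx_fprime(x, g, eps), 2 * S @ x, atol=1e-4)
```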
**Linear Form**:

$$\frac{\partial (\mathbf{a}^T\mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}, \qquad \frac{\partial (\mathbf{x}^T\mathbf{a})}{\partial \mathbf{x}} = \mathbf{a}$$

**Norm Squared**:
$$\frac{\partial (\mathbf{x}^T\mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{x}, \qquad \frac{\partial \|\mathbf{x}\|^2}{\partial \mathbf{x}} = 2\mathbf{x}$$
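The same finite-difference idea verifies the dot-product, linear-form, and norm-squared identities above (again assuming SciPy is available; the vectors are arbitrary, with $\mathbf{a}$ playing the role of $\mathbf{y}$ in the dot-product case):

```python
import numpy as np
from scipy.optimize import approx_fprime   # forward-difference numerical gradient

rng = np.random.default_rng(1)
a = rng.standard_normal(5)
x = rng.standard_normal(5)
eps = 1e-7

# d(a^T x)/dx = a  (also the dot-product identity dz/dx = y with y = a)
assert np.allclose(approx_fprime(x, lambda x_: a @ x_, eps), a, atol=1e-4)

# d(x^T x)/dx = d||x||^2/dx = 2x
assert np.allclose(approx_fprime(x, lambda x_: x_ @ x_, eps), 2 * x, atol=1e-4)
assert np.allclose(approx_fprime(x, lambda x_: np.sum(x_**2), eps), 2 * x, atol=1e-4)
```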
## Common ML Applications

**Linear Regression Loss**:

$$L(\mathbf{w}) = \|\mathbf{Xw} - \mathbf{y}\|^2 = (\mathbf{Xw} - \mathbf{y})^T(\mathbf{Xw} - \mathbf{y})$$

$$\frac{\partial L}{\partial \mathbf{w}} = 2\mathbf{X}^T(\mathbf{Xw} - \mathbf{y})$$

Setting the gradient to zero gives the **Normal Equation**:
$$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
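A minimal NumPy sketch of this on synthetic data (the sizes, true weights, and noise level below are arbitrary); `np.linalg.solve` is applied to $\mathbf{X}^T\mathbf{X}$ rather than forming the explicit inverse, which is the numerically preferable way to evaluate the same formula:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 3
X = rng.standard_normal((n, d))                    # design matrix (n samples, d features)
w_true = np.array([2.0, -1.0, 0.5])                # "true" weights for the synthetic data
y = X @ w_true + 0.01 * rng.standard_normal(n)     # targets with a little noise

# Normal equation: w = (X^T X)^{-1} X^T y, solved without an explicit inverse
w = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient 2 X^T (Xw - y) should be (numerically) zero at the solution
grad = 2 * X.T @ (X @ w - y)
assert np.allclose(grad, 0, atol=1e-6)
print(w)   # close to w_true
```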