# Linear Algebra

This guide covers the essential linear algebra concepts needed for machine learning, including vectors, matrices, matrix calculus, and their applications.
## Basics

**Vector**: $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$

**Matrix**: $\mathbf{A} \in \mathbb{R}^{m \times n}$

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$

**Transpose**: $(\mathbf{A}^T)_{ij} = \mathbf{A}_{ji}$

**Matrix Multiplication**: $\mathbf{C} = \mathbf{AB}$ where $C_{ij} = \sum_k A_{ik}B_{kj}$

**Identity Matrix**: $\mathbf{I}_n$ where $I_{ij} = 1$ if $i = j$, else $0$

**Inverse**: $\mathbf{A}^{-1}\mathbf{A} = \mathbf{AA}^{-1} = \mathbf{I}$
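A quick sanity check of these definitions with NumPy (a minimal sketch; the matrices here are arbitrary examples):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])   # 2x2 example matrix
B = np.array([[5., 6.], [7., 8.]])

# Matrix multiplication: C_ij = sum_k A_ik * B_kj
C = A @ B

# Transpose: (A^T)_ij = A_ji
print(A.T)

# Identity and inverse: A^{-1} A = A A^{-1} = I
I = np.eye(2)
A_inv = np.linalg.inv(A)
print(np.allclose(A_inv @ A, I), np.allclose(A @ A_inv, I))  # True True
```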
**Solving Matrix Equations**: For $\mathbf{A}\mathbf{x} = \mathbf{b}$ where $\mathbf{A}$ is invertible:

$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$$

This is used to isolate variables in matrix equations (e.g., in the normal equations for linear regression).
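For example, a small system can be solved directly (a minimal sketch; in practice `np.linalg.solve` is preferred over forming the inverse explicitly):

```python
import numpy as np

A = np.array([[3., 1.], [1., 2.]])
b = np.array([9., 8.])

x_via_inverse = np.linalg.inv(A) @ b   # x = A^{-1} b, as in the formula above
x_via_solve = np.linalg.solve(A, b)    # numerically preferable: no explicit inverse

print(np.allclose(x_via_inverse, x_via_solve))  # True
print(x_via_solve)                               # [2. 3.]
```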
**Properties**:

- $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$
- $(\mathbf{A}^T)^T = \mathbf{A}$
- $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$
- $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$
- $(\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T$
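These identities are easy to verify numerically on random matrices (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

print(np.allclose((A @ B).T, B.T @ A.T))                    # (AB)^T = B^T A^T
print(np.allclose((A + B).T, A.T + B.T))                    # (A+B)^T = A^T + B^T
print(np.allclose(np.linalg.inv(A @ B),
                  np.linalg.inv(B) @ np.linalg.inv(A)))     # (AB)^{-1} = B^{-1} A^{-1}
print(np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T))  # (A^T)^{-1} = (A^{-1})^T
```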
**Block Matrix Multiplication**: When matrices are partitioned into blocks, multiplication follows the same rules.

For $\mathbf{A} = [\mathbf{A}_1 \; \mathbf{A}_2]$ (horizontal partition) and $\mathbf{B} = \begin{bmatrix} \mathbf{B}_1 \\ \mathbf{B}_2 \end{bmatrix}$ (vertical partition):

$$\mathbf{AB} = [\mathbf{A}_1 \; \mathbf{A}_2] \begin{bmatrix} \mathbf{B}_1 \\ \mathbf{B}_2 \end{bmatrix} = \mathbf{A}_1\mathbf{B}_1 + \mathbf{A}_2\mathbf{B}_2$$

For partitioned matrices $\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}$ and $\begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix}$:

$$\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22} \end{bmatrix}$$

**Key Pattern**: Block multiplication works like regular matrix multiplication, treating blocks as scalars (while preserving the order of the factors, since blocks do not commute).
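The 2×2 block formula can be checked directly (a minimal sketch with an arbitrary 4×4 partition):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# Partition each matrix into four 2x2 blocks.
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]
B11, B12 = B[:2, :2], B[:2, 2:]
B21, B22 = B[2:, :2], B[2:, 2:]

# Multiply block-wise, then reassemble.
top = np.hstack([A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22])
bottom = np.hstack([A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22])
blockwise = np.vstack([top, bottom])

print(np.allclose(blockwise, A @ B))  # True
```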
## Trace and Determinant

**Trace**: Sum of diagonal elements

$$\text{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}$$

**Properties**:

- $\text{tr}(\mathbf{A} + \mathbf{B}) = \text{tr}(\mathbf{A}) + \text{tr}(\mathbf{B})$
- $\text{tr}(\mathbf{AB}) = \text{tr}(\mathbf{BA})$
- $\text{tr}(\mathbf{A}^T) = \text{tr}(\mathbf{A})$

**Determinant**: $\det(\mathbf{A})$ or $|\mathbf{A}|$

For a 2×2 matrix:

$$\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc$$
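Both quantities are available directly in NumPy (a minimal sketch):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])

print(np.trace(A))          # 1 + 4 = 5.0
print(np.linalg.det(A))     # 1*4 - 2*3 = -2.0 (up to floating-point error)

# tr(AB) = tr(BA), even when AB and BA have different shapes
B = np.array([[0., 1., 2.], [3., 4., 5.]])    # 2x3
C = np.array([[1., 0.], [0., 1.], [1., 1.]])  # 3x2
print(np.isclose(np.trace(B @ C), np.trace(C @ B)))  # True
```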
## Eigenvalues and Eigenvectors

For a square matrix $\mathbf{A}$:

$$\mathbf{Av} = \lambda \mathbf{v}$$

where $\lambda$ is an eigenvalue and $\mathbf{v} \neq \mathbf{0}$ is the corresponding eigenvector.
**Diagonalizable Matrix**: A matrix $\mathbf{A}$ is diagonalizable if it can be written as

$$\mathbf{A} = \mathbf{PDP}^{-1}$$

where:

- $\mathbf{D}$ is a diagonal matrix containing the eigenvalues of $\mathbf{A}$
- $\mathbf{P}$ is an invertible matrix whose columns are the eigenvectors of $\mathbf{A}$ (corresponding to the eigenvalues in $\mathbf{D}$)

**Conditions for Diagonalizability**:

- $\mathbf{A}$ has $n$ linearly independent eigenvectors (where $n$ is the dimension of $\mathbf{A}$)
- If $\mathbf{A}$ is real and symmetric, it is always diagonalizable, with an orthogonal $\mathbf{P}$ (Spectral Theorem)

**Properties**:

- $\det(\mathbf{A}) = \prod_i \lambda_i$
- $\text{tr}(\mathbf{A}) = \sum_i \lambda_i$
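A symmetric example makes the decomposition and the trace/determinant identities easy to check (a minimal sketch):

```python
import numpy as np

A = np.array([[2., 1.], [1., 2.]])              # symmetric, hence diagonalizable

eigvals, P = np.linalg.eigh(A)                  # eigh: for symmetric/Hermitian matrices
D = np.diag(eigvals)

print(np.allclose(A, P @ D @ np.linalg.inv(P)))        # A = P D P^{-1}
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))  # det(A) = product of eigenvalues
print(np.isclose(np.trace(A), np.sum(eigvals)))        # tr(A) = sum of eigenvalues
```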
## Norms

**Norm**: A norm is a function $N: V \longrightarrow [0, +\infty[$ where $V$ is a vector space, such that for all $\boldsymbol{x}, \boldsymbol{y} \in V$, we have:

- $N(\boldsymbol{x} + \boldsymbol{y}) \leq N(\boldsymbol{x}) + N(\boldsymbol{y})$ (triangle inequality)
- $N(a\boldsymbol{x}) = |a|N(\boldsymbol{x})$ for $a$ a scalar (absolute homogeneity)
- If $N(\boldsymbol{x}) = 0$, then $\boldsymbol{x} = \boldsymbol{0}$ (positive definiteness)

For $\boldsymbol{x} \in V$, the most commonly used norms are:
| Norm | Notation | Definition | Use case |
|---|---|---|---|
| Manhattan, $L^1$ | $\Vert \boldsymbol{x} \Vert_1$ | $\displaystyle\sum_{i=1}^{n} \vert x_i \vert$ | LASSO regularization |
| Euclidean, $L^2$ | $\Vert \boldsymbol{x} \Vert_2$ | $\displaystyle\sqrt{\sum_{i=1}^{n} x_i^2}$ | Ridge regularization |
| $p$-norm, $L^p$ | $\Vert \boldsymbol{x} \Vert_p$ | $\displaystyle\left(\sum_{i=1}^{n} \vert x_i \vert^p\right)^{\frac{1}{p}}$ | Hölder inequality |
| Infinity, $L^\infty$ | $\Vert \boldsymbol{x} \Vert_\infty$ | $\displaystyle\max_i \vert x_i \vert$ | Uniform convergence |
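NumPy's `np.linalg.norm` computes all of these via the `ord` argument (a minimal sketch):

```python
import numpy as np

x = np.array([3., -4., 0.])

print(np.linalg.norm(x, ord=1))       # L1: |3| + |-4| + |0| = 7.0
print(np.linalg.norm(x, ord=2))       # L2: sqrt(9 + 16) = 5.0
print(np.linalg.norm(x, ord=3))       # general p-norm, here p = 3
print(np.linalg.norm(x, ord=np.inf))  # L-infinity: max |x_i| = 4.0
```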
## Matrix Calculus

**Matrix-Vector Product Derivatives**:

$$\frac{\partial \mathbf{Ax}}{\partial \mathbf{A}} = \mathbf{x}^T, \qquad \frac{\partial \mathbf{Ax}}{\partial \mathbf{x}} = \mathbf{A}$$

**Matrix Product Derivatives**:

$$\frac{\partial \mathbf{AB}}{\partial \mathbf{A}} = \mathbf{B}^T, \qquad \frac{\partial \mathbf{AB}}{\partial \mathbf{B}} = \mathbf{A}^T$$

Here the derivatives with respect to a matrix are the shape-level shorthands used when backpropagating a scalar loss; strictly speaking, the full derivatives are higher-order tensors.

**Chain of Matrix-Vector Products**:
For $\mathbf{Y} = \mathbf{EXw}$ where $\mathbf{E}$ and $\mathbf{X}$ are matrices and $\mathbf{w}$ is a vector:

$$\frac{\partial (\mathbf{EXw})}{\partial \mathbf{w}} = \mathbf{EX}, \qquad \frac{\partial (\mathbf{EXw})}{\partial \mathbf{X}} = \mathbf{E}^T, \qquad \frac{\partial (\mathbf{EXw})}{\partial \mathbf{E}} = (\mathbf{Xw})^T$$

Read these as chain-rule patterns: for a scalar loss $L$ with upstream gradient $\mathbf{g} = \partial L / \partial \mathbf{Y}$, the gradients are $\partial L / \partial \mathbf{w} = (\mathbf{EX})^T \mathbf{g}$, $\partial L / \partial \mathbf{X} = \mathbf{E}^T \mathbf{g} \mathbf{w}^T$, and $\partial L / \partial \mathbf{E} = \mathbf{g} (\mathbf{Xw})^T$ (see the sketch below).
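A finite-difference check of those chain-rule patterns (a minimal sketch; the shapes are arbitrary and the loss is just the sum of the entries of $\mathbf{Y}$, so the upstream gradient is a vector of ones):

```python
import numpy as np

rng = np.random.default_rng(2)
E = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
w = rng.standard_normal(5)

def loss(E, X, w):
    return np.sum(E @ X @ w)           # scalar loss L = sum(Y)

g = np.ones(3)                         # upstream gradient dL/dY for L = sum(Y)

# Analytic gradients from the chain-rule patterns above
grad_w = (E @ X).T @ g
grad_X = E.T @ np.outer(g, w)          # E^T g w^T
grad_E = np.outer(g, X @ w)            # g (Xw)^T

# Numerical gradient w.r.t. w (central finite differences)
eps = 1e-6
num_grad_w = np.array([
    (loss(E, X, w + eps * np.eye(5)[i]) - loss(E, X, w - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

print(np.allclose(grad_w, num_grad_w, atol=1e-5))  # True
print(grad_X.shape, grad_E.shape)                   # (4, 5) (3, 4)
```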
**Dot Product Derivatives**:

For $z = \mathbf{y}^T\mathbf{x}$:

$$\frac{\partial z}{\partial \mathbf{x}} = \mathbf{y}, \quad \frac{\partial z}{\partial \mathbf{y}} = \mathbf{x}$$

**Quadratic Form** (very important!):
For $f(\mathbf{x}) = \mathbf{x}^T\mathbf{Ax}$:

$$\frac{\partial (\mathbf{x}^T\mathbf{Ax})}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$$

If $\mathbf{A}$ is symmetric:

$$\frac{\partial (\mathbf{x}^T\mathbf{Ax})}{\partial \mathbf{x}} = 2\mathbf{Ax}$$
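A quick numerical check of the quadratic-form gradient (a minimal sketch using central finite differences on an arbitrary non-symmetric $\mathbf{A}$):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))        # deliberately not symmetric
x = rng.standard_normal(4)

f = lambda x: x @ A @ x                # f(x) = x^T A x

analytic = (A + A.T) @ x               # (A + A^T) x

eps = 1e-6
numeric = np.array([
    (f(x + eps * np.eye(4)[i]) - f(x - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```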
**Linear Form**:

$$\frac{\partial (\mathbf{a}^T\mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}, \qquad \frac{\partial (\mathbf{x}^T\mathbf{a})}{\partial \mathbf{x}} = \mathbf{a}$$
**Function Linearity in Parameters**:

A function $f(\theta)$ is linear in $\theta$ if it can be written as a dot product $\mathbf{a}^{\top}\theta$ (or $\theta^{\top}\mathbf{a}$), where $\mathbf{a}$ is a vector of constants or input features that do not depend on $\theta$.
**Example**: Consider the model:

$$y = \theta^{\top}\left(\frac{\mathbf{x}}{||\mathbf{x}||_{2}}\right)$$

The term $\frac{\mathbf{x}}{||\mathbf{x}||_{2}}$ is a transformed feature vector (call it $\mathbf{x}'$) that does not depend on $\theta$. Thus, $y = \theta^{\top}\mathbf{x}'$ is a linear function of $\theta$, even though it involves normalization of the input features.
**Spherical Codes**: When $||\mathbf{x}||_2 = 1$, the feature vector lies on the unit sphere (unit circle in 2D, unit sphere in 3D, etc.). This normalization is common in machine learning because:

- It removes the overall scale of each input vector
- The dot product $\theta^{\top}\mathbf{x}$ directly relates to the angle between $\theta$ and $\mathbf{x}$: $\theta^{\top}\mathbf{x} = ||\theta||_2 \cos(\angle(\theta, \mathbf{x}))$
- It simplifies optimization and makes models more robust to feature scaling

**Key Point**: A function can be linear in the parameters $\theta$ even if it involves non-linear transformations of the input features $\mathbf{x}$, as long as those transformations do not depend on $\theta$ (see the sketch below).
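A small check that the normalized-feature model is linear in $\theta$ (a minimal sketch: linearity means $f(a\theta_1 + b\theta_2) = a f(\theta_1) + b f(\theta_2)$ for the hypothetical model below):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(5)
x_prime = x / np.linalg.norm(x)        # transformed feature vector, independent of theta

def model(theta):
    return theta @ x_prime             # y = theta^T (x / ||x||_2)

theta1, theta2 = rng.standard_normal(5), rng.standard_normal(5)
a, b = 2.0, -0.5

lhs = model(a * theta1 + b * theta2)
rhs = a * model(theta1) + b * model(theta2)
print(np.isclose(lhs, rhs))            # True: the model is linear in theta
```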
**Norm Squared**:

$$\frac{\partial (\mathbf{x}^T\mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{x}, \qquad \frac{\partial \|\mathbf{x}\|^2}{\partial \mathbf{x}} = 2\mathbf{x}$$
**Hessian Matrix**:

For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix is the matrix of second-order partial derivatives:

$$\mathbf{H} = \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

Or more compactly: $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$
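For the quadratic form above, the Hessian is $\mathbf{A} + \mathbf{A}^T$, which is easy to confirm with finite differences (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3))
f = lambda x: x @ A @ x                          # f(x) = x^T A x, Hessian = A + A^T

def numerical_hessian(f, x, eps=1e-4):
    n = x.size
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            # central second difference for d^2 f / (dx_i dx_j)
            H[i, j] = (f(x + eps * I[i] + eps * I[j]) - f(x + eps * I[i] - eps * I[j])
                       - f(x - eps * I[i] + eps * I[j]) + f(x - eps * I[i] - eps * I[j])) / (4 * eps**2)
    return H

x0 = rng.standard_normal(3)
print(np.allclose(numerical_hessian(f, x0), A + A.T, atol=1e-4))  # True
```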
## Matrix Definiteness

For a symmetric matrix $\mathbf{H}$, we can classify its definiteness based on the quadratic form $\mathbf{z}^T\mathbf{H}\mathbf{z}$:

| Type | Condition | Eigenvalues | Use in ML |
|---|---|---|---|
| Positive Definite (PD) | $\mathbf{z}^T\mathbf{H}\mathbf{z} > 0$ for all $\mathbf{z} \neq \mathbf{0}$ | All $\lambda_i > 0$ | Convex optimization (unique minimum) |
| Positive Semi-Definite (PSD) | $\mathbf{z}^T\mathbf{H}\mathbf{z} \geq 0$ for all $\mathbf{z}$ | All $\lambda_i \geq 0$ | Convex functions |
| Negative Definite (ND) | $\mathbf{z}^T\mathbf{H}\mathbf{z} < 0$ for all $\mathbf{z} \neq \mathbf{0}$ | All $\lambda_i < 0$ | Concave optimization (unique maximum) |
| Negative Semi-Definite (NSD) | $\mathbf{z}^T\mathbf{H}\mathbf{z} \leq 0$ for all $\mathbf{z}$ | All $\lambda_i \leq 0$ | Concave functions |
| Indefinite | Otherwise | Mixed signs | Saddle points |
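For a symmetric matrix, definiteness can be checked from the eigenvalue signs (a minimal sketch; the tolerance handles floating-point round-off):

```python
import numpy as np

def classify_definiteness(H, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigvals = np.linalg.eigvalsh(H)          # real eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semi-definite"
    if np.all(eigvals < -tol):
        return "negative definite"
    if np.all(eigvals <= tol):
        return "negative semi-definite"
    return "indefinite"

print(classify_definiteness(np.array([[2., 0.], [0., 3.]])))   # positive definite
print(classify_definiteness(np.array([[1., 1.], [1., 1.]])))   # positive semi-definite (eigenvalues 0, 2)
print(classify_definiteness(np.array([[1., 0.], [0., -1.]])))  # indefinite
```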
**Connection to Convexity**:

For a twice-differentiable function $f(\mathbf{x})$:

- $f$ is convex if and only if $\nabla^2 f(\mathbf{x})$ (the Hessian) is PSD everywhere
- $f$ is strictly convex if $\nabla^2 f(\mathbf{x})$ is PD everywhere (the converse does not hold; $f(x) = x^4$ is strictly convex but has zero second derivative at the origin)
- $f$ is concave if and only if $\nabla^2 f(\mathbf{x})$ is NSD everywhere
- $f$ is strictly concave if $\nabla^2 f(\mathbf{x})$ is ND everywhere
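As an illustration of the PSD condition, the least-squares objective $f(\theta) = \|\mathbf{X}\theta - \mathbf{y}\|_2^2$ has Hessian $2\mathbf{X}^T\mathbf{X}$, which is always PSD, so the objective is convex (a minimal sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((10, 3))       # design matrix (10 samples, 3 features)
y = rng.standard_normal(10)

# f(theta) = ||X theta - y||^2 has gradient 2 X^T (X theta - y) and Hessian 2 X^T X
H = 2 * X.T @ X

eigvals = np.linalg.eigvalsh(H)
print(np.all(eigvals >= -1e-10))       # True: the Hessian is PSD, so f is convex
```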