Linear Algebra

This guide covers the essential linear algebra concepts needed for machine learning, including vectors, matrices, matrix calculus, and their applications.

Basics

Vector: $\mathbf{x} = [x_1, x_2, \dots, x_n]^T$

Matrix: $\mathbf{A} \in \mathbb{R}^{m \times n}$

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$

Transpose: $(\mathbf{A}^T)_{ij} = A_{ji}$

Matrix Multiplication: $\mathbf{C} = \mathbf{AB}$ where $C_{ij} = \sum_k A_{ik}B_{kj}$

Identity Matrix: $\mathbf{I}_n$ where $I_{ij} = 1$ if $i=j$, else $0$

Inverse: $\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$

Solving Matrix Equations: For $\mathbf{A}\mathbf{x} = \mathbf{b}$ where $\mathbf{A}$ is invertible:

$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$$

This is used to isolate variables in matrix equations (e.g., in the normal equations for linear regression).
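
As a quick sanity check, here is a minimal NumPy sketch (the matrix `A` and vector `b` are arbitrary illustrative values). In practice `np.linalg.solve` is preferred over forming $\mathbf{A}^{-1}$ explicitly for numerical stability, although the two agree in exact arithmetic:

```python
import numpy as np

# A small invertible system A x = b
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# Preferred: solve the system directly (no explicit inverse)
x = np.linalg.solve(A, b)

# Equivalent in exact arithmetic: x = A^{-1} b
x_via_inverse = np.linalg.inv(A) @ b

print(x)                                  # [1. 3.]
assert np.allclose(x, x_via_inverse)
assert np.allclose(A @ x, b)
```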

Properties:

  • $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$
  • $(\mathbf{A}^T)^T = \mathbf{A}$
  • $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$
  • $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ (when both inverses exist)
  • $(\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T$
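
These identities are easy to verify numerically; a minimal sketch using random Gaussian matrices (assumed invertible, which holds almost surely):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

assert np.allclose((A @ B).T, B.T @ A.T)                      # (AB)^T = B^T A^T
assert np.allclose((A + B).T, A.T + B.T)                      # (A+B)^T = A^T + B^T
assert np.allclose(np.linalg.inv(A @ B),
                   np.linalg.inv(B) @ np.linalg.inv(A))       # (AB)^{-1} = B^{-1} A^{-1}
assert np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T)    # (A^T)^{-1} = (A^{-1})^T
```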

Block Matrix Multiplication: When matrices are partitioned into blocks, multiplication follows the same rules:

For $\mathbf{A} = [\mathbf{A}_1 \; \mathbf{A}_2]$ (horizontal partition) and $\mathbf{B} = \begin{bmatrix} \mathbf{B}_1 \\ \mathbf{B}_2 \end{bmatrix}$ (vertical partition):

$$\mathbf{AB} = [\mathbf{A}_1 \; \mathbf{A}_2] \begin{bmatrix} \mathbf{B}_1 \\ \mathbf{B}_2 \end{bmatrix} = \mathbf{A}_1\mathbf{B}_1 + \mathbf{A}_2\mathbf{B}_2$$

For partitioned matrices $\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}$ and $\begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix}$:

$$\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22} \end{bmatrix}$$

Key Pattern: Block multiplication works like regular matrix multiplication, treating blocks as if they were scalar entries; the order of the factors must be preserved, since blocks generally do not commute.
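
A short NumPy sketch of the horizontal/vertical partition rule above, with arbitrary block sizes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 5))

# Partition A horizontally and B vertically with matching block sizes
A1, A2 = A[:, :2], A[:, 2:]     # 4x2 and 4x4
B1, B2 = B[:2, :], B[2:, :]     # 2x5 and 4x5

# Block rule: AB = A1 B1 + A2 B2
assert np.allclose(A @ B, A1 @ B1 + A2 @ B2)
```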

Trace and Determinant

Trace: Sum of diagonal elements

$$\text{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}$$

Properties:

  • $\text{tr}(\mathbf{A} + \mathbf{B}) = \text{tr}(\mathbf{A}) + \text{tr}(\mathbf{B})$
  • $\text{tr}(\mathbf{AB}) = \text{tr}(\mathbf{BA})$
  • $\text{tr}(\mathbf{A}^T) = \text{tr}(\mathbf{A})$

Determinant: $\det(\mathbf{A})$ or $|\mathbf{A}|$

For a 2×2 matrix:

$$\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc$$
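
A small NumPy illustration of the trace and determinant definitions, plus the cyclic property $\text{tr}(\mathbf{AB}) = \text{tr}(\mathbf{BA})$ (the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.trace(A))          # 1 + 4 = 5.0
print(np.linalg.det(A))     # 1*4 - 2*3 = -2.0

# tr(AB) = tr(BA) holds even when AB and BA have different shapes
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])   # 2x3
C = np.array([[1.0, 1.0],
              [2.0, 0.0],
              [0.0, 3.0]])        # 3x2
assert np.isclose(np.trace(B @ C), np.trace(C @ B))
```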
Eigenvalues and Eigenvectors

For a square matrix $\mathbf{A}$:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

where $\lambda$ is an eigenvalue and $\mathbf{v} \neq \mathbf{0}$ is a corresponding eigenvector.

Diagonalizable Matrix: A matrix $\mathbf{A}$ is diagonalizable if it can be written as:

$$\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}$$

where:

  • $\mathbf{D}$ is a diagonal matrix containing the eigenvalues of $\mathbf{A}$
  • $\mathbf{P}$ is an invertible matrix whose columns are the eigenvectors of $\mathbf{A}$ (corresponding to the eigenvalues in $\mathbf{D}$)

Conditions for Diagonalizability:

  • $\mathbf{A}$ has $n$ linearly independent eigenvectors (where $n$ is the dimension of $\mathbf{A}$)
  • If $\mathbf{A}$ is (real) symmetric, it is always diagonalizable (Spectral Theorem), and $\mathbf{P}$ can be chosen orthogonal

Properties:

  • $\det(\mathbf{A}) = \prod_i \lambda_i$
  • $\text{tr}(\mathbf{A}) = \sum_i \lambda_i$
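
A minimal NumPy sketch of the eigendecomposition and the two properties above, using an arbitrary symmetric matrix (so diagonalization is guaranteed by the Spectral Theorem):

```python
import numpy as np

# A symmetric matrix is always diagonalizable
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, P = np.linalg.eigh(A)      # columns of P are eigenvectors
D = np.diag(eigvals)

# A = P D P^{-1}  (for symmetric A, P is orthogonal, so P^{-1} = P^T)
assert np.allclose(A, P @ D @ np.linalg.inv(P))

# det(A) = product of eigenvalues, tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), np.prod(eigvals))
assert np.isclose(np.trace(A), np.sum(eigvals))
```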

Norms

Norm — A norm is a function $N : V \longrightarrow [0, +\infty[$ where $V$ is a vector space, such that for all $\boldsymbol{x}, \boldsymbol{y} \in V$, we have:

  • $N(\boldsymbol{x} + \boldsymbol{y}) \leq N(\boldsymbol{x}) + N(\boldsymbol{y})$ (Triangle inequality)
  • $N(a\boldsymbol{x}) = |a|N(\boldsymbol{x})$ for $a$ scalar (Absolute homogeneity)
  • If $N(\boldsymbol{x}) = 0$, then $\boldsymbol{x} = \boldsymbol{0}$ (Positive definiteness)

For $\boldsymbol{x} \in V$, the most commonly used norms are:

| Norm | Notation | Definition | Use case |
|------|----------|------------|----------|
| Manhattan, $L^1$ | $\lVert \boldsymbol{x} \rVert_1$ | $\displaystyle\sum_{i=1}^{n} \lvert x_i \rvert$ | LASSO regularization |
| Euclidean, $L^2$ | $\lVert \boldsymbol{x} \rVert_2$ | $\displaystyle\sqrt{\sum_{i=1}^{n} x_i^2}$ | Ridge regularization |
| $p$-norm, $L^p$ | $\lVert \boldsymbol{x} \rVert_p$ | $\displaystyle\left(\sum_{i=1}^{n} \lvert x_i \rvert^p\right)^{\frac{1}{p}}$ | Hölder inequality |
| Infinity, $L^\infty$ | $\lVert \boldsymbol{x} \rVert_\infty$ | $\displaystyle\max_i \lvert x_i \rvert$ | Uniform convergence |
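
These norms are available directly in NumPy; a small sketch with an arbitrary example vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, 1))        # L1:  |3| + |-4| + |1| = 8.0
print(np.linalg.norm(x, 2))        # L2:  sqrt(9 + 16 + 1) ≈ 5.099
print(np.linalg.norm(x, 3))        # general p-norm with p = 3
print(np.linalg.norm(x, np.inf))   # L-infinity:  max(|3|, |-4|, |1|) = 4.0
```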

Matrix Calculus

Matrix-Vector Product Derivatives:

$$\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{A}} = \mathbf{x}^T$$

$$\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}$$

Matrix Product Derivatives:

$$\frac{\partial \mathbf{AB}}{\partial \mathbf{A}} = \mathbf{B}^T$$

$$\frac{\partial \mathbf{AB}}{\partial \mathbf{B}} = \mathbf{A}^T$$

Chain of Matrix-Vector Products:

For $\mathbf{y} = \mathbf{E}\mathbf{X}\mathbf{w}$ where $\mathbf{E}$ and $\mathbf{X}$ are matrices and $\mathbf{w}$ is a vector:

$$\frac{\partial (\mathbf{E}\mathbf{X}\mathbf{w})}{\partial \mathbf{w}} = \mathbf{E}\mathbf{X}$$

$$\frac{\partial (\mathbf{E}\mathbf{X}\mathbf{w})}{\partial \mathbf{X}} = \mathbf{E}^T$$

$$\frac{\partial (\mathbf{E}\mathbf{X}\mathbf{w})}{\partial \mathbf{E}} = (\mathbf{X}\mathbf{w})^T$$
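
The first identity, $\partial(\mathbf{E}\mathbf{X}\mathbf{w})/\partial\mathbf{w} = \mathbf{E}\mathbf{X}$, is an ordinary Jacobian and can be checked by finite differences; a minimal sketch with arbitrary random shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
E = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
w = rng.standard_normal(5)

def f(w):
    return E @ X @ w          # y = E X w, a vector in R^3

# Finite-difference Jacobian dy/dw, expected to match E X
eps = 1e-6
J = np.zeros((3, 5))
for j in range(5):
    dw = np.zeros(5)
    dw[j] = eps
    J[:, j] = (f(w + dw) - f(w - dw)) / (2 * eps)

assert np.allclose(J, E @ X, atol=1e-4)
```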

Dot Product Derivatives:

For $z = \mathbf{y}^T\mathbf{x}$:

$$\frac{\partial z}{\partial \mathbf{x}} = \mathbf{y}, \quad \frac{\partial z}{\partial \mathbf{y}} = \mathbf{x}$$

Quadratic Form (very important!):

For $f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}$:

$$\frac{\partial (\mathbf{x}^T\mathbf{A}\mathbf{x})}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$$

If $\mathbf{A}$ is symmetric:

$$\frac{\partial (\mathbf{x}^T\mathbf{A}\mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{A}\mathbf{x}$$

Linear Form:

$$\frac{\partial (\mathbf{a}^T\mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}$$

$$\frac{\partial (\mathbf{x}^T\mathbf{a})}{\partial \mathbf{x}} = \mathbf{a}$$
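
Both the quadratic-form and linear-form gradients can be verified with a central-difference check; a minimal sketch (the helper `num_grad` is just an illustrative finite-difference routine, not a library function):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))       # not necessarily symmetric
a = rng.standard_normal(n)
x = rng.standard_normal(n)

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# Quadratic form: gradient of x^T A x is (A + A^T) x
assert np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4)

# Linear form: gradient of a^T x is a
assert np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-4)
```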

Function Linearity in Parameters:

A function $f(\theta)$ is linear in $\theta$ if it can be written as a dot product $\mathbf{a}^{\top}\theta$ (or $\theta^{\top}\mathbf{a}$), where $\mathbf{a}$ is a vector of constants or input features that do not depend on $\theta$.

Example: Consider the model:

$$y = \theta^{\top}\left(\frac{\mathbf{x}}{\|\mathbf{x}\|_{2}}\right)$$

The term $\frac{\mathbf{x}}{\|\mathbf{x}\|_{2}}$ is a transformed feature vector (let's call it $\mathbf{x}'$) that does not depend on $\theta$. Thus, $y = \theta^{\top}\mathbf{x}'$ is a linear function of $\theta$, even though it involves normalization of the input features.

info

Spherical Codes: When $\|\mathbf{x}\|_2 = 1$, the feature vector lies on the unit sphere (unit circle in 2D, unit sphere in 3D, etc.). This normalization is common in machine learning because:

  • It removes scale dependence between features
  • The dot product $\theta^{\top}\mathbf{x}$ directly relates to the angle between $\theta$ and $\mathbf{x}$: $\theta^{\top}\mathbf{x} = \|\theta\|_2 \cos(\angle(\theta, \mathbf{x}))$
  • It simplifies optimization and makes models more robust to feature scaling

Key Point: A function can be linear in the parameters $\theta$ even if it involves non-linear transformations of the input features $\mathbf{x}$, as long as those transformations do not depend on $\theta$.
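
A small sketch of this point, assuming synthetic data and a hypothetical true parameter vector: because the normalization does not depend on $\theta$, ordinary least squares on the transformed features recovers $\theta$ exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = np.array([1.0, -2.0, 0.5])          # hypothetical parameters
X = rng.standard_normal((100, 3))

# Non-linear transformation of the inputs that does NOT depend on theta:
# each row is projected onto the unit sphere
X_prime = X / np.linalg.norm(X, axis=1, keepdims=True)

y = X_prime @ theta_true                          # y = theta^T x'  (linear in theta)

# Because the model is linear in theta, least squares recovers it exactly
theta_hat, *_ = np.linalg.lstsq(X_prime, y, rcond=None)
assert np.allclose(theta_hat, theta_true)
```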

Norm Squared:

$$\frac{\partial (\mathbf{x}^T\mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{x}$$

$$\frac{\partial \|\mathbf{x}\|^2}{\partial \mathbf{x}} = 2\mathbf{x}$$

Hessian Matrix:

For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix is the matrix of second-order partial derivatives:

$$\mathbf{H} = \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

Or more compactly: $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$
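
The Hessian can be approximated entry by entry with central differences; a minimal sketch using $f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}$, whose Hessian is $\mathbf{A} + \mathbf{A}^T$ (the helper `num_hessian` is illustrative, not a library function):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
A = rng.standard_normal((n, n))
f = lambda x: x @ A @ x              # quadratic form, Hessian = A + A^T

def num_hessian(f, x, eps=1e-4):
    """Central-difference Hessian H_ij = d^2 f / (dx_i dx_j)."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = eps, eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

x0 = rng.standard_normal(n)
assert np.allclose(num_hessian(f, x0), A + A.T, atol=1e-3)
```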

Matrix Definiteness

For a symmetric matrix $\mathbf{H}$, we can classify its definiteness based on the quadratic form $\mathbf{z}^T\mathbf{H}\mathbf{z}$:

| Type | Condition | Eigenvalues | Use in ML |
|------|-----------|-------------|-----------|
| Positive Definite (PD) | $\mathbf{z}^T\mathbf{H}\mathbf{z} > 0$ for all $\mathbf{z} \neq \mathbf{0}$ | All $\lambda_i > 0$ | Convex optimization (unique minimum) |
| Positive Semi-Definite (PSD) | $\mathbf{z}^T\mathbf{H}\mathbf{z} \geq 0$ for all $\mathbf{z}$ | All $\lambda_i \geq 0$ | Convex functions |
| Negative Definite (ND) | $\mathbf{z}^T\mathbf{H}\mathbf{z} < 0$ for all $\mathbf{z} \neq \mathbf{0}$ | All $\lambda_i < 0$ | Concave optimization (unique maximum) |
| Negative Semi-Definite (NSD) | $\mathbf{z}^T\mathbf{H}\mathbf{z} \leq 0$ for all $\mathbf{z}$ | All $\lambda_i \leq 0$ | Concave functions |
| Indefinite | Otherwise | Mixed signs | Saddle points |
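
A minimal sketch that classifies a symmetric matrix from the signs of its eigenvalues (the `classify` helper and its tolerance are illustrative choices):

```python
import numpy as np

def classify(H, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):   return "positive definite"
    if np.all(eig >= -tol): return "positive semi-definite"
    if np.all(eig < -tol):  return "negative definite"
    if np.all(eig <= tol):  return "negative semi-definite"
    return "indefinite"

print(classify(np.array([[2.0, 0.0], [0.0, 1.0]])))    # positive definite
print(classify(np.array([[1.0, 0.0], [0.0, 0.0]])))    # positive semi-definite
print(classify(np.array([[1.0, 0.0], [0.0, -1.0]])))   # indefinite
```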

Connection to Convexity:

For a twice-differentiable function $f(\mathbf{x})$:

  • $f$ is convex if and only if $\nabla^2 f(\mathbf{x})$ (the Hessian) is PSD everywhere
  • $f$ is strictly convex if $\nabla^2 f(\mathbf{x})$ is PD everywhere (the converse does not hold in general; $f(x) = x^4$ is strictly convex but has a vanishing second derivative at 0)
  • $f$ is concave if and only if $\nabla^2 f(\mathbf{x})$ is NSD everywhere
  • $f$ is strictly concave if $\nabla^2 f(\mathbf{x})$ is ND everywhere
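
For example, the least-squares objective $f(\theta) = \|\mathbf{X}\theta - \mathbf{y}\|_2^2$ has Hessian $2\mathbf{X}^T\mathbf{X}$, which is always PSD, so the objective is convex; a quick numerical check with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

# Least-squares loss f(theta) = ||X theta - y||^2 has Hessian 2 X^T X
H = 2 * X.T @ X
eigvals = np.linalg.eigvalsh(H)

print(eigvals)                    # all >= 0  ->  H is PSD  ->  f is convex
assert np.all(eigvals >= -1e-10)
```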