# Probability and Statistics

This guide covers the essential probability and statistics concepts needed for machine learning, including distributions, expected values, Bayes' theorem, and maximum likelihood estimation.
## Probability Basics

**Probability Rules**:

$$\begin{align}
P(A \cup B) &= P(A) + P(B) - P(A \cap B) \\
P(A \cap B) &= P(A|B)P(B) = P(B|A)P(A)
\end{align}$$

**Conditional Probability**:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

**Independence**: $P(A \cap B) = P(A)P(B)$
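These rules can be verified by brute-force enumeration of a small sample space. The two-dice events below are illustrative choices, not from the text:

```python
from fractions import Fraction

# Sample space: all 36 equally likely outcomes of rolling two dice
space = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

def prob(event):
    """P(event) = favorable outcomes / total outcomes (exact arithmetic)."""
    return Fraction(sum(event(o) for o in space), len(space))

A = lambda o: o[0] == 6            # first die shows 6
B = lambda o: o[0] + o[1] >= 10    # sum is at least 10
C = lambda o: o[1] % 2 == 0        # second die is even

P_A, P_B = prob(A), prob(B)
P_A_and_B = prob(lambda o: A(o) and B(o))
P_A_or_B = prob(lambda o: A(o) or B(o))

# Union rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P_A_or_B == P_A + P_B - P_A_and_B

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
P_A_given_B = P_A_and_B / P_B
print(P_A_given_B)  # 1/2

# A and C concern different dice, so they satisfy the independence rule
assert prob(lambda o: A(o) and C(o)) == prob(A) * prob(C)
```

Note that $P(A|B) = 1/2 \neq P(A) = 1/6$, so conditioning on $B$ changes the probability of $A$: the two events are dependent, while $A$ and $C$ are independent.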
## Bayes' Theorem

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

**Extended Form**:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)}$$

**ML Context** (posterior, likelihood, prior, evidence):

$$P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}$$
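Plugging numbers into the extended form makes it concrete. The diagnostic-test rates below are made-up values for illustration:

```python
# Hypothetical diagnostic test (numbers chosen for illustration only)
p_disease = 0.01              # prior P(A)
p_pos_given_disease = 0.95    # likelihood P(B|A), true-positive rate
p_pos_given_healthy = 0.05    # P(B|¬A), false-positive rate

# Evidence P(B): the denominator of the extended form
# (law of total probability over A and ¬A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B) via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"{p_disease_given_pos:.3f}")  # 0.161
```

Despite the 95% accurate test, a positive result only implies about a 16% chance of disease, because the prior is so low; this prior-dominates-likelihood effect is exactly what the extended form exposes.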
## Expected Value and Variance

**Expected Value (Mean)**:

$$\mathbb{E}[X] = \mu = \sum_{i} x_i P(x_i) \quad \text{(discrete)}$$

$$\mathbb{E}[X] = \int x f(x)\,dx \quad \text{(continuous)}$$

**Properties**:

- $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$
- $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$

**Variance**:

$$\text{Var}(X) = \sigma^2 = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

**Properties**:

- $\text{Var}(aX + b) = a^2\text{Var}(X)$
- If $X$ and $Y$ are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$

**Standard Deviation**: $\sigma = \sqrt{\text{Var}(X)}$
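A quick numerical sanity check of these properties on sample data (the distributions and constants below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two independently drawn samples (arbitrary distributions for illustration)
x = rng.normal(2.0, 3.0, size=1_000_000)
y = rng.exponential(1.5, size=1_000_000)

a, b = 4.0, -1.0
# E[aX + b] = a E[X] + b  (exact for sample means, up to float error)
assert np.isclose((a * x + b).mean(), a * x.mean() + b)
# Var(aX + b) = a^2 Var(X)
assert np.isclose((a * x + b).var(), a**2 * x.var())
# Linearity of expectation holds regardless of dependence
assert np.isclose((x + y).mean(), x.mean() + y.mean())
# For independent X, Y the variances add (approximately, in finite samples)
assert np.isclose((x + y).var(), x.var() + y.var(), rtol=1e-2)
```

The first three identities hold exactly for sample statistics; the last is only approximate in a finite sample because the empirical covariance of independent draws is near zero but not exactly zero.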
**Covariance**:

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

**Correlation**:

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

where $-1 \leq \rho \leq 1$.
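The covariance and correlation formulas can be checked against numpy's built-ins; the arrays below are arbitrary illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly 2x, so rho should be near 1

# Population covariance via the identity E[XY] - E[X]E[Y]
cov_xy = (x * y).mean() - x.mean() * y.mean()
# np.cov with bias=True uses the same 1/N normalization
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])

# Correlation: Cov(X, Y) / (sigma_X * sigma_Y), always in [-1, 1]
rho = cov_xy / (x.std() * y.std())
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
assert -1 <= rho <= 1
print(rho)
```

Note the `bias=True` flag: numpy defaults to the 1/(N-1) sample estimator, while the identity above is the population (1/N) form, so the normalizations must match.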
## Common Probability Distributions

### Bernoulli Distribution

Models a single binary trial (coin flip).

$$P(X = x) = p^x(1-p)^{1-x}, \quad x \in \{0, 1\}$$

- **Mean**: $\mathbb{E}[X] = p$
- **Variance**: $\text{Var}(X) = p(1-p)$

### Binomial Distribution

Number of successes in $n$ Bernoulli trials.

$$P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$$

- **Mean**: $\mathbb{E}[X] = np$
- **Variance**: $\text{Var}(X) = np(1-p)$

### Gaussian (Normal) Distribution

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$

- **Mean**: $\mathbb{E}[X] = \mu$
- **Variance**: $\text{Var}(X) = \sigma^2$
- **Standard Normal**: $\mathcal{N}(0, 1)$
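The closed-form means and variances above can be sanity-checked by sampling with numpy (the parameter values and tolerances below are arbitrary choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(42)
p, n, N = 0.3, 10, 200_000  # example parameters and sample size

# Bernoulli(p): mean p, variance p(1-p)
bern = rng.binomial(1, p, size=N)
assert abs(bern.mean() - p) < 0.01
assert abs(bern.var() - p * (1 - p)) < 0.01

# Binomial(n, p): mean np, variance np(1-p)
binom = rng.binomial(n, p, size=N)
assert abs(binom.mean() - n * p) < 0.05
assert abs(binom.var() - n * p * (1 - p)) < 0.1

# Gaussian N(mu, sigma^2): mean mu, variance sigma^2
mu, sigma = 1.5, 2.0
gauss = rng.normal(mu, sigma, size=N)
assert abs(gauss.mean() - mu) < 0.02
assert abs(gauss.var() - sigma**2) < 0.05
```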
**Properties**:

- Linear combinations of Gaussians are Gaussian
- If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$

### Multivariate Gaussian

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Notation: $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$

- **Mean**: $\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}$
- **Covariance**: $\text{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}$

### Exponential Distribution

Models the time between events in a Poisson process.

$$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

- **Mean**: $\mathbb{E}[X] = \frac{1}{\lambda}$
- **Variance**: $\text{Var}(X) = \frac{1}{\lambda^2}$

### Uniform Distribution

$$f(x) = \frac{1}{b-a}, \quad x \in [a, b]$$

- **Mean**: $\mathbb{E}[X] = \frac{a+b}{2}$
- **Variance**: $\text{Var}(X) = \frac{(b-a)^2}{12}$

## Maximum Likelihood Estimation (MLE)

Given data $D = \{x_1, ..., x_n\}$ and model parameter $\theta$:
**Likelihood**:

$$L(\theta) = P(D|\theta) = \prod_{i=1}^n P(x_i|\theta)$$

**Log-Likelihood** (easier to work with):

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)$$

**MLE**: Find the $\theta$ that maximizes $\ell(\theta)$:

$$\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)$$

**Example**: For a Gaussian $\mathcal{N}(\mu, \sigma^2)$:

$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i, \quad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2$$
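The closed-form Gaussian estimators can be checked numerically: evaluate the log-likelihood at the estimates and confirm that nearby parameter values score lower. This is a sketch in numpy with arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)  # true mu=2.0, sigma=1.5
n = data.size

# Closed-form MLE for the Gaussian
mu_hat = data.sum() / n
sigma2_hat = ((data - mu_hat) ** 2).sum() / n

def log_likelihood(mu, sigma2):
    """l(theta) = sum_i log N(x_i | mu, sigma2), written out explicitly."""
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - ((data - mu) ** 2).sum() / (2 * sigma2))

# The MLE should score at least as high as nearby parameter values
best = log_likelihood(mu_hat, sigma2_hat)
for d_mu in (-0.1, 0.1):
    assert log_likelihood(mu_hat + d_mu, sigma2_hat) < best
for d_s in (-0.2, 0.2):
    assert log_likelihood(mu_hat, sigma2_hat + d_s) < best

# The estimators coincide with numpy's population mean and variance
assert np.isclose(mu_hat, data.mean())
assert np.isclose(sigma2_hat, data.var())
```

Note that $\hat{\sigma}^2_{MLE}$ divides by $n$, not $n-1$: it matches `data.var()` (numpy's default, the biased population estimator), not the unbiased sample variance `data.var(ddof=1)`.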