# Statistics

## Probability Basics

**Probability Rules:**

$$\begin{align} P(A \cup B) &= P(A) + P(B) - P(A \cap B) \\ P(A \cap B) &= P(A|B)P(B) = P(B|A)P(A) \end{align}$$

**Conditional Probability:**

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

**Independence:** $P(A \cap B) = P(A)P(B)$
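As a quick sanity check of these definitions, here is a minimal simulation sketch. The two-dice example is our own illustration, not from the text: with $A$ = "first die is even" and $B$ = "the dice sum to 7", the events happen to be independent.

```python
# Minimal simulation sketch (assumed example: two fair dice) checking the
# conditional-probability and independence definitions above.
import numpy as np

rng = np.random.default_rng(0)
d1 = rng.integers(1, 7, size=100_000)
d2 = rng.integers(1, 7, size=100_000)

A = (d1 % 2 == 0)      # event A: first die is even
B = (d1 + d2 == 7)     # event B: the dice sum to 7

p_A, p_B, p_AB = A.mean(), B.mean(), (A & B).mean()

print("P(A|B)  =", p_AB / p_B)    # ~0.5, matches P(A∩B)/P(B) and P(A)
print("P(A)P(B)=", p_A * p_B)     # ~P(A∩B): A and B are independent here
print("P(A∩B) =", p_AB)
```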
## Bayes' Theorem

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

**Extended Form:**

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)}$$
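As a numeric illustration of the extended form, the sketch below computes a posterior for a hypothetical diagnostic test; the numbers (1% prevalence, 95% sensitivity, 90% specificity) are placeholders chosen only to show the computation.

```python
# Sketch of the extended form of Bayes' theorem with hypothetical numbers.
def bayes_posterior(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|¬A)P(¬A)]."""
    numerator = p_b_given_a * p_a
    evidence = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return numerator / evidence

# P(disease | positive test) with 1% prevalence, 95% sensitivity,
# 10% false-positive rate:
print(bayes_posterior(p_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.10))
# ≈ 0.088: even after a positive result, the disease probability is under 9%
```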
**ML Context (Parameter Estimation):**

In machine learning, we want to estimate parameters $\theta$ given observed data $D$:

$$p(\theta|D) = \frac{p(D|\theta) \cdot p(\theta)}{p(D)}$$

where:

- $p(\theta|D)$ is the **posterior** (probability of the parameters given the data)
- $p(D|\theta)$ is the **likelihood** (probability of observing the data given the parameters)
- $p(\theta)$ is the **prior** (belief about the parameters before seeing the data)
- $p(D)$ is the **evidence**, or marginal likelihood (a normalizing constant)
**Law of Total Probability:**

$$p(D) = \int p(D|\theta) \cdot p(\theta) \, d\theta$$
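The integral rarely has a closed form, but it can be approximated numerically. The sketch below is our own example, assuming 7 heads in 10 coin flips with a uniform prior on $\theta$, where the exact answer is the Beta function $B(8, 4) = 1/1320$.

```python
# Grid approximation of the evidence p(D) = ∫ p(D|θ)p(θ) dθ for an assumed
# example: 7 heads in 10 flips, uniform prior on θ ∈ [0, 1].
import numpy as np

theta = np.linspace(0, 1, 10_001)           # grid over the parameter
prior = np.ones_like(theta)                 # uniform prior: p(θ) = 1 on [0, 1]
likelihood = theta**7 * (1 - theta)**3      # p(D|θ) for one specific sequence

# Averaging over a uniform grid on [0, 1] approximates the integral.
evidence = (likelihood * prior).mean()
print(evidence)   # ≈ 7.576e-4; the exact value is B(8, 4) = 1/1320
```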
## Maximum A Posteriori (MAP)

MAP estimation finds the most likely parameter values given the data:

$$\hat{\theta}_{MAP} = \arg\max_\theta \, p(\theta|D) = \arg\max_\theta \, \frac{p(D|\theta) \cdot p(\theta)}{p(D)}$$

Since $p(D)$ does not depend on $\theta$, this simplifies to:

$$\hat{\theta}_{MAP} = \arg\max_\theta \, p(D|\theta) \cdot p(\theta)$$

Taking the logarithm (for numerical stability):

$$\hat{\theta}_{MAP} = \arg\max_\theta \, [\log p(D|\theta) + \log p(\theta)]$$
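To make this concrete, here is a sketch for an assumed coin-flip example with a Beta prior on $\theta$; the closed forms below follow from setting the derivative of $\log p(D|\theta) + \log p(\theta)$ to zero, and the prior parameters are hypothetical.

```python
# MAP vs. MLE sketch for an assumed Bernoulli/Beta example: h heads in n
# flips, Beta(a, b) prior on θ.
h, n = 7, 10          # observed heads, total flips
a, b = 2.0, 2.0       # hypothetical Beta prior pulling θ toward 0.5

theta_mle = h / n                             # argmax of log p(D|θ)
theta_map = (h + a - 1) / (n + a + b - 2)     # argmax including the Beta prior

print(theta_mle)   # 0.70
print(theta_map)   # ≈ 0.667, shrunk toward the prior mean 0.5
```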
**Comparison with MLE:**

- **MLE**: $\hat{\theta}_{MLE} = \arg\max_\theta \, p(D|\theta)$ (ignores the prior)
- **MAP**: $\hat{\theta}_{MAP} = \arg\max_\theta \, p(D|\theta) \cdot p(\theta)$ (incorporates the prior)
- If the prior is uniform, MAP = MLE.

## Application to Naive Bayes Classification

For classification with Naive Bayes, we want to find:
$$\hat{y} = \arg\max_{y \in \{0,1\}} \, p(y|x) = \arg\max_{y \in \{0,1\}} \, \frac{p(x|y) \cdot p(y)}{p(x)}$$

Since $p(x)$ is constant across classes, this becomes:

$$\hat{y} = \arg\max_{y \in \{0,1\}} \, p(x|y) \cdot p(y)$$

This is MAP estimation where:

- $y$ is the "parameter" we are estimating (the class label)
- $p(y)$ is the prior (the class prior, e.g., $p(y=1) = \phi$)
- $p(x|y)$ is the likelihood, computed using the Naive Bayes assumption: $p(x|y) = \prod_{j=1}^d p(x_j|y)$
- We pick the class that maximizes the posterior.
**Why MAP works for Naive Bayes:**

- We treat the class label $y$ as the "parameter" to estimate.
- The prior $p(y)$ represents the class frequencies in the training data.
- The likelihood $p(x|y)$ comes from the Naive Bayes assumption.
- We choose the class with the maximum posterior probability.

**Extended Form (binary classification):**

$$p(y=1|x) = \frac{p(x|y=1) \cdot p(y=1)}{p(x|y=1) \cdot p(y=1) + p(x|y=0) \cdot p(y=0)}$$

where $p(y=1) = \phi$ and $p(y=0) = 1-\phi$.
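The sketch below puts the pieces together for binary features: log prior plus summed log likelihoods under the Naive Bayes assumption, then an argmax over classes. The class prior $\phi$ and per-feature likelihoods are hypothetical placeholders, not values from the text.

```python
# Minimal Bernoulli Naive Bayes sketch with hypothetical parameters:
# class prior φ = p(y=1) and per-feature likelihoods p(x_j = 1 | y).
import numpy as np

phi = 0.4                                   # p(y = 1), assumed
p_x_given_y1 = np.array([0.8, 0.1, 0.6])    # p(x_j = 1 | y = 1), assumed
p_x_given_y0 = np.array([0.3, 0.5, 0.2])    # p(x_j = 1 | y = 0), assumed

def log_posterior_scores(x):
    """Unnormalized log p(y|x) for y = 0 and y = 1 under the NB assumption."""
    ll1 = np.sum(np.log(np.where(x == 1, p_x_given_y1, 1 - p_x_given_y1)))
    ll0 = np.sum(np.log(np.where(x == 1, p_x_given_y0, 1 - p_x_given_y0)))
    return np.array([ll0 + np.log(1 - phi), ll1 + np.log(phi)])

x = np.array([1, 0, 1])
print(np.argmax(log_posterior_scores(x)))   # MAP class prediction for x
```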
## Expected Value and Variance

**Expected Value (Mean):**

$$\mathbb{E}[X] = \mu = \sum_{i} x_i P(x_i) \quad \text{(discrete)}$$

$$\mathbb{E}[X] = \int x f(x) \, dx \quad \text{(continuous)}$$

**Properties:**

- $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$
- $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
- Jensen's inequality: $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$ for a convex function $f$.

**Variance:**

$$\text{Var}(X) = \sigma^2 = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

**Properties:**

- $\text{Var}(aX + b) = a^2\text{Var}(X)$
- If $X$ and $Y$ are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
- Variance is always non-negative: $\text{Var}(X) = \mathbb{E}[X^{2}] - (\mathbb{E}[X])^{2} \geq 0$.

**Standard Deviation:** $\sigma = \sqrt{\text{Var}(X)}$
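A quick empirical check of these properties (our own sketch, with simulated data and arbitrary constants $a$, $b$):

```python
# Numpy check of the expectation and variance properties listed above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)
a, b = 4.0, -1.0

print(np.mean(a * x + b), a * np.mean(x) + b)   # E[aX+b] = aE[X] + b
print(np.var(a * x + b), a**2 * np.var(x))      # Var(aX+b) = a²Var(X)
```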
**Covariance:**

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

**Correlation:**

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

where $-1 \leq \rho \leq 1$.
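The same definitions, computed directly and cross-checked against numpy's built-ins (a sketch with simulated, correlated data of our choosing):

```python
# Covariance and correlation, manual formulas vs. numpy built-ins.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 0.7 * x + rng.normal(size=100_000)     # y is positively correlated with x

cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - E[X]E[Y]
rho = cov_xy / (np.std(x) * np.std(y))

print(cov_xy, np.cov(x, y, ddof=0)[0, 1])   # matches numpy's covariance
print(rho, np.corrcoef(x, y)[0, 1])         # ρ lies in [-1, 1]
```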
## Common Probability Distributions

| Distribution | PDF or PMF | Support / Values | Mean | Variance |
|---|---|---|---|---|
| Bernoulli$(p)$ | $\begin{cases} p, & x=1 \\ 1-p, & x=0 \end{cases}$ | $x \in \{0, 1\}$ | $p$ | $p(1-p)$ |
| Binomial$(n, p)$ | $\binom{n}{k} p^k (1-p)^{n-k}$ | $k = 0, 1, \ldots, n$ | $np$ | $np(1-p)$ |
| Geometric$(p)$ | $(1-p)^{k-1}p$ | $k = 1, 2, \ldots$ | $\dfrac{1}{p}$ | $\dfrac{1-p}{p^2}$ |
| Poisson$(\lambda)$ | $\dfrac{e^{-\lambda}\lambda^k}{k!}$ | $k = 0, 1, \ldots$ | $\lambda$ | $\lambda$ |
| Uniform$(a, b)$ | $\dfrac{1}{b-a}$ | $x \in [a, b]$ | $\dfrac{a+b}{2}$ | $\dfrac{(b-a)^2}{12}$ |
| Gaussian$(\mu, \sigma^2)$ | $\dfrac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $x \in (-\infty, \infty)$ | $\mu$ | $\sigma^2$ |
| Exponential$(\lambda)$ | $\lambda e^{-\lambda x}$ | $x \geq 0$ | $\dfrac{1}{\lambda}$ | $\dfrac{1}{\lambda^2}$ |
**Notes:**

- The Geometric distribution here uses the convention "number of trials until the first success," so $k$ starts at 1.
- The Poisson distribution models counts of events in a fixed interval.
- The Uniform distribution has constant density over $[a, b]$.
- The Gaussian is also called the normal distribution.
- The Exponential distribution is often used to model waiting times.
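As a spot-check of the table's mean and variance columns, the sketch below samples from a few of the distributions with numpy (parameter values are ours; note that numpy's geometric sampler uses the same trials-until-first-success convention, and its exponential sampler is parameterized by the scale $1/\lambda$):

```python
# Sampling spot-check of the mean/variance formulas in the table above.
import numpy as np

rng = np.random.default_rng(0)
n, lam, p = 1_000_000, 3.0, 0.25

pois = rng.poisson(lam, size=n)
geom = rng.geometric(p, size=n)          # trials-until-first-success, k ≥ 1
expo = rng.exponential(1 / lam, size=n)  # numpy uses scale = 1/λ

print(pois.mean(), pois.var())   # both ≈ λ = 3
print(geom.mean(), geom.var())   # ≈ 1/p = 4 and (1-p)/p² = 12
print(expo.mean(), expo.var())   # ≈ 1/λ ≈ 0.333 and 1/λ² ≈ 0.111
```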
**Multivariate Gaussian:**

$$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

PDF:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$

Mean: $\boldsymbol{\mu}$, Covariance: $\boldsymbol{\Sigma}$
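The density formula can be checked directly against `scipy.stats.multivariate_normal` (a sketch assuming scipy is available; the mean, covariance, and query point are arbitrary example values):

```python
# Multivariate Gaussian PDF: manual formula vs. scipy, on example values.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.8])

n = len(mu)
diff = x - mu
coef = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
manual = coef * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(manual, multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # identical
```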
## Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is a method for estimating the parameters of a statistical model by finding the parameter values that maximize the probability of observing the given data.

### Intuition

Given data $D = \{x_1, ..., x_n\}$, MLE asks: "Which parameter values $\theta$ make this observed data most likely?"
For example, if you flip a coin 10 times and get 7 heads, MLE would estimate $p = 0.7$, because that value makes your observed outcome most probable.
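A tiny grid-search sketch of this coin example: the Bernoulli log-likelihood for 7 heads in 10 flips peaks at $p = 0.7$.

```python
# Grid search over p for the coin example: 7 heads, 3 tails.
import numpy as np

p = np.linspace(0.001, 0.999, 999)            # avoid log(0) at the endpoints
log_lik = 7 * np.log(p) + 3 * np.log(1 - p)   # log p(D|p), up to ordering

print(p[np.argmax(log_lik)])   # 0.7, the MLE
```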
**Likelihood Function:**

The likelihood of parameters $\theta$ given data $D$ (assuming i.i.d. observations):

$$L(\theta) = P(D|\theta) = \prod_{i=1}^n P(x_i|\theta)$$

**Log-Likelihood** (easier to work with):

Taking the logarithm converts products to sums and is monotonic (so it does not change the argmax):

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)$$
**MLE Estimate:**

Find the $\theta$ that maximizes the log-likelihood:

$$\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)$$

### Finding the MLE

**Method 1: Set the derivative to zero**
$$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$

Solve for $\theta$ (when a closed-form solution exists).
**Method 2: Numerical optimization**

Use gradient descent or another optimization algorithm when no closed-form solution exists.
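As a sketch of Method 2 (assuming scipy is available), we can minimize the negative log-likelihood numerically. Gaussian data is used here only so the optimizer's answer can be checked against the known closed form from the next example.

```python
# Numerical MLE: minimize the negative log-likelihood with scipy.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)   # simulated example data

def neg_log_lik(params):
    mu, log_sigma = params          # optimize log σ so that σ stays positive
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))   # ≈ sample mean and (biased) sample std
```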
### Example: Gaussian Distribution

For data from $\mathcal{N}(\mu, \sigma^2)$, the MLE estimates are:

$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{(sample mean)}$$

$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2 \quad \text{(sample variance)}$$

Note that $\hat{\sigma}^2_{MLE}$ divides by $n$ rather than $n-1$, so it is a biased estimator of the variance.
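A one-line check of these closed forms (our sketch, on the same kind of simulated data as above); conveniently, `np.var` defaults to the $1/n$ (MLE) normalization:

```python
# Closed-form Gaussian MLE with numpy.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

mu_hat = np.mean(data)    # (1/n) Σ x_i
var_hat = np.var(data)    # (1/n) Σ (x_i - μ̂)², ddof=0 by default
print(mu_hat, var_hat)    # matches the numerical optimizer's result above
```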
### Properties of MLE

- ✅ **Consistent**: converges to the true parameter as $n \to \infty$
- ✅ **Asymptotically normal**: the estimator's distribution approaches a Gaussian for large $n$
- ✅ **Asymptotically efficient**: achieves the lowest possible variance (the Cramér-Rao bound)
- ✅ **Invariant**: if $\hat{\theta}_{MLE}$ is the MLE for $\theta$, then $g(\hat{\theta}_{MLE})$ is the MLE for $g(\theta)$
## MLE vs MAP

| Aspect | MLE | MAP |
|---|---|---|
| Formula | $\arg\max_\theta \, p(D\|\theta)$ | $\arg\max_\theta \, p(D\|\theta) \cdot p(\theta)$ |
| Prior | No prior (implicitly uniform) | Incorporates prior $p(\theta)$ |
| Interpretation | Most likely parameters given the data | Most likely parameters given the data and prior belief |
| Regularization | None | Prior acts as regularization |
| Special case | MAP with a uniform prior | Includes MLE as a special case |
**Connection:** MLE is equivalent to MAP with a uniform (non-informative) prior on $\theta$.