
Statistics

Probability Basics

Probability Rules:

$$\begin{align} P(A \cup B) &= P(A) + P(B) - P(A \cap B) \\ P(A \cap B) &= P(A|B)P(B) = P(B|A)P(A) \end{align}$$

Conditional Probability:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Independence: $P(A \cap B) = P(A)P(B)$

Bayes' Theorem

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

Extended Form:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)}$$
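
As a quick numeric illustration (the prevalence, sensitivity, and false-positive numbers below are made up), here is a minimal sketch applying the extended form to a diagnostic test:

```python
# Extended form of Bayes' theorem on a made-up diagnostic test.
# A = "has condition", B = "test is positive".
p_A = 0.01             # assumed prior P(A): 1% prevalence
p_B_given_A = 0.95     # assumed P(B|A): test sensitivity
p_B_given_not_A = 0.05 # assumed P(B|¬A): false-positive rate

evidence = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)
posterior = p_B_given_A * p_A / evidence
print(f"P(A|B) = {posterior:.4f}")  # ≈ 0.1610: a positive test is far from conclusive
```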

ML Context (Parameter Estimation):

In machine learning, we want to estimate parameters $\theta$ given observed data $D$:

$$p(\theta|D) = \frac{p(D|\theta) \cdot p(\theta)}{p(D)}$$

where:

  • $p(\theta|D)$ is the posterior (probability of the parameters given the data)
  • $p(D|\theta)$ is the likelihood (probability of observing the data given the parameters)
  • $p(\theta)$ is the prior (belief about the parameters before seeing the data)
  • $p(D)$ is the evidence or marginal likelihood (a normalizing constant)

Law of Total Probability:

$$p(D) = \int p(D|\theta) \cdot p(\theta) \, d\theta$$
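
For a one-dimensional $\theta$, this integral can be checked numerically. A minimal sketch, assuming a Bernoulli likelihood (7 heads in 10 flips) with a uniform prior on $[0, 1]$:

```python
import numpy as np

# Evidence p(D) = ∫ p(D|θ) p(θ) dθ with a uniform prior p(θ) = 1 on [0, 1].
theta = np.linspace(0.0, 1.0, 10_001)
likelihood = theta**7 * (1 - theta)**3        # p(D|θ) for a specific sequence of 7 heads, 3 tails
dtheta = theta[1] - theta[0]
evidence = np.sum(likelihood * 1.0) * dtheta  # Riemann-sum approximation of the integral
print(f"p(D) ≈ {evidence:.6f}")               # analytic value: B(8, 4) = 7!·3!/11! ≈ 0.000758
```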

Maximum A Posteriori (MAP)

MAP estimation finds the most likely parameter values given the data:

$$\hat{\theta}_{MAP} = \arg\max_\theta \, p(\theta|D) = \arg\max_\theta \, \frac{p(D|\theta) \cdot p(\theta)}{p(D)}$$

Since $p(D)$ doesn't depend on $\theta$, this simplifies to:

$$\hat{\theta}_{MAP} = \arg\max_\theta \, p(D|\theta) \cdot p(\theta)$$

Taking the logarithm (for numerical stability):

$$\hat{\theta}_{MAP} = \arg\max_\theta \, [\log p(D|\theta) + \log p(\theta)]$$

Comparison with MLE:

  • MLE: $\hat{\theta}_{MLE} = \arg\max_\theta \, p(D|\theta)$ (ignores the prior)
  • MAP: $\hat{\theta}_{MAP} = \arg\max_\theta \, p(D|\theta) \cdot p(\theta)$ (incorporates the prior)
  • If the prior is uniform, MAP = MLE (see the sketch below)
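
To make the contrast concrete, here is a minimal sketch using a coin-flip model with a Beta prior (the Beta(2, 2) choice is illustrative, not prescribed above):

```python
import numpy as np

# 7 heads in 10 flips; estimate the head probability θ by grid search.
heads, n = 7, 10
alpha, beta = 2.0, 2.0  # assumed Beta(2, 2) prior, mildly favoring θ ≈ 0.5

theta = np.linspace(1e-6, 1 - 1e-6, 100_001)
log_lik = heads * np.log(theta) + (n - heads) * np.log(1 - theta)
log_prior = (alpha - 1) * np.log(theta) + (beta - 1) * np.log(1 - theta)

theta_mle = theta[np.argmax(log_lik)]              # argmax of log p(D|θ)
theta_map = theta[np.argmax(log_lik + log_prior)]  # argmax of log p(D|θ) + log p(θ)
print(f"MLE: {theta_mle:.3f}")  # ≈ 0.700
print(f"MAP: {theta_map:.3f}")  # ≈ 0.667, pulled toward the prior; a Beta(1, 1) prior gives MAP = MLE
```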

Application to Naive Bayes Classification

For classification with Naive Bayes, we want to find:

$$\hat{y} = \arg\max_{y \in \{0,1\}} \, p(y|x) = \arg\max_{y \in \{0,1\}} \, \frac{p(x|y) \cdot p(y)}{p(x)}$$

Since $p(x)$ is constant for all classes, this becomes:

$$\hat{y} = \arg\max_{y \in \{0,1\}} \, p(x|y) \cdot p(y)$$

This is MAP estimation where:

  • $y$ is the "parameter" we're estimating (the class label)
  • $p(y)$ is the prior (the class prior, e.g., $p(y=1) = \phi$)
  • $p(x|y)$ is the likelihood (computed using the Naive Bayes assumption: $p(x|y) = \prod_{j=1}^d p(x_j|y)$)
  • We pick the class that maximizes the posterior

Why MAP works for Naive Bayes:

  1. We treat the class label $y$ as the "parameter" to estimate
  2. The prior $p(y)$ represents the class frequencies in the training data
  3. The likelihood $p(x|y)$ comes from the Naive Bayes assumption
  4. We choose the class with the maximum posterior probability

Extended Form (binary classification):

$$p(y=1|x) = \frac{p(x|y=1) \cdot p(y=1)}{p(x|y=1) \cdot p(y=1) + p(x|y=0) \cdot p(y=0)}$$

where $p(y=1) = \phi$ and $p(y=0) = 1 - \phi$.
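
A minimal sketch of this decision rule for binary features (the toy data and the Laplace smoothing are illustrative choices, not part of the notes above):

```python
import numpy as np

# Toy training set: 6 examples, 3 binary features, labels y ∈ {0, 1}.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1],
              [0, 1, 0], [0, 0, 0], [1, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

phi = y.mean()  # class prior p(y=1) = φ
# Per-feature Bernoulli likelihoods p(x_j = 1 | y), with Laplace smoothing.
p1 = (X[y == 1].sum(axis=0) + 1) / (np.sum(y == 1) + 2)
p0 = (X[y == 0].sum(axis=0) + 1) / (np.sum(y == 0) + 2)

def predict(x):
    # Compare log p(x|y) + log p(y) under the factorization p(x|y) = Π_j p(x_j|y).
    log_post1 = np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1)) + np.log(phi)
    log_post0 = np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0)) + np.log(1 - phi)
    return int(log_post1 > log_post0)

print(predict(np.array([1, 0, 1])))  # 1: resembles the y=1 examples
```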

Expected Value and Variance

Expected Value (Mean):

$$\mathbb{E}[X] = \mu = \sum_{i} x_i P(x_i) \quad \text{(discrete)}$$
$$\mathbb{E}[X] = \int x f(x) \, dx \quad \text{(continuous)}$$

Properties:

  • $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$
  • $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
  • Jensen's inequality: $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$ for a convex function $f$.

Variance:

$$\text{Var}(X) = \sigma^2 = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

Properties:

  • $\text{Var}(aX + b) = a^2\text{Var}(X)$
  • If $X$ and $Y$ are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
  • Variance is always non-negative: $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \geq 0$.

Standard Deviation: $\sigma = \sqrt{\text{Var}(X)}$

Covariance:

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

Correlation:

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

where $-1 \leq \rho_{XY} \leq 1$.
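
These quantities map directly onto NumPy routines; a minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # E[X] = 2, Var(X) = 9
y = 0.5 * x + rng.normal(size=100_000)            # constructed to be correlated with x

print(np.mean(x), np.var(x))    # ≈ 2 and ≈ 9 (np.var divides by n by default)
print(np.std(x))                # ≈ 3, the standard deviation
print(np.cov(x, y)[0, 1])       # Cov(X, Y) ≈ 0.5 · Var(X) = 4.5
print(np.corrcoef(x, y)[0, 1])  # ρ_XY, always in [-1, 1]
```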

Common Probability Distributions

| Distribution | PDF or PMF | Support / Values | Mean | Variance |
| --- | --- | --- | --- | --- |
| Bernoulli$(p)$ | $\begin{cases} p, & x=1 \\ 1-p, & x=0 \end{cases}$ | $x \in \{0, 1\}$ | $p$ | $p(1-p)$ |
| Binomial$(n, p)$ | $\displaystyle \binom{n}{k} p^k (1-p)^{n-k}$ | $k = 0, 1, ..., n$ | $np$ | $np(1-p)$ |
| Geometric$(p)$ | $(1-p)^{k-1}p$ | $k = 1, 2, ...$ | $\dfrac{1}{p}$ | $\dfrac{1-p}{p^2}$ |
| Poisson$(\lambda)$ | $\dfrac{e^{-\lambda}\lambda^k}{k!}$ | $k = 0, 1, ...$ | $\lambda$ | $\lambda$ |
| Uniform$(a, b)$ | $\dfrac{1}{b-a}$ | $x \in [a, b]$ | $\dfrac{a+b}{2}$ | $\dfrac{(b-a)^2}{12}$ |
| Gaussian$(\mu, \sigma^2)$ | $\dfrac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $x \in (-\infty, \infty)$ | $\mu$ | $\sigma^2$ |
| Exponential$(\lambda)$ | $\lambda e^{-\lambda x}$ | $x \geq 0$ | $\dfrac{1}{\lambda}$ | $\dfrac{1}{\lambda^2}$ |

Notes:

  • The Geometric distribution here uses the convention "number of trials until the first success," so $k$ starts at 1.
  • The Poisson distribution models counts of events in a fixed interval.
  • The Uniform distribution has constant density over $[a, b]$.
  • The Gaussian is also called the normal distribution.
  • The Exponential distribution is often used to model waiting times.
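
One way to double-check the Mean and Variance columns is against scipy.stats, which returns each distribution's closed-form moments; a minimal sketch (parameter values chosen arbitrarily):

```python
from scipy import stats

# Each call returns (mean, variance); compare with the table's formulas.
print(stats.binom.stats(n=10, p=0.3))       # (np, np(1-p)) = (3.0, 2.1)
print(stats.geom.stats(p=0.25))             # (1/p, (1-p)/p²) = (4.0, 12.0)
print(stats.poisson.stats(mu=4.0))          # (λ, λ) = (4.0, 4.0)
print(stats.uniform.stats(loc=2, scale=3))  # Uniform(2, 5): ((a+b)/2, (b-a)²/12) = (3.5, 0.75)
print(stats.expon.stats(scale=0.5))         # Exponential(λ=2): (1/λ, 1/λ²) = (0.5, 0.25)
```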

For multivariate Gaussian:

  • $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
  • PDF: $f(\mathbf{x}) = \dfrac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$
  • Mean: $\boldsymbol{\mu}$, Covariance: $\boldsymbol{\Sigma}$
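
The multivariate PDF can be evaluated directly from the formula above, or via scipy; a minimal sketch comparing the two (the particular $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.5])

# Direct evaluation of the density formula.
n = len(mu)
diff = x - mu
pdf = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
      / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

print(pdf)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match
```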

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is a method for estimating parameters of a statistical model by finding the parameter values that maximize the probability of observing the given data.

Intuition

Given data $D = \{x_1, ..., x_n\}$, MLE asks: "Which parameter values $\theta$ make this observed data most likely?"

For example, if you flip a coin 10 times and get 7 heads, MLE would estimate $p = 0.7$ because that value makes your observed outcome most probable.
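
You can verify this by evaluating the log-likelihood over a grid of candidate $p$ values; a minimal sketch:

```python
import numpy as np

# 7 heads out of 10 flips: which p makes this outcome most likely?
p = np.linspace(0.01, 0.99, 9_801)
log_lik = 7 * np.log(p) + 3 * np.log(1 - p)  # log of p^7 (1-p)^3
print(p[np.argmax(log_lik)])                 # ≈ 0.7
```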

Mathematical Formulation

Likelihood Function:

The likelihood of parameters $\theta$ given data $D$ (assuming i.i.d. observations):

$$L(\theta) = P(D|\theta) = \prod_{i=1}^n P(x_i|\theta)$$

Log-Likelihood (easier to work with):

Taking the logarithm converts products to sums; because $\log$ is monotonic, this doesn't change the argmax:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)$$

MLE Estimate:

Find the $\theta$ that maximizes the log-likelihood:

$$\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)$$

Finding MLE

Method 1: Set derivative to zero

$$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$

Solve for $\theta$ (if a closed-form solution exists).
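
As a worked instance (the coin example from the intuition section), for $n$ Bernoulli trials with $k$ successes:

$$\ell(p) = k \log p + (n-k)\log(1-p), \qquad \frac{\partial \ell}{\partial p} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Rightarrow\; \hat{p}_{MLE} = \frac{k}{n}$$

With $k = 7$ and $n = 10$, this recovers $\hat{p} = 0.7$.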

Method 2: Numerical optimization

Use gradient descent or other optimization algorithms when no closed-form solution exists.
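
A minimal sketch of the numerical route, assuming a Gamma model, whose shape parameter has no closed-form MLE (the synthetic data are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(0)
data = rng.gamma(shape=3.0, scale=2.0, size=2_000)

def neg_log_lik(params):
    log_shape, log_scale = params  # optimize in log-space to keep both parameters positive
    return -np.sum(gamma.logpdf(data, a=np.exp(log_shape), scale=np.exp(log_scale)))

result = minimize(neg_log_lik, x0=[0.0, 0.0])  # maximize ℓ(θ) by minimizing -ℓ(θ)
shape_hat, scale_hat = np.exp(result.x)
print(shape_hat, scale_hat)  # ≈ 3 and ≈ 2
```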

Example: Gaussian Distribution

For data from $\mathcal{N}(\mu, \sigma^2)$, the MLE estimates are:

$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{(sample mean)}$$
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2 \quad \text{(sample variance)}$$

Note that the variance estimate divides by $n$ rather than $n-1$, so it is the biased sample variance.
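
These closed forms match NumPy's defaults; a minimal check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)

mu_hat = x.sum() / len(x)                     # (1/n) Σ x_i
var_hat = ((x - mu_hat) ** 2).sum() / len(x)  # (1/n) Σ (x_i - μ̂)²

assert np.isclose(mu_hat, np.mean(x))
assert np.isclose(var_hat, np.var(x))  # np.var also divides by n (ddof=0)
print(mu_hat, var_hat)
```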

Properties of MLE

Consistent: Converges to the true parameter as $n \to \infty$
Asymptotically normal: The estimator's distribution approaches a Gaussian for large $n$
Asymptotically efficient: Achieves the lowest possible asymptotic variance (the Cramér-Rao bound)
Invariant: If $\hat{\theta}_{MLE}$ is the MLE for $\theta$, then $g(\hat{\theta}_{MLE})$ is the MLE for $g(\theta)$

MLE vs MAP

| Aspect | MLE | MAP |
| --- | --- | --- |
| Formula | $\arg\max_\theta \, p(D\|\theta)$ | $\arg\max_\theta \, p(D\|\theta) \cdot p(\theta)$ |
| Prior | No prior (implicitly uniform) | Incorporates the prior $p(\theta)$ |
| Interpretation | Most likely parameters given the data | Most likely parameters given the data and prior belief |
| Regularization | No regularization | The prior acts as regularization |
| Special case | MAP with a uniform prior = MLE | Includes MLE as a special case |

Connection: MLE is equivalent to MAP with a uniform (non-informative) prior on $\theta$.