Probability and Statistics

This guide covers the essential probability and statistics concepts needed for machine learning, including distributions, expected values, Bayes' theorem, and maximum likelihood estimation.

Probability Basics

Probability Rules:

\begin{align} P(A \cup B) &= P(A) + P(B) - P(A \cap B) \\ P(A \cap B) &= P(A|B)P(B) = P(B|A)P(A) \end{align}

Conditional Probability:

P(A|B) = \frac{P(A \cap B)}{P(B)}

Independence: P(A \cap B) = P(A)P(B)
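
A minimal Python sketch (not part of the original formulas; the 2x2 joint table below is a made-up example) that checks these rules numerically:

```python
import numpy as np

# joint[a, b] = P(A=a, B=b); values are hypothetical, chosen for illustration
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])

p_a = joint[1, :].sum()          # marginal P(A) = P(A=1)
p_b = joint[:, 1].sum()          # marginal P(B) = P(B=1)
p_a_and_b = joint[1, 1]          # P(A ∩ B)

# Union rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_a_or_b = p_a + p_b - p_a_and_b

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = p_a_and_b / p_b

# Product rule: P(A ∩ B) = P(A|B) P(B)
assert np.isclose(p_a_and_b, p_a_given_b * p_b)

print(f"P(A ∪ B) = {p_a_or_b:.2f}")
print(f"P(A|B)   = {p_a_given_b:.2f}")
# Independence check: does P(A ∩ B) equal P(A) P(B)?
print(f"independent? {np.isclose(p_a_and_b, p_a * p_b)}")
```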

Bayes' Theorem

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

Extended Form:

P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)}
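
A short worked example of the extended form; the prevalence and test accuracies below are hypothetical numbers chosen for illustration:

```python
p_disease = 0.01              # prior P(A): assumed 1% prevalence
p_pos_given_disease = 0.95    # likelihood P(B|A): test sensitivity
p_pos_given_healthy = 0.05    # false-positive rate P(B|¬A)

# Evidence via the law of total probability:
# P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ≈ 0.161
```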

ML Context (posterior, likelihood, prior, evidence):

P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}
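
To make the posterior/likelihood/prior/evidence roles concrete, here is a hedged sketch that computes a posterior over a coin's bias θ on a discrete grid; the data (7 heads in 10 flips) and the uniform prior are assumptions for the demo:

```python
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)          # candidate parameter values
prior = np.ones_like(thetas) / len(thetas)    # uniform prior P(theta)

heads, flips = 7, 10
# Likelihood P(D|theta) for i.i.d. Bernoulli flips
likelihood = thetas**heads * (1 - thetas)**(flips - heads)

# Evidence P(D) normalizes the posterior
evidence = np.sum(likelihood * prior)
posterior = likelihood * prior / evidence     # P(theta|D)

print(f"posterior mode: theta = {thetas[np.argmax(posterior)]:.2f}")  # ≈ 0.70
```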

Expected Value and Variance

Expected Value (Mean):

\mathbb{E}[X] = \mu = \sum_{i} x_i P(x_i) \quad \text{(discrete)}
\mathbb{E}[X] = \int x f(x) dx \quad \text{(continuous)}

Properties:

  • \mathbb{E}[aX + b] = a\mathbb{E}[X] + b
  • \mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]
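
Both the definition and the linearity properties are easy to check numerically; a minimal sketch with a made-up three-point pmf:

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])
ps = np.array([0.2, 0.5, 0.3])   # hypothetical pmf; must sum to 1

e_x = np.sum(xs * ps)                    # E[X] = sum_i x_i P(x_i)
a, b = 3.0, -1.0
e_ax_b = np.sum((a * xs + b) * ps)       # E[aX + b], computed directly

# Linearity: E[aX + b] = a E[X] + b
assert np.isclose(e_ax_b, a * e_x + b)
print(f"E[X] = {e_x:.2f}, E[{a}X + {b}] = {e_ax_b:.2f}")
```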

Variance:

\text{Var}(X) = \sigma^2 = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

Properties:

  • \text{Var}(aX + b) = a^2\text{Var}(X)
  • If X and Y are independent: \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)

Standard Deviation: \sigma = \sqrt{\text{Var}(X)}
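
The identity Var(X) = E[X²] - (E[X])² and the scaling property can be verified exactly on the same made-up pmf:

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])
ps = np.array([0.2, 0.5, 0.3])   # hypothetical pmf from the sketch above

e_x = np.sum(xs * ps)
e_x2 = np.sum(xs**2 * ps)
var_x = e_x2 - e_x**2                     # Var(X) = E[X^2] - (E[X])^2

# Var(aX + b) = a^2 Var(X): shifting by b does not change spread
a, b = 3.0, -1.0
ys = a * xs + b
var_y = np.sum(ys**2 * ps) - np.sum(ys * ps)**2
assert np.isclose(var_y, a**2 * var_x)

print(f"Var(X) = {var_x:.2f}, std = {np.sqrt(var_x):.2f}")
```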

Covariance:

\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

Correlation:

\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}

where -1 \leq \rho \leq 1.
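
A quick sketch estimating covariance and correlation from samples; the linear relationship between the hypothetical variables below is chosen so the true correlation is 0.8:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.8 * x + 0.6 * rng.normal(size=10_000)   # Var(Y) = 0.64 + 0.36 = 1

cov_xy = np.cov(x, y)[0, 1]            # sample covariance
rho_xy = np.corrcoef(x, y)[0, 1]       # sample correlation, in [-1, 1]
print(f"Cov(X, Y) ≈ {cov_xy:.2f}, rho ≈ {rho_xy:.2f}")   # both ≈ 0.8
```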

Common Probability Distributions

Bernoulli Distribution

Models a single binary trial (coin flip).

P(X = x) = p^x(1-p)^{1-x}, \quad x \in \{0, 1\}
  • Mean: \mathbb{E}[X] = p
  • Variance: \text{Var}(X) = p(1-p)

Binomial Distribution

Number of successes in n Bernoulli trials.

P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}
  • Mean: \mathbb{E}[X] = np
  • Variance: \text{Var}(X) = np(1-p)
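
A minimal check of the Bernoulli and binomial formulas against scipy.stats; p = 0.3, n = 10, and k = 4 are arbitrary illustrative values:

```python
from scipy import stats

p, n, k = 0.3, 10, 4

bern = stats.bernoulli(p)
print(bern.mean(), bern.var())         # p and p(1-p): 0.3, 0.21

binom = stats.binom(n, p)
print(binom.pmf(k))                    # C(10,4) * 0.3^4 * 0.7^6 ≈ 0.200
print(binom.mean(), binom.var())       # np = 3.0, np(1-p) = 2.1
```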

Gaussian (Normal) Distribution

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Notation: X \sim \mathcal{N}(\mu, \sigma^2)

  • Mean: \mathbb{E}[X] = \mu
  • Variance: \text{Var}(X) = \sigma^2

Standard Normal: \mathcal{N}(0, 1)

Properties:

  • Linear combinations of jointly Gaussian variables are Gaussian
  • If X \sim \mathcal{N}(\mu, \sigma^2), then aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)
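
A short sampling sketch of the linear-transform property; the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, a, b = 2.0, 1.5, 3.0, -4.0

x = rng.normal(mu, sigma, size=100_000)   # X ~ N(mu, sigma^2)
y = a * x + b                             # should be N(a*mu + b, a^2 sigma^2)

print(f"sample mean of Y ≈ {y.mean():.2f} (theory: {a*mu + b:.2f})")
print(f"sample var  of Y ≈ {y.var():.2f} (theory: {a**2 * sigma**2:.2f})")
```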

Multivariate Gaussian

f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)

Notation: \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})

  • Mean: \mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}
  • Covariance: \text{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}
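
A minimal sketch evaluating this density with scipy.stats.multivariate_normal; the 2-D mean and covariance below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])           # must be symmetric positive definite

mvn = multivariate_normal(mean=mu, cov=Sigma)
print(mvn.pdf([0.0, 1.0]))               # density at the mean
print(mvn.rvs(size=3, random_state=0))   # a few samples
```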

Exponential Distribution

Models time between events in a Poisson process.

f(x) = \lambda e^{-\lambda x}, \quad x \geq 0
  • Mean: \mathbb{E}[X] = \frac{1}{\lambda}
  • Variance: \text{Var}(X) = \frac{1}{\lambda^2}

Uniform Distribution

f(x) = \frac{1}{b-a}, \quad x \in [a, b]
  • Mean: \mathbb{E}[X] = \frac{a+b}{2}
  • Variance: \text{Var}(X) = \frac{(b-a)^2}{12}
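
A quick check of the exponential and uniform moments against scipy.stats (note scipy parameterizes the exponential by scale = 1/λ); λ = 2 and [a, b] = [1, 5] are arbitrary choices:

```python
from scipy import stats

lam = 2.0
expo = stats.expon(scale=1 / lam)
print(expo.mean(), expo.var())   # 1/lambda = 0.5, 1/lambda^2 = 0.25

a, b = 1.0, 5.0
unif = stats.uniform(loc=a, scale=b - a)
print(unif.mean(), unif.var())   # (a+b)/2 = 3.0, (b-a)^2/12 ≈ 1.333
```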

Maximum Likelihood Estimation (MLE)

Given data D = \{x_1, ..., x_n\} and model parameter \theta:

Likelihood (assuming i.i.d. samples, so the joint probability factorizes):

L(\theta) = P(D|\theta) = \prod_{i=1}^n P(x_i|\theta)

Log-Likelihood (easier to work with):

\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta)

MLE: Find the \theta that maximizes \ell(\theta):

\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)

Example: For a Gaussian \mathcal{N}(\mu, \sigma^2):

\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i, \quad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2
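
A minimal sketch of these estimators on simulated data (the true parameters μ = 2, σ = 1.5 are assumptions for the demo). Note that the MLE variance divides by n and is therefore slightly biased; the unbiased sample variance divides by n - 1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1_000)

mu_mle = data.mean()                      # (1/n) sum x_i
var_mle = np.mean((data - mu_mle) ** 2)   # (1/n) sum (x_i - mu_hat)^2

# scipy's norm.fit returns the MLE (loc, scale) = (mu_hat, sigma_hat)
loc, scale = stats.norm.fit(data)
assert np.isclose(loc, mu_mle) and np.isclose(scale**2, var_mle)

print(f"mu_MLE ≈ {mu_mle:.3f}, sigma^2_MLE ≈ {var_mle:.3f}")
```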