
Gaussian Discriminant Analysis (GDA)

A generative learning algorithm that models class-conditional distributions as multivariate Gaussians.

Key Assumptions

  1. Class-conditional distributions are multivariate Gaussian: $P(x|y) \sim \mathcal{N}(\mu_y, \Sigma)$
  2. Shared covariance matrix: Both classes use the same $\Sigma$ (Quadratic Discriminant Analysis (QDA) uses a different $\Sigma_y$ per class)
  3. Covariance matrix properties: $\Sigma$ must be symmetric and positive semi-definite (PSD):
    • Symmetric: $\Sigma = \Sigma^T$
    • PSD: $\mathbf{z}^T \Sigma \mathbf{z} \geq 0$ for all $\mathbf{z}$, or equivalently, all eigenvalues $\lambda_i \geq 0$
    • For GDA to work properly, $\Sigma$ should be positive definite (PD), i.e. invertible

Model

For binary classification $y \in \{0, 1\}$:

$$
\begin{align}
y &\sim \text{Bernoulli}(\phi) \\
x \mid y=0 &\sim \mathcal{N}(\mu_0, \Sigma) \\
x \mid y=1 &\sim \mathcal{N}(\mu_1, \Sigma)
\end{align}
$$

Parameters: $\phi, \mu_0, \mu_1, \Sigma$
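A minimal NumPy sketch of this generative process; the particular parameter values below are illustrative assumptions, not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed for the example)
phi = 0.4                                # P(y = 1)
mu0 = np.array([0.0, 0.0])               # mean of class 0
mu1 = np.array([2.0, 1.0])               # mean of class 1
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])           # shared covariance

def sample_gda(m):
    """Draw m samples from the GDA generative model."""
    y = rng.binomial(1, phi, size=m)                 # y ~ Bernoulli(phi)
    means = np.where(y[:, None] == 1, mu1, mu0)      # pick mu_y for each sample
    x = np.array([rng.multivariate_normal(mu, Sigma) for mu in means])
    return x, y

X, y = sample_gda(500)
```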

Maximum Likelihood Estimates

$$
\begin{align}
\phi &= \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=1\} \\
\mu_0 &= \frac{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=0\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=0\}} \\
\mu_1 &= \frac{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=1\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=1\}} \\
\Sigma &= \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T
\end{align}
$$
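These closed-form estimates translate directly to NumPy. A sketch, assuming `X` is an `(m, n)` array of features and `y` is an array of 0/1 labels:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with a shared covariance matrix."""
    m = X.shape[0]
    phi = np.mean(y == 1)                      # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)               # per-class sample means
    mu1 = X[y == 1].mean(axis=0)
    # Center each sample by the mean of its own class, then average the outer products
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma
```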

Decision Boundary

GDA produces a linear decision boundary (when the classes share the covariance matrix, $\Sigma_0 = \Sigma_1 = \Sigma$).

The decision boundary is given by:

$$
(\mu_1 - \mu_0)^T \Sigma^{-1} x = \frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right) + \log\frac{1-\phi}{\phi}
$$

This is linear in $x$ (an affine function), so the boundary has the same form as logistic regression's.
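A sketch of how the boundary's coefficients could be computed from fitted parameters; the names `w` and `b` are illustrative, not from the notes:

```python
import numpy as np

def linear_boundary(phi, mu0, mu1, Sigma):
    """Return (w, b) so that the GDA rule is: predict 1 if w @ x + b > 0."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) - np.log((1 - phi) / phi)
    return w, b
```

Prediction on a batch of samples is then simply `(X @ w + b > 0).astype(int)`.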

GDA vs Logistic Regression

| Aspect | GDA | Logistic Regression (LR) |
| --- | --- | --- |
| Type | Generative | Discriminative |
| Assumptions | Strong (Gaussian distributions) | Weak (only needs a linear decision boundary) |
| Data Efficiency | More efficient when assumptions hold | Needs more data |
| Robustness | Sensitive to assumption violations | More robust to distribution violations |
| Decision Boundary | Linear (with shared $\Sigma$) | Linear |
| Parameter Estimation | MLE of $P(x \mid y)$ and $P(y)$ | MLE of $P(y \mid x)$ directly |
| When to Use | Data truly Gaussian, small dataset | Large dataset, unknown distributions |

Key Insight: GDA and LR yield decision boundaries of the same linear form, but they estimate the parameters differently:

  • GDA: Models $P(x|y)$ and $P(y)$ → derives $P(y|x)$ via Bayes' rule (see the numerical check after this list)
  • LR: Directly models $P(y|x)$ parametrically
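A numerical check of the Bayes'-rule route against the sigmoid-of-an-affine-function form; this is a sketch with assumed parameter values, using SciPy's multivariate normal density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed example parameters
phi = 0.4
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

x = np.array([1.0, 0.5])

# GDA route: Bayes' rule with class-conditional Gaussians
p_x_y1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
p_x_y0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma)
posterior = phi * p_x_y1 / (phi * p_x_y1 + (1 - phi) * p_x_y0)

# LR-shaped route: sigmoid of an affine function of x
Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu0)
b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) + np.log(phi / (1 - phi))
sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + b)))

print(posterior, sigmoid)  # the two values agree
```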

Generalization and Reduction

Generalization: GDA fits into a broader family of models:

  • QDA (Quadratic Discriminant Analysis): A different $\Sigma_y$ per class → produces quadratic boundaries (see the sketch after this list)
  • Naive Bayes: Conditional-independence assumption → $\Sigma$ is diagonal
  • General exponential family: Replace the Gaussian with other exponential-family distributions (see GLM)
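For contrast with the shared-covariance model, a minimal sketch of a QDA-style score (per-class covariance, hence quadratic in $x$); the function and parameter names are illustrative:

```python
import numpy as np

def qda_score(x, phi, mu0, mu1, Sigma0, Sigma1):
    """Log posterior odds for QDA; quadratic in x because Sigma0 != Sigma1.
    Predict class 1 when the score is positive."""
    def log_gaussian(x, mu, Sigma):
        d = x - mu
        _, logdet = np.linalg.slogdet(Sigma)        # log|Sigma|, stable for PD matrices
        return -0.5 * (d @ np.linalg.solve(Sigma, d) + logdet)
    return (log_gaussian(x, mu1, Sigma1) - log_gaussian(x, mu0, Sigma0)
            + np.log(phi / (1 - phi)))
```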

Reduction to Special Cases:

  • If features are independent: $\Sigma$ becomes diagonal
  • If classes are balanced and the means are equal: the model reduces to random guessing
  • If $\Sigma$ is the identity matrix: the decision boundary depends only on the Euclidean distance to the class means (plus the prior term); with balanced classes this is the nearest-mean rule (sketched below)
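A small sketch of the last case, assuming $\Sigma = I$ and $\phi = 0.5$; the helper name is illustrative:

```python
import numpy as np

def nearest_mean_predict(X, mu0, mu1):
    """With Sigma = I and phi = 0.5, GDA assigns each x to the closer class mean."""
    d0 = np.linalg.norm(X - mu0, axis=1)   # Euclidean distance to class-0 mean
    d1 = np.linalg.norm(X - mu1, axis=1)   # Euclidean distance to class-1 mean
    return (d1 < d0).astype(int)
```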

Why Covariance Must Be PSD

The covariance matrix $\Sigma$ must be PSD because:

  1. Variance is always non-negative: For any vector $\mathbf{z}$, $\text{Var}(\mathbf{z}^T x) = \mathbf{z}^T \Sigma \mathbf{z} \geq 0$
  2. Inverse exists: GDA needs $\Sigma^{-1}$, so $\Sigma$ must be positive definite (PD), i.e. all eigenvalues strictly positive
  3. Physical interpretation: Covariance captures how features co-vary; negative variance is meaningless

Ensuring PSD: The MLE formula $\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$ automatically produces a PSD matrix because it is a sum of outer products $(x - \mu)(x - \mu)^T$, each of which is PSD.
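A quick numerical sanity check of this claim (a sketch; it centers by a single mean for brevity, but class-conditional centering behaves the same way, and `eigvalsh` is used because the matrix is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))             # any data matrix
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)   # average of outer products

eigvals = np.linalg.eigvalsh(Sigma)      # eigenvalues of a symmetric matrix
assert np.all(eigvals >= -1e-12)         # non-negative up to floating-point error
assert np.allclose(Sigma, Sigma.T)       # symmetric by construction
```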