
Generative Learning Algorithms

Generative learning algorithms take a different approach from discriminative algorithms. Instead of learning $P(y|x)$ (the probability of $y$ given $x$) directly, generative algorithms model:

  1. $P(x|y)$: The probability of features $x$ given class $y$
  2. $P(y)$: The prior probability of each class

Then use Bayes' rule to compute $P(y|x)$:

$$P(y|x) = \frac{P(x|y)P(y)}{P(x)}$$
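
For intuition, here is a small worked example with made-up numbers (not from the original notes): suppose 20% of emails are spam, so $P(y=1) = 0.2$, and a particular word appears in 40% of spam emails but only 5% of non-spam emails. For an email containing that word:

$$P(y=1|x) = \frac{0.4 \cdot 0.2}{0.4 \cdot 0.2 + 0.05 \cdot 0.8} = \frac{0.08}{0.12} \approx 0.67$$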

Discriminative vs Generative

Discriminative Algorithms

  • Learn $P(y|x)$ directly
  • Examples: Logistic Regression, SVM, Neural Networks
  • Focus on the decision boundary

Generative Algorithms

  • Model $P(x|y)$ and $P(y)$
  • Examples: Gaussian Discriminant Analysis, Naive Bayes
  • Can generate new data samples
  • Often work well with less training data

Gaussian Discriminant Analysis (GDA)

GDA assumes that $P(x|y)$ follows a multivariate Gaussian distribution.

Multivariate Gaussian Distribution

For a random vector $x \in \mathbb{R}^n$:

$$P(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

where:

  • $\mu \in \mathbb{R}^n$ is the mean vector
  • $\Sigma \in \mathbb{R}^{n \times n}$ is the covariance matrix
  • $|\Sigma|$ is the determinant of $\Sigma$

We write this as $x \sim \mathcal{N}(\mu, \Sigma)$.
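
As a minimal sketch (my own illustration, not from the notes), this density can be evaluated directly with NumPy; `x`, `mu`, and `Sigma` below are arbitrary placeholder values:

    import numpy as np

    def gaussian_density(x, mu, Sigma):
        """Evaluate the multivariate Gaussian density N(mu, Sigma) at x."""
        n = mu.shape[0]
        diff = x - mu
        norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
        # (x - mu)^T Sigma^{-1} (x - mu), using solve instead of an explicit inverse
        quad_form = diff @ np.linalg.solve(Sigma, diff)
        return norm_const * np.exp(-0.5 * quad_form)

    x = np.array([1.0, 2.0])
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    print(gaussian_density(x, mu, Sigma))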

GDA Model for Binary Classification

For $y \in \{0, 1\}$:

$$
\begin{align}
y &\sim \text{Bernoulli}(\phi) \\
x|y=0 &\sim \mathcal{N}(\mu_0, \Sigma) \\
x|y=1 &\sim \mathcal{N}(\mu_1, \Sigma)
\end{align}
$$

Parameters to estimate:

  • $\phi = P(y=1)$: Prior probability
  • $\mu_0$: Mean of class 0
  • $\mu_1$: Mean of class 1
  • $\Sigma$: Shared covariance matrix

Maximum Likelihood Estimation

Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$, the maximum likelihood estimates of the parameters are:

$$
\begin{align}
\phi &= \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=1\} \\
\mu_0 &= \frac{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=0\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=0\}} \\
\mu_1 &= \frac{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=1\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=1\}} \\
\Sigma &= \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T
\end{align}
$$
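
As a rough NumPy sketch of these estimates (my own illustration; `X` is an $m \times n$ data matrix and `y` a 0/1 label vector):

    import numpy as np

    def fit_gda(X, y):
        """Maximum likelihood estimates for GDA with a shared covariance matrix."""
        m = X.shape[0]
        phi = np.mean(y == 1)          # fraction of positive examples
        mu0 = X[y == 0].mean(axis=0)   # mean of class 0
        mu1 = X[y == 1].mean(axis=0)   # mean of class 1
        # Subtract each example's own class mean, then pool the outer products
        centered = X - np.where((y == 1)[:, None], mu1, mu0)
        Sigma = centered.T @ centered / m
        return phi, mu0, mu1, Sigma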

Making Predictions

To classify a new example $x$, compute:

$$\arg\max_y P(y|x) = \arg\max_y \frac{P(x|y)P(y)}{P(x)} = \arg\max_y P(x|y)P(y)$$

Since $P(x)$ is the same for all classes, we can ignore it.
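
Continuing the sketch above (the helper `fit_gda` is my own, illustrative only), prediction reduces to comparing $\log P(x|y) + \log P(y)$ for the two classes; the shared Gaussian normalization constant cancels:

    import numpy as np

    def predict_gda(x, phi, mu0, mu1, Sigma):
        """Return 1 if log P(x|y=1) + log P(y=1) beats the class-0 score."""
        Sigma_inv = np.linalg.inv(Sigma)

        def log_score(mu, prior):
            diff = x - mu
            return -0.5 * diff @ Sigma_inv @ diff + np.log(prior)

        return int(log_score(mu1, phi) > log_score(mu0, 1 - phi))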

GDA vs Logistic Regression

  • GDA makes stronger assumptions (class-conditional Gaussian features with a shared covariance)
  • If those assumptions hold, GDA is more data-efficient: it needs fewer training examples to reach a given accuracy
  • Logistic regression is more robust and works well even when the Gaussian assumption is violated
  • With a shared covariance matrix, GDA's decision boundary is linear (the same functional form as logistic regression)

Naive Bayes

Naive Bayes is particularly useful when features are discrete (e.g., text classification, spam detection).

The Naive Bayes Assumption

Assumption: Features are conditionally independent given the class label.

For features $x_1, x_2, \ldots, x_n$:

$$P(x_1, \ldots, x_n|y) = P(x_1|y) \cdot P(x_2|y) \cdots P(x_n|y) = \prod_{j=1}^{n} P(x_j|y)$$

This assumption is "naive" because features are often correlated, but it works surprisingly well in practice!
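
One way to see what the assumption buys (a quick count, my own illustration): with $n$ binary features, modeling the full joint $P(x_1, \ldots, x_n|y)$ would require on the order of $2^n - 1$ parameters per class, while the factored Naive Bayes form needs only $n$ parameters per class (one $P(x_j=1|y)$ for each feature).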

Naive Bayes Classifier

Using Bayes' rule:

$$P(y|x) = \frac{P(x|y)P(y)}{P(x)} = \frac{P(y) \prod_{j=1}^{n} P(x_j|y)}{P(x)}$$

Prediction:

$$\hat{y} = \arg\max_y P(y) \prod_{j=1}^{n} P(x_j|y)$$

Types of Naive Bayes

1. Multinomial Naive Bayes

For discrete features (e.g., word counts in text):

$$P(x_j|y) = \frac{\text{count}(x_j, y) + \alpha}{\text{count}(y) + \alpha n}$$

where $\alpha$ is a smoothing parameter (Laplace smoothing) and $n$ is the number of possible feature values (e.g., the vocabulary size).

Use case: Text classification, document categorization

2. Gaussian Naive Bayes

For continuous features (assumes each feature is Gaussian distributed within each class):

$$P(x_j|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_j - \mu_y)^2}{2\sigma_y^2}\right)$$

Use case: Any problem with continuous features

3. Bernoulli Naive Bayes

For binary features:

$$P(x_j|y) = P(x_j=1|y) \cdot x_j + (1 - P(x_j=1|y)) \cdot (1-x_j)$$

Use case: Text classification with binary occurrence features
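
As a hedged usage sketch (assuming scikit-learn is available; this is not part of the original notes), the three variants map directly onto `GaussianNB`, `MultinomialNB`, and `BernoulliNB`:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

    # Toy data: 6 documents, 4 vocabulary words, binary labels (illustrative only)
    X_counts = np.array([[2, 0, 1, 3], [0, 1, 0, 0], [4, 0, 2, 1],
                         [0, 2, 0, 0], [3, 1, 1, 2], [0, 3, 0, 1]])
    y = np.array([1, 0, 1, 0, 1, 0])

    multinomial = MultinomialNB(alpha=1.0).fit(X_counts, y)                # word counts
    bernoulli = BernoulliNB(alpha=1.0).fit((X_counts > 0).astype(int), y)  # presence/absence
    gaussian = GaussianNB().fit(X_counts.astype(float), y)                 # continuous features

    print(multinomial.predict(X_counts[:1]))  # predict on one example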

Laplace Smoothing

To avoid zero probabilities for unseen feature-class combinations, we add a small constant $\alpha$ (typically 1):

$$P(x_j|y) = \frac{\text{count}(x_j, y) + \alpha}{\text{count}(y) + \alpha n}$$
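
For a concrete (made-up) case: if a word never appears among 500 spam training emails, the unsmoothed estimate is $0/500 = 0$, which would zero out the entire product for any email containing that word. With $\alpha = 1$ and a vocabulary of $n = 10{,}000$:

$$P(x_j|y=1) = \frac{0 + 1}{500 + 1 \cdot 10000} = \frac{1}{10500} \approx 9.5 \times 10^{-5}$$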

Example: Spam Classification

Let's classify emails as spam or not spam using Naive Bayes.

Setup

  • Vocabulary: 10,000 words
  • Each email is represented as a vector $x \in \{0,1\}^{10000}$
  • $x_j = 1$ if word $j$ appears in the email, 0 otherwise
  • $y \in \{0, 1\}$, where $y=1$ means spam

Training

Compute, for each word $j$ and class $y$:

$$
\begin{align}
\phi_{j|y=1} &= P(x_j=1|y=1) \\
\phi_{j|y=0} &= P(x_j=1|y=0) \\
\phi_y &= P(y=1)
\end{align}
$$

Maximum likelihood estimates:

$$
\begin{align}
\phi_{j|y=1} &= \frac{\sum_{i=1}^{m} \mathbb{1}\{x_j^{(i)}=1 \land y^{(i)}=1\}}{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=1\}} \\
\phi_{j|y=0} &= \frac{\sum_{i=1}^{m} \mathbb{1}\{x_j^{(i)}=1 \land y^{(i)}=0\}}{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=0\}} \\
\phi_y &= \frac{\sum_{i=1}^{m} \mathbb{1}\{y^{(i)}=1\}}{m}
\end{align}
$$
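
As a minimal NumPy sketch of these estimates (my own illustration; `X` is an $m \times 10000$ binary word-presence matrix and `y` the spam labels). Note that for binary features the Laplace-smoothed denominator uses $+2\alpha$, since each $x_j$ has only two possible values:

    import numpy as np

    def fit_bernoulli_nb(X, y, alpha=1.0):
        """Estimate phi_{j|y=1}, phi_{j|y=0}, and phi_y with Laplace smoothing."""
        phi_y = np.mean(y == 1)
        n_spam = (y == 1).sum()
        n_ham = (y == 0).sum()
        # Fraction of spam / non-spam emails containing each word, smoothed
        phi_j_spam = (X[y == 1].sum(axis=0) + alpha) / (n_spam + 2 * alpha)
        phi_j_ham = (X[y == 0].sum(axis=0) + alpha) / (n_ham + 2 * alpha)
        return phi_y, phi_j_spam, phi_j_ham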

Prediction

For a new email $x$:

$$\arg\max_y P(y) \prod_{j=1}^{10000} P(x_j|y)$$

In practice, compute log probabilities to avoid numerical underflow:

$$\arg\max_y \log P(y) + \sum_{j=1}^{10000} \log P(x_j|y)$$
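
A short sketch of this log-space prediction (continuing the hypothetical `fit_bernoulli_nb` above, and using the Bernoulli form of $P(x_j|y)$ from earlier):

    import numpy as np

    def predict_spam(x, phi_y, phi_j_spam, phi_j_ham):
        """Classify a binary word-presence vector x; returns 1 for spam."""
        def log_score(prior, phi_j):
            # log P(y) + sum_j log P(x_j|y) for the Bernoulli model
            return np.log(prior) + np.sum(x * np.log(phi_j) + (1 - x) * np.log(1 - phi_j))

        return int(log_score(phi_y, phi_j_spam) > log_score(1 - phi_y, phi_j_ham))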

Advantages and Limitations

Advantages

  • Fast to train and predict
  • Works well with small datasets
  • Can handle high-dimensional data
  • Simple and interpretable
  • Works well for text classification
  • Can handle missing values easily

Limitations

  • Strong independence assumption (often violated in practice)
  • If assumption is badly violated, performance can suffer
  • Not great for regression tasks
  • Struggles with correlated features

Practical Tips

  1. Feature engineering is important:

    • Remove highly correlated features
    • Use domain knowledge to create meaningful features
  2. Always use smoothing to handle zero probabilities

  3. Log probabilities prevent numerical underflow:

    import numpy as np  # assumes `prior` = P(y) and `likelihoods` = array of P(x_j|y)
    log_prob = np.log(prior) + np.sum(np.log(likelihoods))
  4. Try different Naive Bayes variants depending on your feature types

  5. For text:

    • Remove stop words
    • Use TF-IDF instead of raw counts
    • Consider n-grams
  6. Naive Bayes is a great baseline model - try it first before complex models!

When to Use Which Algorithm?

| Algorithm | Best For | Assumptions |
| --- | --- | --- |
| GDA | Continuous features, Gaussian-like data | Features are Gaussian distributed |
| Gaussian NB | Continuous features, fast training needed | Features are independent and Gaussian |
| Multinomial NB | Text classification, discrete counts | Features are independent counts |
| Bernoulli NB | Binary features, presence/absence | Features are independent binary |

Key Takeaway: Generative models learn the distribution of data for each class and use Bayes' rule to make predictions. They often work well with less data but make stronger assumptions than discriminative models.