Generative Learning Algorithms
Generative learning algorithms take a different approach compared to discriminative algorithms. Instead of learning $p(y|x)$ directly (the probability of $y$ given $x$), generative algorithms model:
- $p(x|y)$: The probability of the features $x$ given the class $y$
- $p(y)$: The prior probability of each class
Then use Bayes' rule to compute $p(y|x)$:

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$$
Discriminative vs Generative
Discriminative Algorithms
- Learn $p(y|x)$ directly
- Examples: Logistic Regression, SVM, Neural Networks
- Focus on the decision boundary
Generative Algorithms
- Model $p(x|y)$ and $p(y)$
- Examples: Gaussian Discriminant Analysis, Naive Bayes
- Can generate new data samples
- Often work well with less training data
Gaussian Discriminant Analysis (GDA)
GDA assumes that $p(x|y)$ follows a multivariate Gaussian distribution.
Multivariate Gaussian Distribution
For a random vector $x \in \mathbb{R}^d$, the density is:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
where:
- $\mu \in \mathbb{R}^d$ is the mean vector
- $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix
- $|\Sigma|$ is the determinant of $\Sigma$
We write this as $x \sim \mathcal{N}(\mu, \Sigma)$.
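As a quick sanity check of the density formula, here is a minimal NumPy sketch (the function name `gaussian_pdf` and the example values are illustrative, not from a particular library):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(mu, Sigma) at x."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)  # -1/2 (x-mu)^T Sigma^{-1} (x-mu)
    return norm_const * np.exp(exponent)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_pdf(np.array([0.0, 0.0]), mu, Sigma))  # density at the mean
```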
GDA Model for Binary Classification
For binary labels $y \in \{0, 1\}$, the model is:

$$y \sim \mathrm{Bernoulli}(\phi)$$
$$x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma)$$
$$x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma)$$
Parameters to estimate:
- $\phi$: Prior probability $p(y=1)$
- $\mu_0$: Mean of class 0
- $\mu_1$: Mean of class 1
- $\Sigma$: Shared covariance matrix
Maximum Likelihood Estimation
Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, the maximum likelihood estimates of the parameters are:

$$\phi = \frac{1}{n}\sum_{i=1}^{n} 1\{y^{(i)}=1\}$$

$$\mu_0 = \frac{\sum_{i=1}^{n} 1\{y^{(i)}=0\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)}=0\}}, \qquad \mu_1 = \frac{\sum_{i=1}^{n} 1\{y^{(i)}=1\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)}=1\}}$$

$$\Sigma = \frac{1}{n}\sum_{i=1}^{n} \left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^T$$
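A minimal NumPy sketch of these estimates, assuming `X` is an `(n, d)` array of features and `y` a 0/1 label vector (the function name `gda_fit` is just illustrative):

```python
import numpy as np

def gda_fit(X, y):
    """Maximum likelihood estimates for GDA with a shared covariance matrix."""
    n, _ = X.shape
    phi = np.mean(y == 1)                 # p(y = 1)
    mu0 = X[y == 0].mean(axis=0)          # mean of class 0
    mu1 = X[y == 1].mean(axis=0)          # mean of class 1
    centered = X - np.where(y[:, None] == 1, mu1, mu0)  # subtract each example's class mean
    Sigma = centered.T @ centered / n     # shared covariance
    return phi, mu0, mu1, Sigma
```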
Making Predictions
To classify a new example $x$, compute:

$$\hat{y} = \arg\max_{y}\, p(y \mid x) = \arg\max_{y}\, \frac{p(x \mid y)\,p(y)}{p(x)} = \arg\max_{y}\, p(x \mid y)\,p(y)$$

Since $p(x)$ is the same for all classes, we can ignore it.
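Continuing the sketch above, a hypothetical `gda_predict` compares $\log p(x \mid y) + \log p(y)$ for the two classes, working in log space for numerical stability:

```python
import numpy as np

def gda_predict(x, phi, mu0, mu1, Sigma):
    """Predict the class of x by comparing log p(x|y) + log p(y) for y in {0, 1}."""
    def log_gaussian(v, mu):
        # log N(v; mu, Sigma) = -1/2 [ (v-mu)^T Sigma^{-1} (v-mu) + log|Sigma| + d log(2*pi) ]
        diff = v - mu
        _, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * (diff @ np.linalg.solve(Sigma, diff) + logdet + len(mu) * np.log(2 * np.pi))

    score0 = log_gaussian(x, mu0) + np.log(1 - phi)  # unnormalized log p(y=0 | x)
    score1 = log_gaussian(x, mu1) + np.log(phi)      # unnormalized log p(y=1 | x)
    return int(score1 > score0)
```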
GDA vs Logistic Regression
- GDA makes stronger assumptions (Gaussian distribution)
- If assumptions are correct, GDA is more efficient (needs less data)
- Logistic regression is more robust and works well even if assumptions are violated
- With a shared covariance matrix, the GDA decision boundary is linear, and the posterior $p(y=1 \mid x)$ takes the same logistic form as logistic regression
Naive Bayes
Naive Bayes is particularly useful when features are discrete (e.g., text classification, spam detection).
The Naive Bayes Assumption
Assumption: Features are conditionally independent given the class label.
For features $x_1, x_2, \ldots, x_d$:

$$p(x_1, \ldots, x_d \mid y) = \prod_{j=1}^{d} p(x_j \mid y)$$
This assumption is "naive" because features are often correlated, but it works surprisingly well in practice!
Naive Bayes Classifier
Using Bayes' rule:

$$p(y \mid x) = \frac{p(y)\prod_{j=1}^{d} p(x_j \mid y)}{p(x)}$$

Prediction:

$$\hat{y} = \arg\max_{y}\; p(y)\prod_{j=1}^{d} p(x_j \mid y)$$
Types of Naive Bayes
1. Multinomial Naive Bayes
For discrete features (e.g., word counts in text):

$$p(x_j \mid y) = \frac{\mathrm{count}(x_j, y) + \alpha}{\sum_{j'=1}^{d} \mathrm{count}(x_{j'}, y) + \alpha d}$$

where $\alpha$ is a smoothing parameter (Laplace smoothing).
Use case: Text classification, document categorization
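As a sketch of the smoothed estimate above, assuming `counts` is a hypothetical `(num_classes, vocab_size)` array where `counts[k, j]` is the total count of word `j` across the training documents of class `k`:

```python
import numpy as np

def multinomial_nb_log_likelihoods(counts, alpha=1.0):
    """Per-class word log-probabilities log p(x_j | y) with add-alpha (Laplace) smoothing."""
    smoothed = counts + alpha                                      # add alpha to every word count
    return np.log(smoothed / smoothed.sum(axis=1, keepdims=True))  # normalize within each class
```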
2. Gaussian Naive Bayes
For continuous features (assumes each feature is Gaussian within each class):

$$p(x_j \mid y) = \frac{1}{\sqrt{2\pi\sigma_{j,y}^2}} \exp\!\left(-\frac{(x_j - \mu_{j,y})^2}{2\sigma_{j,y}^2}\right)$$
Use case: Any problem with continuous features
3. Bernoulli Naive Bayes
For binary features $x_j \in \{0, 1\}$:

$$p(x_j \mid y) = \phi_{j \mid y}^{\,x_j}\,(1 - \phi_{j \mid y})^{1 - x_j}, \qquad \text{where } \phi_{j \mid y} = p(x_j = 1 \mid y)$$
Use case: Text classification with binary occurrence features
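If you use scikit-learn, all three variants share the same fit/predict interface; a minimal usage sketch with made-up toy data:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0]])    # word counts (toy data)
X_binary = (X_counts > 0).astype(int)                     # presence/absence features
X_real = np.array([[0.1, 2.3], [1.5, -0.2], [0.3, 1.1]])  # continuous features (toy data)
y = np.array([1, 0, 1])

MultinomialNB(alpha=1.0).fit(X_counts, y)  # discrete counts (e.g., bag of words)
BernoulliNB(alpha=1.0).fit(X_binary, y)    # binary occurrence features
GaussianNB().fit(X_real, y)                # continuous, per-feature Gaussian
```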
Laplace Smoothing
To avoid zero probabilities for unseen feature-class combinations, we add a small constant (typically 1) to each count. For binary features, the smoothed estimate is:

$$\phi_{j \mid y=k} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)}=1,\; y^{(i)}=k\} + 1}{\sum_{i=1}^{n} 1\{y^{(i)}=k\} + 2}$$
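As a worked instance with made-up counts: if a word appears in none of the 100 spam emails in the training set, the smoothed estimate is

$$\phi_{j \mid y=1} = \frac{0 + 1}{100 + 2} \approx 0.0098$$

rather than 0, so a single unseen word no longer zeroes out the whole product.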
Example: Spam Classification
Let's classify emails as spam or not spam using Naive Bayes.
Setup
- Vocabulary: 10,000 words
- Each email is represented as a binary vector $x \in \{0, 1\}^{10000}$
- $x_j = 1$ if word $j$ appears in the email, 0 otherwise
- $y \in \{0, 1\}$, where $y = 1$ means spam
Training
Compute $\phi_{j \mid y}$ for each word $j$ and class $y$, along with the class prior $\phi_y$.

Maximum likelihood estimates:

$$\phi_{j \mid y=1} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)}=1,\; y^{(i)}=1\}}{\sum_{i=1}^{n} 1\{y^{(i)}=1\}}, \qquad \phi_{j \mid y=0} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)}=1,\; y^{(i)}=0\}}{\sum_{i=1}^{n} 1\{y^{(i)}=0\}}$$

$$\phi_y = \frac{1}{n}\sum_{i=1}^{n} 1\{y^{(i)}=1\}$$
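A minimal NumPy sketch of these estimates with Laplace smoothing folded in, assuming `X` is an `(n, 10000)` binary matrix and `y` the 0/1 spam labels (the function name `fit_bernoulli_nb` is illustrative):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Estimate the class prior phi_y and per-word probabilities phi_{j|y}, with Laplace smoothing."""
    phi_y = np.mean(y == 1)                                       # p(y = 1), the spam prior
    # p(x_j = 1 | y), with add-1 smoothing over the two outcomes {0, 1}
    phi_spam = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_ham = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_spam, phi_ham
```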
Prediction
For a new email $x$:

$$p(y=1 \mid x) = \frac{\phi_y \prod_{j=1}^{d} p(x_j \mid y=1)}{\phi_y \prod_{j=1}^{d} p(x_j \mid y=1) + (1-\phi_y)\prod_{j=1}^{d} p(x_j \mid y=0)}$$

In practice, compute log probabilities to avoid numerical underflow: compare $\log \phi_y + \sum_j \log p(x_j \mid y=1)$ against $\log(1-\phi_y) + \sum_j \log p(x_j \mid y=0)$.
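Continuing the sketch above, prediction in log space might look like this (names are again illustrative):

```python
import numpy as np

def predict_spam(x, phi_y, phi_spam, phi_ham):
    """Return 1 (spam) if the unnormalized log-posterior for spam exceeds the one for ham."""
    # log p(x | y) = sum_j [ x_j * log(phi_{j|y}) + (1 - x_j) * log(1 - phi_{j|y}) ]
    log_spam = np.log(phi_y) + np.sum(x * np.log(phi_spam) + (1 - x) * np.log(1 - phi_spam))
    log_ham = np.log(1 - phi_y) + np.sum(x * np.log(phi_ham) + (1 - x) * np.log(1 - phi_ham))
    return int(log_spam > log_ham)
```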
Advantages and Limitations
Advantages
- Fast to train and predict
- Works well with small datasets
- Can handle high-dimensional data
- Simple and interpretable
- Works well for text classification
- Can handle missing values easily
Limitations
- Strong independence assumption (often violated in practice)
- If assumption is badly violated, performance can suffer
- Not great for regression tasks
- Struggles with correlated features
Practical Tips
Feature engineering is important:
- Remove highly correlated features
- Use domain knowledge to create meaningful features
Always use smoothing to handle zero probabilities
Log probabilities prevent numerical underflow:
```python
log_prob = np.log(prior) + np.sum(np.log(likelihoods))
```

Try different Naive Bayes variants depending on your feature types.
For text:
- Remove stop words
- Use TF-IDF instead of raw counts
- Consider n-grams
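With scikit-learn, stop-word removal, TF-IDF weighting, and n-grams can all be handled by the vectorizer; a minimal sketch with made-up toy documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["win a free prize now", "meeting agenda for monday"]    # toy documents
labels = [1, 0]                                                 # 1 = spam, 0 = not spam

model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),  # drop stop words, unigrams + bigrams
    MultinomialNB(alpha=1.0),
)
model.fit(docs, labels)
print(model.predict(["free prize meeting"]))
```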
Naive Bayes is a great baseline model - try it first before complex models!
When to Use Which Algorithm?
| Algorithm | Best For | Assumptions |
|---|---|---|
| GDA | Continuous features, Gaussian-like data | Features are Gaussian distributed |
| Gaussian NB | Continuous features, fast training needed | Features are independent and Gaussian |
| Multinomial NB | Text classification, discrete counts | Features are independent counts |
| Bernoulli NB | Binary features, presence/absence | Features are independent binary |
Key Takeaway: Generative models learn the distribution of data for each class and use Bayes' rule to make predictions. They often work well with less data but make stronger assumptions than discriminative models.