Naive Bayes
Naive Bayes is a generative learning algorithm based on Bayes' Theorem with the "naive" assumption that features are conditionally independent given the class label.
Algorithm Overview
For classification, Naive Bayes predicts the class that maximizes the posterior probability:
$\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$

where:
- $P(y)$ = prior probability of class $y$
- $P(x_i \mid y)$ = probability of feature $x_i$ given class $y$
- The "naive" assumption: $P(x \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$ (features are independent given the class)
Training Procedure
Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$:
Estimate Prior Probabilities:
$P(y = k) = \frac{\text{count}(y^{(i)} = k)}{m}$

For binary classification: $P(y = 1) = \frac{\text{number of class 1 examples}}{m}$
Estimate Class-Conditional Probabilities:
For discrete features (e.g., binary, categorical):
$P(x_i = v \mid y = k) = \frac{\text{count}(x_i = v \text{ and } y = k)}{\text{count}(y = k)}$

For continuous features, typically assume a Gaussian distribution:

$P(x_i \mid y = k) = \mathcal{N}(\mu_{ik}, \sigma_{ik}^2)$

where $\mu_{ik}$ and $\sigma_{ik}^2$ are the mean and variance of feature $i$ for class $k$.
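As a minimal sketch of this counting procedure, assuming binary (0/1) features in a NumPy array `X` of shape (m, n) and integer labels `y` (the function and variable names are illustrative, not from the original notes):

```python
import numpy as np

def train_naive_bayes(X, y):
    """Estimate priors and Bernoulli class-conditional probabilities by counting."""
    m, n = X.shape
    priors, cond_probs = {}, {}
    for k in np.unique(y):
        X_k = X[y == k]                    # training examples with label k
        priors[k] = len(X_k) / m           # P(y = k) = count(y = k) / m
        cond_probs[k] = X_k.mean(axis=0)   # P(x_i = 1 | y = k) for each feature i
    return priors, cond_probs
```

For continuous Gaussian features, the analogous step would instead store the per-class mean and variance of each feature.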
Prediction Procedure
For a new example $x = (x_1, x_2, \ldots, x_n)$:
Compute Unnormalized Scores for each class k:
$\text{score}(k) = P(y = k) \prod_{i=1}^{n} P(x_i \mid y = k)$

Predict the Class with highest score:

$\hat{y} = \arg\max_k \text{score}(k)$

Optional: Normalize to Get Probabilities:

$P(y = k \mid x) = \frac{\text{score}(k)}{\sum_j \text{score}(j)}$
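Continuing the same illustrative binary-feature sketch, prediction computes the unnormalized score for each class, takes the argmax, and optionally normalizes (the plain-product form is used here for clarity; the log-probability section below gives the numerically stable version):

```python
import numpy as np

def predict_naive_bayes(x, priors, cond_probs):
    """Return (predicted class, normalized posteriors) for one binary example x."""
    scores = {}
    for k, prior in priors.items():
        p = cond_probs[k]                                   # P(x_i = 1 | y = k)
        likelihood = np.prod(np.where(x == 1, p, 1 - p))    # prod_i P(x_i | y = k)
        scores[k] = prior * likelihood                      # score(k) = P(y = k) * likelihood
    total = sum(scores.values())
    posteriors = {k: s / total for k, s in scores.items()}  # score(k) / sum_j score(j)
    return max(scores, key=scores.get), posteriors
```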
Example Calculation:
Given $x = (1,1,1,0,1,0,0,1,0)$ and classes $\{1, 7\}$:
- Prior: $P(y=1) = 4/9$, $P(y=7) = 5/9$
- For class $y=1$: $\prod_{i=1}^{9} P(x_i \mid y=1) = (1/4)\cdot(2/4)\cdot(3/4)\cdot(3/4)\cdot(1/4)\cdot(1/4)\cdot(3/4)\cdot(1/4)\cdot(2/4) = 108/4^9$
- Unnormalized: $P(y=1)\prod_i P(x_i \mid y=1) = (4/9)\cdot(108/4^9) \approx 0.000183$
- Unnormalized: $P(y=7)\prod_i P(x_i \mid y=7) = (5/9)\cdot(20736/5^9) \approx 0.005898$
- Normalized: $P(y=1 \mid x) \approx 0.03$, $P(y=7 \mid x) \approx 0.97$
- Prediction: $y = 7$ (highest probability)
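The arithmetic above can be checked in a few lines of Python using only the numbers stated in the example:

```python
from math import prod

prior_1, prior_7 = 4/9, 5/9
likelihood_1 = prod([1/4, 2/4, 3/4, 3/4, 1/4, 1/4, 3/4, 1/4, 2/4])  # 108 / 4**9
likelihood_7 = 20736 / 5**9                                          # product stated for class 7

score_1 = prior_1 * likelihood_1        # ~0.000183
score_7 = prior_7 * likelihood_7        # ~0.005898
total = score_1 + score_7
print(score_1 / total, score_7 / total) # ~0.03 and ~0.97, so predict y = 7
```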
Laplace Smoothing (Add-α Smoothing)
Problem: If a feature value $x_i = v$ never occurs with class $y = k$ in training, then $P(x_i = v \mid y = k) = 0$, causing the entire product to be zero.
Solution: Add pseudo-counts to avoid zero probabilities:
$P(x_i = v \mid y = k) = \frac{\text{count}(x_i = v \text{ and } y = k) + \alpha}{\text{count}(y = k) + \alpha \cdot |\text{values of } x_i|}$

where:
- $\alpha > 0$ is the smoothing parameter (typically $\alpha = 1$ for Laplace smoothing)
- $|\text{values of } x_i|$ is the number of possible values feature $x_i$ can take
Special Case - Binary Features:
$P(x_i = 1 \mid y = k) = \frac{\text{count}(x_i = 1 \text{ and } y = k) + \alpha}{\text{count}(y = k) + 2\alpha}$

With $\alpha = 1$ (Laplace smoothing), this ensures every probability is at least $\frac{1}{\text{count}(y = k) + 2}$, preventing zeros.
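In the illustrative training sketch above, the class-conditional estimate would change from a plain mean to smoothed counts, for example:

```python
import numpy as np

def smoothed_bernoulli_probs(X_k, alpha=1.0):
    """Smoothed P(x_i = 1 | y = k) for the binary examples X_k belonging to one class."""
    count_k = X_k.shape[0]                   # count(y = k)
    count_ones = X_k.sum(axis=0)             # count(x_i = 1 and y = k), per feature
    # alpha pseudo-counts for each of the 2 possible values of a binary feature
    return (count_ones + alpha) / (count_k + 2 * alpha)
```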
Numerical Stability: Log Probabilities
Problem: Multiplying many small probabilities (each <1) causes numerical underflow (result becomes 0 due to limited floating-point precision).
Solution: Work in log space to convert products to sums:
$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y) = \arg\max_y \log\left(P(y) \prod_{i=1}^{n} P(x_i \mid y)\right) = \arg\max_y \left[\log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y)\right]$

Implementation: Instead of multiplying probabilities, add log probabilities:
score = prior_prob * prob_x1 * prob_x2 * ... * prob_xn                           # product of many small numbers: risks underflow
log_score = log(prior_prob) + log(prob_x1) + log(prob_x2) + ... + log(prob_xn)   # same ranking of classes, numerically stable
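A log-space version of the earlier prediction sketch (assuming the smoothed, strictly positive probabilities from the previous section, so no $\log(0)$ occurs):

```python
import numpy as np

def predict_log_naive_bayes(x, priors, cond_probs):
    """Predict by summing log probabilities instead of multiplying probabilities."""
    log_scores = {}
    for k, prior in priors.items():
        p = cond_probs[k]                                       # P(x_i = 1 | y = k), assumed > 0 and < 1
        log_likelihood = np.sum(np.where(x == 1, np.log(p), np.log(1 - p)))
        log_scores[k] = np.log(prior) + log_likelihood          # log P(y=k) + sum_i log P(x_i | y=k)
    return max(log_scores, key=log_scores.get)                  # argmax is preserved under log
```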
Why it works:
- $\log(ab) = \log(a) + \log(b)$
- log is monotonically increasing, so argmax is preserved
- Sums are more numerically stable than products of small numbers
Key Properties
- Generative Model: Models $P(x \mid y)$ and $P(y)$, then uses Bayes' rule
- Fast Training: Only requires counting and estimating probabilities
- Fast Prediction: Linear in number of features
- Works Well with Small Data: Simple probability estimates can be reliable with few examples
- Feature Independence Assumption: Often violated in practice, but works surprisingly well for many tasks (especially text classification)