Naive Bayes

Naive Bayes is a generative learning algorithm based on Bayes' Theorem with the "naive" assumption that features are conditionally independent given the class label.

Algorithm Overview

For classification, Naive Bayes predicts the class that maximizes the posterior probability:

$$\hat{y} = \arg\max_{y} P(y|\mathbf{x}) = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i|y)$$

where:

  • $P(y)$ = prior probability of class $y$
  • $P(x_i|y)$ = probability of feature $x_i$ given class $y$
  • The "naive" assumption: $P(\mathbf{x}|y) = \prod_{i=1}^{n} P(x_i|y)$ (features are independent given the class)

Training Procedure

Given training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m$:

  1. Estimate Prior Probabilities:

    $$P(y = k) = \frac{\text{count}(y^{(i)} = k)}{m}$$

    For binary classification: $P(y=1) = \frac{\text{number of class 1 examples}}{m}$

  2. Estimate Class-Conditional Probabilities:

    For discrete features (e.g., binary, categorical):

    $$P(x_i = v | y = k) = \frac{\text{count}(x_i = v \text{ and } y = k)}{\text{count}(y = k)}$$

    For continuous features, typically assume a Gaussian distribution:

    $$P(x_i | y = k) = \mathcal{N}(\mu_{ik}, \sigma_{ik}^2)$$

    where $\mu_{ik}$ and $\sigma_{ik}^2$ are the mean and variance of feature $i$ for class $k$.
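
These estimates reduce to counting. Below is a minimal sketch for the binary-feature case, assuming a NumPy matrix `X` of 0/1 features and a label vector `y`; the function name and structure are illustrative rather than a reference implementation, and no smoothing is applied yet (see the Laplace smoothing section below).

```python
import numpy as np

def fit_binary_naive_bayes(X, y):
    """Estimate P(y = k) and P(x_i = 1 | y = k) by counting.

    X: (m, n) array of binary features; y: (m,) array of class labels.
    Implements the maximum-likelihood formulas above (no smoothing).
    """
    m = len(y)
    priors = {}        # priors[k]          = P(y = k)
    conditionals = {}  # conditionals[k][i] = P(x_i = 1 | y = k)
    for k in np.unique(y):
        X_k = X[y == k]                     # training examples of class k
        priors[k] = len(X_k) / m            # count(y = k) / m
        conditionals[k] = X_k.mean(axis=0)  # count(x_i = 1 and y = k) / count(y = k)
    return priors, conditionals
```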

Prediction Procedure

For a new example $\mathbf{x} = (x_1, x_2, \ldots, x_n)$:

  1. Compute Unnormalized Scores for each class $k$:

    $$\text{score}(k) = P(y = k) \prod_{i=1}^{n} P(x_i | y = k)$$
  2. Predict the Class with highest score:

    $$\hat{y} = \arg\max_{k} \text{score}(k)$$
  3. Optional: Normalize to Get Probabilities:

    $$P(y = k | \mathbf{x}) = \frac{\text{score}(k)}{\sum_{j} \text{score}(j)}$$
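
Continuing the training sketch above (the `priors` and `conditionals` names are the illustrative ones introduced there), the three steps look roughly like this:

```python
import numpy as np

def predict_binary_naive_bayes(x, priors, conditionals):
    """Score each class for one binary feature vector x, then normalize."""
    scores = {}
    for k, prior in priors.items():
        p = conditionals[k]                         # P(x_i = 1 | y = k) for each i
        likelihood = np.prod(np.where(x == 1, p, 1.0 - p))
        scores[k] = prior * likelihood              # unnormalized score(k)
    total = sum(scores.values())
    posteriors = {k: s / total for k, s in scores.items()}
    prediction = max(scores, key=scores.get)        # argmax_k score(k)
    return prediction, posteriors
```

In practice the log-space version described below is preferred, since the raw product can underflow.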

Example Calculation:

Given $\mathbf{x} = (1, 1, 1, 0, 1, 0, 0, 1, 0)$ and classes $\{1, 7\}$:

  • Prior: $P(y=1) = 4/9$, $P(y=7) = 5/9$
  • For class $y=1$: $\prod_{i=1}^9 P(X_i|y=1) = (1/4) \cdot (2/4) \cdot (3/4) \cdot (3/4) \cdot (1/4) \cdot (1/4) \cdot (3/4) \cdot (1/4) \cdot (2/4) = 108/4^9$
  • Unnormalized: $P(y=1) \prod P(X_i|y=1) = (4/9) \cdot (108/4^9) \approx 0.000183$
  • Unnormalized: $P(y=7) \prod P(X_i|y=7) = (5/9) \cdot (20736/5^9) \approx 0.005898$
  • Normalized: $P(y=1|\mathbf{x}) \approx 0.03$, $P(y=7|\mathbf{x}) \approx 0.97$
  • Prediction: $y = 7$ (highest probability)
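
The arithmetic above can be reproduced directly; the per-feature probabilities for $y=1$ and the product $20736/5^9$ for $y=7$ are taken as given:

```python
from math import prod

prior_1, prior_7 = 4 / 9, 5 / 9
lik_1 = prod([1/4, 2/4, 3/4, 3/4, 1/4, 1/4, 3/4, 1/4, 2/4])  # = 108 / 4**9
lik_7 = 20736 / 5**9                                          # given for class 7

score_1 = prior_1 * lik_1                 # ~0.000183
score_7 = prior_7 * lik_7                 # ~0.005898
total = score_1 + score_7
print(score_1 / total, score_7 / total)   # ~0.03, ~0.97 -> predict y = 7
```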

Laplace Smoothing (Add-$\alpha$ Smoothing)

Problem: If a feature value $x_i = v$ never occurs with class $y = k$ in training, then $P(x_i = v | y = k) = 0$, causing the entire product to be zero.

Solution: Add pseudo-counts to avoid zero probabilities:

$$P(x_i = v | y = k) = \frac{\text{count}(x_i = v \text{ and } y = k) + \alpha}{\text{count}(y = k) + \alpha \cdot |\text{values of } x_i|}$$

where:

  • $\alpha > 0$ is the smoothing parameter (typically $\alpha = 1$ for Laplace smoothing)
  • $|\text{values of } x_i|$ is the number of possible values feature $x_i$ can take

Special Case - Binary Features:

$$P(x_i = 1 | y = k) = \frac{\text{count}(x_i = 1 \text{ and } y = k) + \alpha}{\text{count}(y = k) + 2\alpha}$$

With $\alpha = 1$ (Laplace smoothing), this ensures every probability is at least $\frac{1}{\text{count}(y = k) + 2}$, preventing zeros.
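
In code, this is a small change to the counting step of the earlier training sketch (the function name and `alpha` default are illustrative):

```python
import numpy as np

def smoothed_conditionals(X_k, alpha=1.0):
    """P(x_i = 1 | y = k) with add-alpha smoothing for binary features.

    X_k: (m_k, n) array of the binary feature vectors belonging to class k.
    """
    count_k = X_k.shape[0]                                   # count(y = k)
    return (X_k.sum(axis=0) + alpha) / (count_k + 2 * alpha)
```

For example, with $\alpha = 1$ and $\text{count}(y = k) = 4$, a feature value never seen with class $k$ gets probability $1/6$ instead of $0$.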

Numerical Stability: Log Probabilities

Problem: Multiplying many small probabilities (each $< 1$) causes numerical underflow (the result becomes 0 due to limited floating-point precision).

Solution: Work in log space to convert products to sums:

$$\begin{aligned}
\hat{y} &= \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i|y) \\
&= \arg\max_{y} \log\left(P(y) \prod_{i=1}^{n} P(x_i|y)\right) \\
&= \arg\max_{y} \left[\log P(y) + \sum_{i=1}^{n} \log P(x_i|y)\right]
\end{aligned}$$

Implementation: Instead of multiplying probabilities, add log probabilities:

import math

# assumes prior_prob and feature_probs (per-feature P(x_i | y)) are given

# Instead of (risk of underflow):
#   score = prior_prob * math.prod(feature_probs)

# Use: add log probabilities
log_score = math.log(prior_prob) + sum(math.log(p) for p in feature_probs)

Why it works:

  • $\log(ab) = \log(a) + \log(b)$
  • $\log$ is monotonically increasing, so the $\arg\max$ is preserved
  • Sums are more numerically stable than products of small numbers
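
A quick illustration (400 features with probability 0.01 each, a made-up example):

```python
import math

probs = [0.01] * 400                       # 400 small per-feature probabilities

product = math.prod(probs)                 # 1e-800 is below the float64 range
print(product)                             # 0.0 (underflow)

log_sum = sum(math.log(p) for p in probs)  # perfectly representable
print(log_sum)                             # -1842.07... (= 400 * log(0.01))
```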

Key Properties

  • Generative Model: Models $P(\mathbf{x}|y)$ and $P(y)$, then uses Bayes' rule
  • Fast Training: Only requires counting and estimating probabilities
  • Fast Prediction: Linear in number of features
  • Works Well with Small Data: Simple probability estimates can be reliable with few examples
  • Feature Independence Assumption: Often violated in practice, but works surprisingly well for many tasks (especially text classification)
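
In practice, library implementations handle the smoothing and log-space details; for example, scikit-learn's BernoulliNB (shown here on a tiny made-up dataset) implements the binary-feature case described above:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up binary dataset: 4 examples, 3 features, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])

clf = BernoulliNB(alpha=1.0)     # alpha=1.0 corresponds to Laplace smoothing
clf.fit(X, y)

x_new = np.array([[1, 0, 0]])
print(clf.predict(x_new))        # predicted class label
print(clf.predict_proba(x_new))  # normalized posterior probabilities
```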