Naive Bayes

Naive Bayes is a generative learning algorithm based on Bayes' Theorem with the "naive" assumption that features are conditionally independent given the class label.

Algorithm Overview

For classification, Naive Bayes predicts the class that maximizes the posterior probability:

$$\hat{y} = \arg\max_{y} P(y|\mathbf{x}) = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i|y)$$

where:

  • $P(y)$ = prior probability of class $y$
  • $P(x_i|y)$ = probability of feature $x_i$ given class $y$
  • The "naive" assumption: $P(\mathbf{x}|y) = \prod_{i=1}^{n} P(x_i|y)$ (features are independent given the class)

Training Procedure

Given training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m$:

  1. Estimate Prior Probabilities:

    $$P(y = k) = \frac{\text{count}(y^{(i)} = k)}{m}$$

    For binary classification: $P(y=1) = \frac{\text{number of class 1 examples}}{m}$

  2. Estimate Class-Conditional Probabilities:

    For discrete features (e.g., binary, categorical):

    $$P(x_i = v | y = k) = \frac{\text{count}(x_i = v \text{ and } y = k)}{\text{count}(y = k)}$$

    For continuous features, typically assume a Gaussian distribution:

    $$P(x_i | y = k) = \mathcal{N}(\mu_{ik}, \sigma_{ik}^2)$$

    where $\mu_{ik}$ and $\sigma_{ik}^2$ are the mean and variance of feature $i$ for class $k$.
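
These estimates reduce to counting. Below is a minimal sketch for the binary-feature case, assuming a NumPy matrix `X` of 0/1 features and a label vector `y`; the function name and structure are illustrative rather than a reference implementation, and no smoothing is applied yet (see the Laplace smoothing section below).

```python
import numpy as np

def fit_binary_naive_bayes(X, y):
    """Estimate P(y = k) and P(x_i = 1 | y = k) by counting.

    X: (m, n) array of binary features; y: (m,) array of class labels.
    Implements the maximum-likelihood formulas above (no smoothing).
    """
    m = len(y)
    priors = {}        # priors[k]          = P(y = k)
    conditionals = {}  # conditionals[k][i] = P(x_i = 1 | y = k)
    for k in np.unique(y):
        X_k = X[y == k]                     # training examples of class k
        priors[k] = len(X_k) / m            # count(y = k) / m
        conditionals[k] = X_k.mean(axis=0)  # count(x_i = 1 and y = k) / count(y = k)
    return priors, conditionals
```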

Prediction Procedure

For a new example $\mathbf{x} = (x_1, x_2, \ldots, x_n)$:

  1. Compute Unnormalized Scores for each class $k$:

    $$\text{score}(k) = P(y = k) \prod_{i=1}^{n} P(x_i | y = k)$$
  2. Predict the Class with highest score:

    $$\hat{y} = \arg\max_{k} \text{score}(k)$$
  3. Optional: Normalize to Get Probabilities:

    $$P(y = k | \mathbf{x}) = \frac{\text{score}(k)}{\sum_{j} \text{score}(j)}$$
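
Continuing the training sketch above (the `priors` and `conditionals` names are the illustrative ones introduced there), the three steps look roughly like this:

```python
import numpy as np

def predict_binary_naive_bayes(x, priors, conditionals):
    """Score each class for one binary feature vector x, then normalize."""
    scores = {}
    for k, prior in priors.items():
        p = conditionals[k]                         # P(x_i = 1 | y = k) for each i
        likelihood = np.prod(np.where(x == 1, p, 1.0 - p))
        scores[k] = prior * likelihood              # unnormalized score(k)
    total = sum(scores.values())
    posteriors = {k: s / total for k, s in scores.items()}
    prediction = max(scores, key=scores.get)        # argmax_k score(k)
    return prediction, posteriors
```

In practice the log-space version described below is preferred, since the raw product can underflow.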

Example Calculation:

Given $\mathbf{x} = (1, 1, 1, 0, 1, 0, 0, 1, 0)$ and classes $\{1, 7\}$:

  • Prior: $P(y=1) = 4/9$, $P(y=7) = 5/9$
  • For class $y=1$: $\prod_{i=1}^9 P(X_i|y=1) = (1/4) \cdot (2/4) \cdot (3/4) \cdot (3/4) \cdot (1/4) \cdot (1/4) \cdot (3/4) \cdot (1/4) \cdot (2/4) = 108/4^9$
  • Unnormalized: $P(y=1) \prod P(X_i|y=1) = (4/9) \cdot (108/4^9) \approx 0.000183$
  • Unnormalized: $P(y=7) \prod P(X_i|y=7) = (5/9) \cdot (20736/5^9) \approx 0.005898$
  • Normalized: $P(y=1|\mathbf{x}) \approx 0.03$, $P(y=7|\mathbf{x}) \approx 0.97$
  • Prediction: $y = 7$ (highest probability)
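
The arithmetic above can be reproduced directly; the per-feature probabilities for $y=1$ and the product $20736/5^9$ for $y=7$ are taken as given:

```python
from math import prod

prior_1, prior_7 = 4 / 9, 5 / 9
lik_1 = prod([1/4, 2/4, 3/4, 3/4, 1/4, 1/4, 3/4, 1/4, 2/4])  # = 108 / 4**9
lik_7 = 20736 / 5**9                                          # given for class 7

score_1 = prior_1 * lik_1                 # ~0.000183
score_7 = prior_7 * lik_7                 # ~0.005898
total = score_1 + score_7
print(score_1 / total, score_7 / total)   # ~0.03, ~0.97 -> predict y = 7
```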

Laplace Smoothing (Add-$\alpha$ Smoothing)

Problem: If a feature value $x_i = v$ never occurs with class $y = k$ in training, then $P(x_i = v | y = k) = 0$, causing the entire product to be zero.

Solution: Add pseudo-counts to avoid zero probabilities:

$$P(x_i = v | y = k) = \frac{\text{count}(x_i = v \text{ and } y = k) + \alpha}{\text{count}(y = k) + \alpha \cdot |\text{values of } x_i|}$$

where:

  • $\alpha > 0$ is the smoothing parameter (typically $\alpha = 1$ for Laplace smoothing)
  • $|\text{values of } x_i|$ is the number of possible values feature $x_i$ can take

Special Case - Binary Features:

$$P(x_i = 1 | y = k) = \frac{\text{count}(x_i = 1 \text{ and } y = k) + \alpha}{\text{count}(y = k) + 2\alpha}$$

With $\alpha = 1$ (Laplace smoothing), this ensures every probability is at least $\frac{1}{\text{count}(y = k) + 2}$, preventing zeros.
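
In code, this is a small change to the counting step of the earlier training sketch (the function name and `alpha` default are illustrative):

```python
import numpy as np

def smoothed_conditionals(X_k, alpha=1.0):
    """P(x_i = 1 | y = k) with add-alpha smoothing for binary features.

    X_k: (m_k, n) array of the binary feature vectors belonging to class k.
    """
    count_k = X_k.shape[0]                                   # count(y = k)
    return (X_k.sum(axis=0) + alpha) / (count_k + 2 * alpha)
```

For example, with $\alpha = 1$ and $\text{count}(y = k) = 4$, a feature value never seen with class $k$ gets probability $1/6$ instead of $0$.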

Numerical Stability: Log Probabilities

Problem: Multiplying many small probabilities (each $< 1$) causes numerical underflow (the result becomes 0 due to limited floating-point precision).

Solution: Work in log space to convert products to sums:

$$\begin{aligned}
\hat{y} &= \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i|y) \\
&= \arg\max_{y} \log\left(P(y) \prod_{i=1}^{n} P(x_i|y)\right) \\
&= \arg\max_{y} \left[\log P(y) + \sum_{i=1}^{n} \log P(x_i|y)\right]
\end{aligned}$$

Implementation: Instead of multiplying probabilities, add log probabilities:

import math

# assumes prior_prob and feature_probs (per-feature P(x_i | y)) are given

# Instead of (risk of underflow):
#   score = prior_prob * math.prod(feature_probs)

# Use: add log probabilities
log_score = math.log(prior_prob) + sum(math.log(p) for p in feature_probs)

Why it works:

  • $\log(ab) = \log(a) + \log(b)$
  • $\log$ is monotonically increasing, so the $\arg\max$ is preserved
  • Sums are more numerically stable than products of small numbers
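
A quick illustration (400 features with probability 0.01 each, a made-up example):

```python
import math

probs = [0.01] * 400                       # 400 small per-feature probabilities

product = math.prod(probs)                 # 1e-800 is below the float64 range
print(product)                             # 0.0 (underflow)

log_sum = sum(math.log(p) for p in probs)  # perfectly representable
print(log_sum)                             # -1842.07... (= 400 * log(0.01))
```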

Key Properties

  • Generative Model: Models $P(\mathbf{x}|y)$ and $P(y)$, then uses Bayes' rule
  • Fast Training: Only requires counting and estimating probabilities
  • Fast Prediction: Linear in number of features
  • Works Well with Small Data: Simple probability estimates can be reliable with few examples
  • Feature Independence Assumption: Often violated in practice, but works surprisingly well for many tasks (especially text classification)
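
In practice, library implementations handle the smoothing and log-space details; for example, scikit-learn's BernoulliNB (shown here on a tiny made-up dataset) implements the binary-feature case described above:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up binary dataset: 4 examples, 3 features, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])

clf = BernoulliNB(alpha=1.0)     # alpha=1.0 corresponds to Laplace smoothing
clf.fit(X, y)

x_new = np.array([[1, 0, 0]])
print(clf.predict(x_new))        # predicted class label
print(clf.predict_proba(x_new))  # normalized posterior probabilities
```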