Generalized Linear Models

Generalized Linear Models (GLMs) extend linear regression to handle different response types (binary, count, etc.) by using exponential family distributions.

Three Components of a GLM

  1. Exponential Family Distribution: $Y|x;\theta \sim \text{ExpFamily}(\eta)$
  2. Linear Predictor: The natural parameter $\eta$ and the inputs $x$ are related linearly: $\eta = \theta^T x$
  3. Response Function: $\mu = \mathbb{E}[Y|x] = g^{-1}(\eta)$, where $g^{-1}$ is the canonical response function
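
To make the three components concrete, here is a minimal sketch using Poisson regression as the example (the parameter and input values are made up for illustration):

```python
import numpy as np

# Made-up parameters and input, for illustration only.
theta = np.array([0.5, -0.2, 0.1])
x = np.array([1.0, 2.0, 3.0])

eta = theta @ x        # 2. linear predictor: eta = theta^T x
mu = np.exp(eta)       # 3. response function: mu = g^{-1}(eta) = e^eta for Poisson

# 1. the response is then modeled as Y | x ~ Poisson(mu)
y_sample = np.random.default_rng(0).poisson(mu)
```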

Exponential Family Form

A distribution belongs to the exponential family if it can be written as:

$$p(y;\eta) = b(y)\exp\left(\eta^T T(y) - a(\eta)\right)$$

where:

  • $\eta$ = natural parameter
  • $T(y)$ = sufficient statistic (usually $T(y) = y$)
  • $a(\eta)$ = log partition function
  • $b(y)$ = base measure
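
As a worked example of this definition (the standard manipulation, spelled out here for concreteness), the Bernoulli distribution with mean $p$ can be rewritten as

$$p(y;p) = p^y(1-p)^{1-y} = \exp\left(y \log\frac{p}{1-p} + \log(1-p)\right)$$

so $\eta = \log\frac{p}{1-p}$, $T(y) = y$, $a(\eta) = -\log(1-p) = \log(1+e^\eta)$, and $b(y) = 1$, matching the Bernoulli row of the table below.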

Key Properties:

$$\mathbb{E}[T(Y)] = \frac{\partial a(\eta)}{\partial \eta} \qquad \text{Var}(T(Y)) = \frac{\partial^2 a(\eta)}{\partial \eta^2}$$
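
These identities can be checked numerically; here is a small sketch for the Poisson case, where $a(\eta) = e^\eta$ and both the mean and the variance equal $\lambda = e^\eta$:

```python
import numpy as np

# Finite-difference check of E[T(Y)] = da/deta and Var(T(Y)) = d^2a/deta^2
# for the Poisson case, where a(eta) = exp(eta) and E[Y] = Var(Y) = exp(eta).
a = np.exp
eta, h = 0.7, 1e-4

mean_from_a = (a(eta + h) - a(eta - h)) / (2 * h)            # first derivative
var_from_a = (a(eta + h) - 2 * a(eta) + a(eta - h)) / h**2   # second derivative

lam = np.exp(eta)
assert np.isclose(mean_from_a, lam, rtol=1e-4)
assert np.isclose(var_from_a, lam, rtol=1e-4)
```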

Common GLMs

| Distribution | Mean | Variance | $\eta$ (natural parameter) | $a(\eta)$ | $g^{-1}(\eta)$ (response function) | $g(\mu)$ (link) |
|---|---|---|---|---|---|---|
| Gaussian | $\mu$ | $\sigma^2$ | $\mu$ | $\frac{\eta^2}{2}$ | $\eta$ | $\mu$ |
| Bernoulli | $p$ | $p(1-p)$ | $\log\frac{p}{1-p}$ | $\log(1+e^\eta)$ | $\frac{1}{1+e^{-\eta}}$ | $\log\frac{\mu}{1-\mu}$ |
| Poisson | $\lambda$ | $\lambda$ | $\log\lambda$ | $e^\eta$ | $e^\eta$ | $\log\mu$ |

Canonical Link: Using $g(\mu) = \eta$ makes optimization convex and gradients simpler.

Identity Link (Gaussian): For the Gaussian GLM the response function is the identity, $g^{-1}(\eta) = \eta$, so the natural parameter equals the mean ($\eta = \mu$) and the link is also the identity, $g(\mu) = \mu$. With the link and response functions identical, linear regression is particularly simple.
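
The response functions in the table are just three different inverse links applied to the same linear predictor; a small illustrative sketch (parameter values are made up):

```python
import numpy as np

# Canonical response functions g^{-1}(eta) from the table above.
response = {
    "gaussian": lambda eta: eta,                      # identity
    "bernoulli": lambda eta: 1 / (1 + np.exp(-eta)),  # sigmoid
    "poisson": lambda eta: np.exp(eta),               # exponential
}

theta = np.array([0.3, -0.1])   # made-up parameters, for illustration only
x = np.array([1.0, 2.0])
eta = theta @ x                 # same linear predictor in every case
predictions = {name: g_inv(eta) for name, g_inv in response.items()}
```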

Naive Bayes with Exponential Family

Key Result: When class-conditional distributions are from the exponential family, the posterior has a logistic form.

Setup: For binary classification with:

  • Bernoulli prior: $p(y) = \phi^y (1-\phi)^{1-y}$ where $y \in \{0,1\}$
  • Exponential family class-conditionals: $p(x|y=j;\eta_j) = b(x)\exp(\eta_j^T T(x) - a(\eta_j))$

Derivation: Using Bayes' rule:

$$p(y=1|x;\phi,\eta_0,\eta_1) = \frac{p(y=1;\phi)\,p(x|y=1;\eta_1)}{p(y=0;\phi)\,p(x|y=0;\eta_0) + p(y=1;\phi)\,p(x|y=1;\eta_1)}$$

Substituting the exponential family form and simplifying (dividing numerator and denominator by the numerator; the base measure $b(x)$ cancels):

$$p(y=1|x;\phi,\eta_0,\eta_1) = \frac{1}{1 + \exp\left(\log\frac{1-\phi}{\phi} + (\eta_0 - \eta_1)^T T(x) + a(\eta_1) - a(\eta_0)\right)}$$

This has the form $\sigma(\tilde{\eta}^T T(x) + c)$, where $\sigma(t) = \frac{1}{1+\exp(-t)}$ is the sigmoid function, with:

$$\tilde{\eta} = \eta_1 - \eta_0$$
$$c = a(\eta_0) - a(\eta_1) - \log\frac{1-\phi}{\phi}$$

Interpretation: This shows that Naive Bayes with exponential family distributions produces the same decision boundary as logistic regression, though the parameters are estimated differently (generatively vs. discriminatively).
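
As a numerical sanity check of this equivalence, here is a sketch using 1-D Gaussian class-conditionals with a shared, known variance (for which $\eta_j = \mu_j/\sigma^2$, $T(x) = x$, and $a(\eta_j) = \mu_j^2/(2\sigma^2)$); all parameter values below are made up for illustration:

```python
import numpy as np

# Made-up prior and 1-D Gaussian class-conditionals with shared variance.
phi, mu0, mu1, sigma2 = 0.3, -1.0, 2.0, 1.5

def gauss_pdf(x, mu):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-5, 5, 11)

# Posterior computed directly from Bayes' rule.
num = phi * gauss_pdf(x, mu1)
posterior_bayes = num / ((1 - phi) * gauss_pdf(x, mu0) + num)

# Same posterior from the sigmoid form sigma(eta_tilde * T(x) + c).
eta_tilde = (mu1 - mu0) / sigma2                                 # eta_1 - eta_0
c = (mu0**2 - mu1**2) / (2 * sigma2) - np.log((1 - phi) / phi)   # a(eta_0) - a(eta_1) - log((1-phi)/phi)
posterior_sigmoid = 1 / (1 + np.exp(-(eta_tilde * x + c)))

assert np.allclose(posterior_bayes, posterior_sigmoid)
```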

info

The decision boundary is the set $\{x : p(y=1|x;\phi,\eta_0,\eta_1) = \frac{1}{2}\}$. Based on the previous part, this is the same as $\tilde{\eta}^T T(x) + c = 0$. For this to be linear in $x$, $T$ must be affine in $x$, i.e. $T(x) = Ax + v$ for some matrix $A$ and vector $v$.

Optimization

Log-Likelihood: Given training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m$, the log-likelihood is:

$$\ell(\theta) = \sum_{i=1}^m \log p(y^{(i)}|\mathbf{x}^{(i)};\theta) = \sum_{i=1}^m \left[ \eta^{(i)T} T(y^{(i)}) - a(\eta^{(i)}) + \log b(y^{(i)}) \right]$$

where $\eta^{(i)} = \theta^T \mathbf{x}^{(i)}$.

Grouped Form: For classification, grouping terms by class $j$, where $S_j = \{i : y^{(i)} = j\}$ and $n_j = |S_j|$, the $\eta_j$-dependent terms are:

$$\ell_j(\eta_j) \propto \sum_{i \in S_j} \eta_j^{\top} T(\mathbf{x}^{(i)}) - n_j\, a(\eta_j)$$
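
Setting the gradient of $\ell_j$ to zero and using the identity $\mathbb{E}[T(X)] = \nabla a(\eta)$ from above gives the maximum-likelihood condition (a direct consequence of the grouped form, written out here for completeness):

$$\nabla a(\hat\eta_j) = \frac{1}{n_j}\sum_{i \in S_j} T(\mathbf{x}^{(i)})$$

i.e., the fitted class-conditional expectation of $T(X)$ matches the empirical class average (moment matching).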

Gradient: For a GLM with canonical link, the gradient has the form:

$$\nabla_\theta \ell(\theta) = \sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)})\right) x^{(i)}$$

where $h_\theta(x) = \mathbb{E}[Y|x;\theta] = g^{-1}(\theta^T x)$ is the hypothesis function.

Key fact: With the canonical link, the negative log-likelihood is convex, so gradient descent converges to the global optimum.
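
A minimal sketch of maximizing the log-likelihood with this gradient, using the Bernoulli GLM (logistic regression) on synthetic data; the learning rate, iteration count, and data-generating parameters below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (made-up generating parameters).
m, d = 200, 3
X = rng.normal(size=(m, d))
true_theta = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_theta)))

# Gradient ascent on the log-likelihood for the Bernoulli GLM (canonical link).
theta = np.zeros(d)
lr = 0.1
for _ in range(1000):
    h = 1 / (1 + np.exp(-X @ theta))   # h_theta(x) = g^{-1}(theta^T x) = sigmoid
    grad = X.T @ (y - h)               # sum_i (y^(i) - h_theta(x^(i))) x^(i)
    theta += lr * grad / m             # averaged gradient for a stable step size
```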