Generalized Linear Models
Generalized Linear Models (GLMs) extend linear regression to handle different response types (binary, count, etc.) by using exponential family distributions.
Three Components of a GLM
- Exponential Family Distribution: $Y \mid x; \theta \sim \text{ExpFamily}(\eta)$
- Linear Predictor: The natural parameter $\eta$ is linear in the inputs $x$: $\eta = \theta^T x$
- Response Function: $\mu = \mathbb{E}[Y \mid x] = g^{-1}(\eta)$, where $g^{-1}$ is the canonical response function
A distribution belongs to the exponential family if it can be written as:
$$p(y; \eta) = b(y)\exp\left(\eta^T T(y) - a(\eta)\right)$$
where:
- η = natural parameter
- T(y) = sufficient statistic (usually T(y)=y)
- a(η) = log partition function
- b(y) = base measure
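For example, the Bernoulli distribution with mean $p$ can be put in this form (a standard manipulation, included here as a worked check):
$$p^y (1-p)^{1-y} = \exp\!\left( y \log\frac{p}{1-p} + \log(1-p) \right)$$
so $\eta = \log\frac{p}{1-p}$, $T(y) = y$, $a(\eta) = -\log(1-p) = \log(1+e^\eta)$, and $b(y) = 1$.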
Key Properties:
$$\mathbb{E}[T(Y)] = \frac{\partial a(\eta)}{\partial \eta}, \qquad \mathrm{Var}(T(Y)) = \frac{\partial^2 a(\eta)}{\partial \eta^2}$$
Common GLMs
| Distribution | Mean | Variance | $\eta$ (natural parameter) | $a(\eta)$ | $g^{-1}(\eta)$ (response function) | $g(\mu)$ (link) |
|---|---|---|---|---|---|---|
| Gaussian | $\mu$ | $\sigma^2$ | $\mu$ | $\eta^2/2$ | $\eta$ | $\mu$ |
| Bernoulli | $p$ | $p(1-p)$ | $\log\frac{p}{1-p}$ | $\log(1+e^\eta)$ | $\frac{1}{1+e^{-\eta}}$ | $\log\frac{\mu}{1-\mu}$ |
| Poisson | $\lambda$ | $\lambda$ | $\log\lambda$ | $e^\eta$ | $e^\eta$ | $\log\mu$ |
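The mean and variance properties above can be checked numerically against the Bernoulli row of this table, using finite differences of $a(\eta) = \log(1+e^\eta)$. A small sketch (the value of $\eta$ is arbitrary):

```python
import numpy as np

def a(eta):
    """Log partition function of the Bernoulli GLM."""
    return np.log1p(np.exp(eta))

eta, h = 0.7, 1e-4
p = 1.0 / (1.0 + np.exp(-eta))          # mean from the response function

# E[T(Y)] = da/deta  and  Var(T(Y)) = d^2 a/deta^2, via central differences.
mean_fd = (a(eta + h) - a(eta - h)) / (2 * h)
var_fd = (a(eta + h) - 2 * a(eta) + a(eta - h)) / h**2

print(mean_fd, p)            # both ~= sigmoid(0.7)
print(var_fd, p * (1 - p))   # both ~= p(1 - p)
```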
Canonical Link: Using the canonical link $g(\mu) = \eta$, so that the natural parameter itself is the linear predictor $\theta^T x$, makes the negative log-likelihood convex in $\theta$ and keeps the gradients simple.
Identity Link (Gaussian): For the Gaussian GLM, $g^{-1}(\eta) = \eta$, so $\mu = \eta$ and $g(\mu) = \mu$. The link and response functions are both the identity, which is what makes linear regression particularly simple.
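As a quick sanity check of this claim (assuming unit variance for simplicity), substituting the Gaussian row of the table ($T(y) = y$, $a(\eta) = \eta^2/2$, $\eta = \theta^T x$) into $\log p(y; \eta)$ and summing over a dataset gives, up to constants independent of $\theta$,
$$\ell(\theta) = \sum_{i=1}^m \left[ y^{(i)}\,\theta^T x^{(i)} - \tfrac{1}{2}\left(\theta^T x^{(i)}\right)^2 \right] + \text{const} = -\tfrac{1}{2}\sum_{i=1}^m \left( y^{(i)} - \theta^T x^{(i)} \right)^2 + \text{const},$$
so maximizing the Gaussian GLM log-likelihood is exactly minimizing the least-squares cost.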
Naive Bayes with Exponential Family
Key Result: When class-conditional distributions are from the exponential family, the posterior has a logistic form.
Setup: For binary classification with:
- Bernoulli prior: $p(y) = \phi^y (1-\phi)^{1-y}$ where $y \in \{0, 1\}$
- Exponential family class-conditionals: $p(x \mid y=j; \eta_j) = b(x)\exp\left(\eta_j^T T(x) - a(\eta_j)\right)$
Derivation: Using Bayes' rule:
$$p(y=1 \mid x; \phi, \eta_0, \eta_1) = \frac{p(y=1;\phi)\, p(x \mid y=1; \eta_1)}{p(y=0;\phi)\, p(x \mid y=0; \eta_0) + p(y=1;\phi)\, p(x \mid y=1; \eta_1)}$$
Substituting the exponential family form and dividing numerator and denominator by the numerator (the $b(x)$ factors cancel):
$$p(y=1 \mid x; \phi, \eta_0, \eta_1) = \frac{1}{1 + \exp\left(\log\frac{1-\phi}{\phi} + (\eta_0 - \eta_1)^T T(x) + a(\eta_1) - a(\eta_0)\right)}$$
This has the form $\sigma(\tilde\eta^T T(x) + c)$, where $\sigma(t) = \frac{1}{1+\exp(-t)}$ is the sigmoid function, with:
$$\tilde\eta = \eta_1 - \eta_0, \qquad c = a(\eta_0) - a(\eta_1) - \log\frac{1-\phi}{\phi}$$
Interpretation: This shows that Naive Bayes with exponential family class-conditionals yields a posterior of the same logistic form as logistic regression, though the parameters are estimated differently (generatively vs. discriminatively).
The decision boundary is the set where $p(y=1 \mid x; \phi, \eta_0, \eta_1) = \frac{1}{2}$. By the previous part, this is the same as $\tilde\eta^T T(x) + c = 0$. For this to be linear in $x$, $T$ must be affine in $x$, i.e. $T(x) = Ax + v$ for some matrix $A$ and vector $v$.
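A small numerical sketch of this result, using independent Bernoulli class-conditionals (so $T(x) = x$, $\eta_{j,k} = \log\frac{p_{j,k}}{1-p_{j,k}}$, and $a(\eta_j) = \sum_k \log(1+e^{\eta_{j,k}})$); the parameter values below are made up purely for illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Made-up Naive Bayes parameters for d = 3 binary features.
phi = 0.4                          # prior p(y = 1)
p0 = np.array([0.2, 0.5, 0.7])     # p(x_k = 1 | y = 0)
p1 = np.array([0.6, 0.3, 0.9])     # p(x_k = 1 | y = 1)

# Natural parameters and log partition functions of the class-conditionals.
eta0, eta1 = np.log(p0 / (1 - p0)), np.log(p1 / (1 - p1))
a0, a1 = np.sum(np.log1p(np.exp(eta0))), np.sum(np.log1p(np.exp(eta1)))

x = np.array([1, 0, 1])

# Posterior via Bayes' rule on the product-of-Bernoullis likelihoods.
lik0 = np.prod(p0**x * (1 - p0)**(1 - x))
lik1 = np.prod(p1**x * (1 - p1)**(1 - x))
posterior_bayes = phi * lik1 / ((1 - phi) * lik0 + phi * lik1)

# Posterior via the sigmoid form sigma(eta_tilde^T T(x) + c).
eta_tilde = eta1 - eta0
c = a0 - a1 - np.log((1 - phi) / phi)
posterior_sigmoid = sigmoid(eta_tilde @ x + c)

print(posterior_bayes, posterior_sigmoid)  # the two values agree
```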
Optimization
Log-Likelihood: Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$, the log-likelihood is:
$$\ell(\theta) = \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^m \left[ \eta^{(i)\,T}\, T(y^{(i)}) - a(\eta^{(i)}) + \log b(y^{(i)}) \right]$$
where $\eta^{(i)} = \theta^T x^{(i)}$.
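As a sketch of this formula, here is the Poisson GLM ($T(y) = y$, $a(\eta) = e^\eta$, $\log b(y) = -\log y!$) evaluated both through the exponential family form above and through a library Poisson log-pmf; the design matrix, counts, and parameter vector are made-up values for illustration:

```python
import numpy as np
from scipy.special import gammaln   # gammaln(y + 1) = log(y!)
from scipy.stats import poisson

# Made-up design matrix, count responses, and parameter vector.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([2, 0, 7])
theta = np.array([0.3, 0.8])

eta = X @ theta                     # natural parameters eta^(i) = theta^T x^(i)

# Exponential family form: sum_i [eta^(i) y^(i) - a(eta^(i)) + log b(y^(i))]
ll_expfam = np.sum(eta * y - np.exp(eta) - gammaln(y + 1))

# Direct Poisson log pmf with rate lambda^(i) = exp(eta^(i)).
ll_direct = poisson.logpmf(y, mu=np.exp(eta)).sum()

print(ll_expfam, ll_direct)         # identical values
```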
Grouped Form: When fitting the exponential family class-conditionals of the classifier above, grouping terms by class $j$, with $S_j = \{i : y^{(i)} = j\}$ and $n_j = |S_j|$, the $\eta_j$-dependent terms are:
$$\ell_j(\eta_j) \propto \sum_{i \in S_j} \eta_j^T T(x^{(i)}) - n_j\, a(\eta_j)$$
Gradient: For a GLM with canonical link, the gradient has the form:
$$\nabla_\theta\, \ell(\theta) = \sum_{i=1}^m \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$$
where $h_\theta(x) = \mathbb{E}[Y \mid x; \theta] = g^{-1}(\theta^T x)$ is the hypothesis function.
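A minimal sketch of gradient ascent using this unified gradient, instantiated for logistic regression (Bernoulli GLM, so $h_\theta(x) = 1/(1+e^{-\theta^T x})$); the synthetic data, learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def fit_glm(X, y, response_fn, lr=0.1, n_iters=1000):
    """Gradient ascent on a canonical-link GLM log-likelihood.

    The gradient has the same form for every canonical-link GLM:
        grad = sum_i (y_i - h_theta(x_i)) x_i
    Only the response function g^{-1} changes between models.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = response_fn(X @ theta)            # h_theta(x^(i)) = g^{-1}(theta^T x^(i))
        theta += lr * X.T @ (y - h) / len(y)  # averaged over examples for a stable step
    return theta

# Logistic regression: Bernoulli GLM, g^{-1}(eta) = sigmoid(eta).
sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))

# Synthetic binary classification data (for illustration only).
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]   # intercept + 2 features
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

theta_hat = fit_glm(X, y, sigmoid)
print(theta_hat)   # roughly recovers true_theta
```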
Key fact: With the canonical link, the negative log-likelihood is convex in $\theta$, so gradient descent converges to the global optimum.