Generalized Linear Models

Learning Outcomes

  • Exponential Family of Distributions

  • Generalized Linear Models

  • R Code

Exponential Family of Distributions

Exponential Family of Distributions

A distribution belongs to the exponential family if its probability density (or mass) function can be written in the following form:

\[ f(x; \theta,\phi) = a(x,\phi)\exp\left\{\frac{x\theta-\kappa(\theta)}{\phi}\right\} \]

  • \(\theta\): the canonical parameter (a function of the distribution's other parameters)

  • \(\kappa(\theta)\): a known cumulant function

  • \(\phi>0\): the dispersion parameter

  • \(a(x,\phi)\): a normalizing term

Canonical Parameter

The canonical parameter \(\theta\) links the random variable to its expected value \(\mu=E(X)\); for each distribution it can be written as a function of \(\mu\) (see the table of common distributions below).

Normal Distribution

\[ f(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

\[ f(x;\mu,\sigma^2)= \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left\{\frac{x\mu-\mu^2/2}{\sigma^2}-\frac{x^2}{2\sigma^2}\right\} \]
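Matching this form to the general exponential family expression identifies the components for the Normal distribution:

\[ \theta = \mu, \qquad \kappa(\theta)=\frac{\theta^2}{2}, \qquad \phi=\sigma^2, \qquad a(x,\phi)=\frac{1}{\sqrt{2\pi\phi}}\exp\left\{-\frac{x^2}{2\phi}\right\} \]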

Binomial Distribution

\[ f(x;n,p) = \left(\begin{array}{c}n\\x\end{array}\right) p^x(1-p)^{n-x} \]

\[ f(x;n,p) = \left(\begin{array}{c}n\\x\end{array}\right) \exp\left\{x\log\left(\frac{p}{1-p}\right) + n \log(1-p)\right\} \]
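Matching terms in the same way identifies the Binomial components:

\[ \theta = \log\left(\frac{p}{1-p}\right), \qquad \kappa(\theta)=n\log\left(1+e^{\theta}\right), \qquad \phi=1, \qquad a(x,\phi)=\left(\begin{array}{c}n\\x\end{array}\right) \]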

Common Distributions and Canonical Parameters

Random Variable   | Canonical Parameter
Normal            | \(\mu\)
Binomial          | \(\log\left(\frac{\mu}{1-\mu}\right)\)
Negative Binomial | \(\log\left(\frac{\mu}{\mu+k}\right)\)
Poisson           | \(\log(\mu)\)
Gamma             | \(-\frac{1}{\mu}\)
Inverse Gaussian  | \(-\frac{1}{2\mu^2}\)

Generalized Linear Models

Generalized Linear Models

A generalized linear model (GLM) is used to model the association between an outcome variable (continuous, count, binary, or another type) and a set of predictors. We estimate a set of regression coefficients \(\boldsymbol \beta\) to describe how each predictor is related to the expected value of the outcome.

Generalized Linear Models

A GLM is composed of a systematic and random component.

Random Component

The random component is the probability distribution, a member of the exponential family, that describes the variation of the outcome variable.

Systematic Component

The systematic component is the linear model that models the association between a set of predictors and the expected value of Y:

\[ g(\mu)=\eta=\boldsymbol X_i^\mathrm T \boldsymbol \beta \]

  • \(\boldsymbol\beta\): regression coefficients

  • \(\boldsymbol X_i=(1, X_{i1}, \ldots, X_{ip})^\mathrm T\): design vector

  • \(\eta\): linear predictor

  • \(\mu=E(Y)\)

  • \(g(\cdot)\): link function
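For example, Poisson regression uses the canonical log link, so the mean is recovered from the linear predictor by inverting the link:

\[ \log(\mu_i)=\boldsymbol X_i^\mathrm T\boldsymbol\beta \quad\Longleftrightarrow\quad \mu_i=\exp\left(\boldsymbol X_i^\mathrm T\boldsymbol\beta\right) \]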

R Code

General R Code

glm(formula,
    data = data,
    family = family)

  • formula: the model formula relating the outcome to the predictors

  • data: the data frame containing the variables

  • family: the family and link function used to model the data
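As a minimal illustration (using the built-in mtcars data set, which is not part of the later examples), a Gaussian family with the identity link reproduces ordinary linear regression:

# illustrative sketch: Gaussian family with identity link is ordinary least squares
fit <- glm(mpg ~ wt,
           data = mtcars,
           family = gaussian(link = "identity"))
summary(fit)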

Poisson Regression

Poisson Regression is used when the outcome is count data:

glm(y~x,
    data, 
    family = poisson())

Gamma Regression

Gamma Regression is used when modeling the association between predictors and positive continuous values:

glm(y~x, 
    data, 
    family = Gamma())

Negative Binomial Regression

Negative Binomial Regression is used for overdispersed count data, where the variance is larger than a Poisson model assumes:

library(MASS)
glm.nb(y~x, 
       data)
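As an informal check for overdispersion (a sketch, assuming a Poisson model has already been fit to the same data), compare the residual deviance to its degrees of freedom; a ratio well above 1 suggests a Negative Binomial model may be more appropriate:

pois_fit <- glm(y ~ x, data, family = poisson())
# residual deviance divided by residual degrees of freedom;
# values much larger than 1 indicate overdispersion
deviance(pois_fit) / df.residual(pois_fit)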

Inverse Gaussian Regression

Inverse Gaussian Regression is used for positive continuous data that is more dispersed than a Gamma model can accommodate:

glm(y~x, 
    data, 
    family = inverse.gaussian())

Hypothesis Tests

Hypothesis Tests

Hypothesis tests are used to assess whether a claim about a parameter is supported by the data. This is done by stating a null and an alternative hypothesis, collecting data, and computing a test statistic.

Testing \(\beta_j\)

\(\phi\) known

\[ \frac{\hat\beta_j - \theta}{\mathrm{se}(\hat\beta_j)} \sim N(0,1) \]

\(\phi\) unknown

\[ \frac{\hat\beta_j-\theta}{\mathrm{se}(\hat\beta_j)} \sim t_{n-p^\prime} \]

Here \(\theta\) is the hypothesized value of \(\beta_j\) under the null hypothesis (typically \(0\)) and \(p^\prime\) is the number of estimated regression coefficients.
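In R, summary() reports these statistics directly, but they can be reproduced from the coefficient table; a minimal sketch, assuming a fitted GLM stored in fit and a null value of 0:

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
z   <- (est - 0) / se    # test statistic for H0: beta_j = 0
2 * pnorm(-abs(z))       # two-sided p-values (phi known / large-sample case)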

Confidence Intervals

\[ PE \pm CV \times SE \]

  • PE: Point Estimate

  • CV: Critical Value, satisfying \(P(X<CV) = 1-\alpha/2\), where \(X\) follows the null distribution of the test statistic

  • \(\alpha\): significance level

  • SE: Standard Error
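A minimal sketch of a Wald-type confidence interval, assuming a fitted GLM stored in fit and \(\alpha = 0.05\); confint.default() returns the same intervals:

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
cv  <- qnorm(1 - 0.05/2)   # critical value for alpha = 0.05
cbind(lower = est - cv * se,
      upper = est + cv * se)
confint.default(fit, level = 0.95)   # built-in Wald intervals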

Examples

Fitting Models

Use the survival, MASS, and GLMsData R packages to fit the models in the following examples.

library(GLMsData)
library(survival)
library(MASS)

Poisson Regression

Fit a model between the outcome recur, the number of tumor recurrences, and the predictors treatment, the treatment received (placebo, pyridoxine, or thiotepa), and number, the initial number of tumors, from the bladder1 data set.

bladder1 |>  glm(recur ~ treatment + number,
                 data = _,
                 family = poisson(link = "log")) |>  
  summary()
#> 
#> Call:
#> glm(formula = recur ~ treatment + number, family = poisson(link = "log"), 
#>     data = bladder1)
#> 
#> Coefficients:
#>                     Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)          1.00918    0.06057  16.661  < 2e-16 ***
#> treatmentpyridoxine  0.25506    0.06889   3.702 0.000214 ***
#> treatmentthiotepa   -0.45167    0.08626  -5.236 1.64e-07 ***
#> number               0.11603    0.01620   7.164 7.82e-13 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1)
#> 
#>     Null deviance: 868.47  on 293  degrees of freedom
#> Residual deviance: 772.19  on 290  degrees of freedom
#> AIC: 1529.5
#> 
#> Number of Fisher Scoring iterations: 5
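Because the model uses a log link, exponentiated coefficients are interpretable as rate ratios. A short sketch, refitting the model above and storing it as fit:

fit <- glm(recur ~ treatment + number,
           data = bladder1,
           family = poisson(link = "log"))
exp(coef(fit))   # e.g., exp(0.116) is about 1.12: each additional initial tumor
                 # multiplies the expected number of recurrences by about 1.12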

Negative Binomial Regression

Fit a model between the outcome Count, viral activity (pock counts), and the predictor Dilution, the dilution factor, from the pock data set.

data("pock")
pock |> glm.nb(Count ~ Dilution, 
               data = _)  |> 
  summary()
#> 
#> Call:
#> glm.nb(formula = Count ~ Dilution, data = pock, init.theta = 6.045172803, 
#>     link = log)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  5.05810    0.09622   52.57   <2e-16 ***
#> Dilution    -0.19204    0.01322  -14.53   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for Negative Binomial(6.0452) family taken to be 1)
#> 
#>     Null deviance: 262.687  on 47  degrees of freedom
#> Residual deviance:  48.891  on 46  degrees of freedom
#> AIC: 427.26
#> 
#> Number of Fisher Scoring iterations: 1
#> 
#> 
#>               Theta:  6.05 
#>           Std. Err.:  1.43 
#> 
#>  2 x log-likelihood:  -421.261

Gamma Regression

Fit a model between the outcome Foliage, foliage biomass, and the predictor DBH, tree diameter at breast height, from the lime data set.

data("lime")
lime |>  glm(Foliage ~ DBH,
             data = _,
             family = Gamma(link = "log"))  |>  
  summary() 
#> 
#> Call:
#> glm(formula = Foliage ~ DBH, family = Gamma(link = "log"), data = lime)
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -1.786052   0.085994  -20.77   <2e-16 ***
#> DBH          0.122388   0.004736   25.84   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for Gamma family taken to be 0.5432673)
#> 
#>     Null deviance: 508.48  on 384  degrees of freedom
#> Residual deviance: 189.57  on 383  degrees of freedom
#> AIC: 831.65
#> 
#> Number of Fisher Scoring iterations: 5

Inverse Gaussian Regression

Fit a model between the outcome Foliage, foliage biomass, and the predictor DBH, tree diameter at breast height, from the lime data set.

data("lime")
lime |>  glm(Foliage ~ DBH,
             data = _,
             family = inverse.gaussian(link = "log"))  |>
  summary()
#> 
#> Call:
#> glm(formula = Foliage ~ DBH, family = inverse.gaussian(link = "log"), 
#>     data = lime)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -2.7290     0.1113  -24.53   <2e-16 ***
#> DBH           0.2077     0.0118   17.59   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for inverse.gaussian family taken to be 1.571078)
#> 
#>     Null deviance: 873.60  on 384  degrees of freedom
#> Residual deviance: 542.83  on 383  degrees of freedom
#> AIC: 1192.6
#> 
#> Number of Fisher Scoring iterations: 16
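Since the Gamma and Inverse Gaussian models above use the same outcome, predictor, and link, their fits can be compared directly; a short sketch using the AIC values already reported above (831.65 vs 1192.6), which favor the Gamma model here:

gamma_fit <- glm(Foliage ~ DBH, data = lime, family = Gamma(link = "log"))
ig_fit    <- glm(Foliage ~ DBH, data = lime, family = inverse.gaussian(link = "log"))
AIC(gamma_fit, ig_fit)   # lower AIC indicates the better-fitting model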