Selection Criteria
Subset Selection
Shrinkage Methods
Given a set of predictors, we want to build a model that contains only the predictors that best fit the data, without overfitting.
Ideally, we want to choose a parsimonious model that best describes the outcome variable. The more predictors we put into the model, the less parsimonious it becomes and the less power it has.
Choosing the best model can be done with selection criteria such as Mallows' \(C_p\), AIC, AICc, BIC, and adjusted \(R^2\).
\[ C_p = \frac{1}{n}(RSS + 2 d \hat \sigma^2) \]
\(RSS\): Residual Sum of Squares
\(\hat \sigma^2\): Mean Square Error (an estimate of the error variance, taken from the full model)
\(d\): number of predictors
Lower is better
\[ AIC = \frac{1}{n\hat\sigma^2}\left(RSS+2d\hat\sigma^2\right) \]
\[ BIC = \frac{1}{n\hat\sigma^2}\left(RSS+\log(n)\,d\hat\sigma^2\right) \]
\[ R^2 = 1-\frac{RSS}{TSS} \]
\(RSS=\sum^n_{i=1}(y_i-\hat y_i)^2\)
\(TSS=\sum^n_{i=1}(y_i-\bar y)^2\)
Higher is better
\[ R^2_{adj} = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)} \]
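As a sketch of how these criteria can be computed by hand, the snippet below uses mtcars, takes \(\hat\sigma^2\) from the full model's MSE, and uses wt and qsec as candidate predictors purely for illustration.

full <- lm(mpg ~ ., data = mtcars)          # full model supplies sigma-hat^2
cand <- lm(mpg ~ wt + qsec, data = mtcars)  # candidate model with d predictors

n      <- nrow(mtcars)
d      <- length(coef(cand)) - 1            # number of predictors (excluding intercept)
sigma2 <- summary(full)$sigma^2             # MSE of the full model
RSS    <- sum(resid(cand)^2)
TSS    <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

Cp    <- (RSS + 2 * d * sigma2) / n
AIC_k <- (RSS + 2 * d * sigma2) / (n * sigma2)
BIC_k <- (RSS + log(n) * d * sigma2) / (n * sigma2)
adjR2 <- 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))

c(Cp = Cp, AIC = AIC_k, BIC = BIC_k, adjR2 = adjR2)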
Best Subset Model
Forward Stepwise Model Building
Backward Stepwise Model Building
Hybrid Stepwise Regression
Fit all \(\binom{p}{k}\) models that contain \(k\) predictors
Define \(M_k\) as the model with the largest \(R^2\)
Fit all \(p-k\) models that add one new predictor to the original model containing \(k\) predictors
Define \(M_{k+1}\) as the model with the largest \(R^2\) among the \(p-k\) models
Fit all \(k\) models that remove one predictor from \(M_k\), each containing \(k-1\) predictors
Define \(M_{k-1}\) as the model with the largest \(R^2\) among the \(k\) models
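The listings below can be reproduced with regsubsets() from the leaps package; a minimal sketch of the calls (matching the Call: lines shown in each listing), printing the summary of each fit:

library(leaps)
summary(regsubsets(mpg ~ ., data = mtcars))                      # best subset (exhaustive)
summary(regsubsets(mpg ~ ., data = mtcars, method = "forward"))  # forward stepwise
summary(regsubsets(mpg ~ ., data = mtcars, method = "backward")) # backward stepwise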
#> Subset selection object
#> Call: regsubsets.formula(mpg ~ ., data = mtcars)
#> 10 Variables (and intercept)
#> Forced in Forced out
#> cyl FALSE FALSE
#> disp FALSE FALSE
#> hp FALSE FALSE
#> drat FALSE FALSE
#> wt FALSE FALSE
#> qsec FALSE FALSE
#> vs FALSE FALSE
#> am FALSE FALSE
#> gear FALSE FALSE
#> carb FALSE FALSE
#> 1 subsets of each size up to 8
#> Selection Algorithm: exhaustive
#> cyl disp hp drat wt qsec vs am gear carb
#> 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " "
#> 2 ( 1 ) "*" " " " " " " "*" " " " " " " " " " "
#> 3 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " "
#> 4 ( 1 ) " " " " "*" " " "*" "*" " " "*" " " " "
#> 5 ( 1 ) " " "*" "*" " " "*" "*" " " "*" " " " "
#> 6 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" " " " "
#> 7 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" " "
#> 8 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" "*"
#> Subset selection object
#> Call: regsubsets.formula(mpg ~ ., data = mtcars, method = "forward")
#> 10 Variables (and intercept)
#> Forced in Forced out
#> cyl FALSE FALSE
#> disp FALSE FALSE
#> hp FALSE FALSE
#> drat FALSE FALSE
#> wt FALSE FALSE
#> qsec FALSE FALSE
#> vs FALSE FALSE
#> am FALSE FALSE
#> gear FALSE FALSE
#> carb FALSE FALSE
#> 1 subsets of each size up to 8
#> Selection Algorithm: forward
#> cyl disp hp drat wt qsec vs am gear carb
#> 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " "
#> 2 ( 1 ) "*" " " " " " " "*" " " " " " " " " " "
#> 3 ( 1 ) "*" " " "*" " " "*" " " " " " " " " " "
#> 4 ( 1 ) "*" " " "*" " " "*" " " " " "*" " " " "
#> 5 ( 1 ) "*" " " "*" " " "*" "*" " " "*" " " " "
#> 6 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " "
#> 7 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" " " " "
#> 8 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " "
#> Subset selection object
#> Call: regsubsets.formula(mpg ~ ., data = mtcars, method = "backward")
#> 10 Variables (and intercept)
#> Forced in Forced out
#> cyl FALSE FALSE
#> disp FALSE FALSE
#> hp FALSE FALSE
#> drat FALSE FALSE
#> wt FALSE FALSE
#> qsec FALSE FALSE
#> vs FALSE FALSE
#> am FALSE FALSE
#> gear FALSE FALSE
#> carb FALSE FALSE
#> 1 subsets of each size up to 8
#> Selection Algorithm: backward
#> cyl disp hp drat wt qsec vs am gear carb
#> 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " "
#> 2 ( 1 ) " " " " " " " " "*" "*" " " " " " " " "
#> 3 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " "
#> 4 ( 1 ) " " " " "*" " " "*" "*" " " "*" " " " "
#> 5 ( 1 ) " " "*" "*" " " "*" "*" " " "*" " " " "
#> 6 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" " " " "
#> 7 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" " "
#> 8 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" "*"
Shrinkage methods are techniques that reduce a fully parameterized model (a large number of predictors) towards a less parameterized model (a smaller number of effective predictors) by shrinking the coefficient estimates.
Ridge regression adds a shrinkage penalty term to the least squares criterion. The penalty shrinks the \(\beta\) coefficients towards 0, with the amount of shrinkage controlled by a penalty parameter (\(\lambda\)).
\[ \sum^n_{i=1}\left(Y_i-\beta_0 -\sum^p_{j=1}X_{ij}\beta_j\right)^2 + \lambda\sum^p_{j=1}\beta_j^2 \]
The Least Absolute Shrinkage and Selection Operator (LASSO) is a shrinkage method that forces the \(\beta\) coefficients of predictors without significant predictive power towards, and possibly exactly to, 0.
\[ \sum^n_{i=1}\left(Y_i-\beta_0 -\sum^p_{j=1}X_{ij}\beta_j\right)^2 + \lambda\sum^p_{j=1}|\beta_j| \]
Each method is capable of identifying the optimum MSE in the bias-variance trade-off, which leads to lower prediction error. The key is to find the optimal penalty parameter. This can be done with a cross-validation technique (next lecture; a brief preview follows the example below).
library(glmnet)
library(tidyverse)
library(palmerpenguins)
penguins <- penguins |> drop_na()
mod <- penguins |> model.matrix(~ flipper_length_mm + bill_depth_mm + bill_length_mm - 1,
                                data = _) # Must include -1 to remove the intercept, since glmnet adds its own
ridge_mod <- glmnet(x = mod,
                    y = penguins$body_mass_g,
                    alpha = 0,    # alpha = 0 fits ridge regression
                    lambda = 1.3) # lambda was chosen at random
coef(ridge_mod)[,1]
#> (Intercept) flipper_length_mm bill_depth_mm bill_length_mm
#> -6400.54459 50.53481 17.07967 3.60524
lasso_mod <- glmnet(x = mod,
                    y = penguins$body_mass_g,
                    alpha = 1,    # alpha = 1 fits the LASSO
                    lambda = 1.3) # lambda was chosen at random
coef(lasso_mod)[,1]
#> (Intercept) flipper_length_mm bill_depth_mm bill_length_mm
#> -6372.371080 50.530548 16.241920 3.311152
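As a brief preview of the cross-validation idea mentioned above, \(\lambda\) would normally be selected rather than picked at random; a minimal sketch using cv.glmnet() (10-fold cross-validation by default; the folds are random, so the selected \(\lambda\) will vary):

set.seed(1)  # cross-validation folds are assigned at random
cv_ridge <- cv.glmnet(x = mod, y = penguins$body_mass_g, alpha = 0)  # ridge
cv_lasso <- cv.glmnet(x = mod, y = penguins$body_mass_g, alpha = 1)  # LASSO
cv_ridge$lambda.min  # lambda that minimizes the cross-validated error
cv_lasso$lambda.min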