Cross-Validation
Leave-one-out
K-Fold
Bootstrap Methods
Training Error Rate is the error rate on the data used to fit the model of interest. It describes how well the model predicts the data used to construct it.
Test Error Rate is the error rate when the fitted model is used to predict new data points that were not used to fit it.
In order to obtain the test error rate directly, data not used to fit the model must be available.
This is not the case the majority of the time.
Resampling methods have therefore been developed to estimate the test error rate using only the existing data.
Cross-validation is an approach for obtaining a good estimate of the error rate of a machine learning algorithm. The data set is split into two parts: training and testing. The training set is used to fit the model, and the test set is used to evaluate the model and compute the error rate.
A cross-validation approach is especially useful when there is a tuning parameter: we can fit a model for each candidate value of the tuning parameter and choose the value that yields the lowest error rate.
The training and testing data sets are constructed by randomly assigning each data point to one of the two sets.
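As a minimal sketch, the split can be done by sampling row indices. The mtcars data, the 80/20 split, and the model below are illustrative assumptions, not part of these notes:

set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))   # randomly assign 80% of rows to training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

fit <- lm(mpg ~ wt + hp, data = train)                  # fit on the training set only
train_mse <- mean((train$mpg - predict(fit, train))^2)  # training error rate
test_mse  <- mean((test$mpg  - predict(fit, test))^2)   # test error rate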
Choose a set of tuning parameters to test.
For each \(k\)th tuning parameter, calculate the cross-validation error as follows:
Utilize the leave-one-out approach
For each observation, fit a model with the remaining observations and predict the excluded value
Compute the following error:
\[ CVE_k = \frac{1}{n}\sum^n_{i=1}e_i \]
where \(e_i\) is the prediction error (for example, the squared error) for the \(i\)th observation when it is predicted by the model fit without it.
Identify the \(k\)th tuning parameter with the lowest \(CVE_k\)
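A minimal LOOCV sketch in R, where the tuning parameter is taken to be the polynomial degree in a regression of mpg on wt from mtcars (both the data set and the model are illustrative assumptions):

degrees <- 1:5                 # candidate tuning parameters
n <- nrow(mtcars)

cve <- sapply(degrees, function(k) {
  e <- sapply(seq_len(n), function(i) {
    fit <- lm(mpg ~ poly(wt, k), data = mtcars[-i, ])       # fit without observation i
    (mtcars$mpg[i] - predict(fit, newdata = mtcars[i, ]))^2 # squared error e_i
  })
  mean(e)                      # CVE_k = (1/n) * sum(e_i)
})

degrees[which.min(cve)]        # tuning parameter with the lowest CVE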
Choose a set of tuning parameters to test.
Split the data into \(K\) subsets (folds).
For each \(j\)th tuning parameter, calculate the cross-validation error as follows:
For each fold \(k\), fit a model using the data excluding the \(k\)th fold
Predict the values in the \(k\)th fold using the fitted model
Repeat the process for each of the \(K\) folds
Compute the following error:
\[ CVE_j = \frac{1}{n}\sum^n_{i=1}e_i \]
where \(e_i\) is the prediction error for the \(i\)th observation, computed from the model fit with its fold held out.
Identify the \(j\)th tuning parameter with the lowest \(CVE_j\)
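A matching K-fold sketch over the same candidate degrees (the fold count, seed, data set, and model are illustrative assumptions):

set.seed(123)
K <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:K, length.out = n))   # randomly assign each row to a fold

degrees <- 1:5
cve <- sapply(degrees, function(j) {
  e <- numeric(n)
  for (k in 1:K) {
    test_idx <- which(folds == k)
    fit <- lm(mpg ~ poly(wt, j), data = mtcars[-test_idx, ])  # exclude fold k
    e[test_idx] <- (mtcars$mpg[test_idx] -
                      predict(fit, mtcars[test_idx, ]))^2     # errors for fold k
  }
  mean(e)                                   # CVE_j pooled over all n observations
})

degrees[which.min(cve)]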
Several R packages provide methods to execute a cross-validation approach.
glmnet
mtcars
Complete a LASSO approach using mtcars
to predict mpg from the remaining variables.
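One possible solution sketch uses glmnet's cv.glmnet, which runs (by default 10-fold) cross-validation over the LASSO penalty \(\lambda\); the seed and the choice of lambda.min below are assumptions:

library(glmnet)

x <- as.matrix(mtcars[, -1])    # predictors: every column except mpg
y <- mtcars$mpg

set.seed(123)
cv_fit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 requests the LASSO penalty
cv_fit$lambda.min                      # lambda with the lowest cross-validation error
coef(cv_fit, s = "lambda.min")         # coefficients at that lambda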
Bootstrap methods are used when we cannot compute standard errors theoretically. They are computationally intensive, but when all else fails a bootstrap approach provides reliable estimates of the standard errors.
Fitting the following model:
library(palmerpenguins)
library(tidyverse)
penguins <- penguins |> drop_na()   # remove rows with missing values
penguins |> lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
data = _)   # fit via the native pipe placeholder
#>
#> Call:
#> lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm +
#> bill_depth_mm, data = penguins)
#>
#> Coefficients:
#> (Intercept) flipper_length_mm bill_length_mm bill_depth_mm
#> -6445.476 50.762 3.293 17.836
Obtain the bootstrap-based standard errors for the regression coefficients, using \(B = 1000\) bootstrap samples.
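A minimal sketch of the nonparametric bootstrap for these coefficients, resampling rows of penguins with replacement (the seed is an assumption):

set.seed(123)
B <- 1000
n <- nrow(penguins)

# Each bootstrap sample: resample n rows with replacement, refit the model,
# and keep the coefficient vector (one column per bootstrap sample).
boot_coefs <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  coef(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
          data = penguins[idx, ]))
})

# Bootstrap standard error: the standard deviation of each coefficient
# across the B refits.
apply(boot_coefs, 1, sd)

The same computation can also be carried out with the boot package's boot() function, which additionally reports the bias of each estimate.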