Introduction to Statistical Learning

Learning Outcomes

  • Introduction to Statistical Learning

  • Classification vs Regression

  • Supervised vs Unsupervised Machine Learning

  • Model Adequacy

Book

Introduction to Statistical Learning

Introduction to Statistical Learning

What is Statistical Learning?

Statistical learning is the task of predicting an outcome of interest given a set of predictor variables.

Motivating Example

Statistical Learning Model

\[ Y = f(\boldsymbol X) + \varepsilon \]

  • \(Y\): Outcome variable

  • \(f(\cdot)\): systematic component explaining \(Y\)

  • \(\boldsymbol X\): vector of predictor variables

  • \(\varepsilon\): error term
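To make the pieces concrete, here is a minimal simulation of this model in R (a sketch, not from the original slides; the quadratic \(f\) and noise level are borrowed from the code example at the end of this section):

# Simulate Y = f(X) + error with an assumed f and Gaussian noise
set.seed(1)
f   <- function(x) 3 * x^2 - 5 * x + 3   # systematic component
x   <- runif(200, 0, 10)                 # predictor
eps <- rnorm(200, mean = 0, sd = 15)     # error term
y   <- f(x) + eps                        # observed outcome
plot(x, y)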

Modeling \(f(\cdot)\): Parametric

  • Linear Models

  • Generalized Linear Models (GLM)
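As a sketch of what parametric modeling looks like in practice (the mtcars variables below are an illustrative choice, not from the original slides):

# Linear model: f is assumed linear in the predictors
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Generalized linear model: e.g., logistic regression for a binary outcome
glm_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

summary(lm_fit)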

Modeling \(f(\cdot)\): Nonparametric

  • Generalized Additive Models

  • Local-Linear Models

  • Smoothing Splines
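A corresponding sketch of nonparametric fits in R; the mgcv package for GAMs and the base functions loess() and smooth.spline() are one possible toolset, and mtcars is again only an illustrative data set:

library(mgcv)   # one common implementation of generalized additive models

# Generalized additive model: smooth, data-driven functions of the predictors
gam_fit <- gam(mpg ~ s(wt) + s(hp), data = mtcars)

# Local-linear (loess) regression
loess_fit <- loess(mpg ~ wt, data = mtcars, degree = 1)

# Smoothing spline
spline_fit <- smooth.spline(mtcars$wt, mtcars$mpg)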

Prediction

  • With prediction as the goal, statistical learning is only concerned with producing an accurate estimate \(\hat Y\)

  • \(f(\cdot)\) is considered a black box

  • We will not know how \(\boldsymbol X\) explains \(Y\)

  • We choose flexible (nonparametric) models

Model Interpretability

  • With a focus on prediction, model interpretability declines

  • We will not know how changes in \(\boldsymbol X\) will affect \(Y\)

Classification vs Regression

Regression

In statistical learning terms, regression refers to predicting a continuous random variable.

What are the methods that we learned to model continuous random variables?

Example
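The original example is not reproduced here; as a stand-in, a minimal regression sketch on the built-in mtcars data (an assumed choice) predicting a continuous outcome:

# Predict a continuous outcome (miles per gallon) from two predictors
fit <- lm(mpg ~ wt + hp, data = mtcars)
predict(fit, newdata = data.frame(wt = 3, hp = 120))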

Classification

In statistical learning terms, classification refers to predicting a categorical random variable.

What are the methods that we learned to model categorical random variables?

Example

#>             iris_pred
#>              setosa versicolor virginica
#>   setosa         20          0         0
#>   versicolor      0         19         1
#>   virginica       0          0        20
#>             Actual
#> Predicted    setosa versicolor virginica
#>   setosa         50          0         0
#>   versicolor      0         48         2
#>   virginica       0          2        48
#>             
#> predicted    setosa versicolor virginica
#>   setosa         25          0         0
#>   versicolor      0         25         3
#>   virginica       0          2        20
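The code behind the confusion matrices above is not shown; one way such a table could be produced is with linear discriminant analysis on the iris data (an assumed approach, not necessarily what was used originally):

library(MASS)   # for lda()

# Hold out a test set, fit LDA on the rest, then cross-tabulate
set.seed(1)
test_id   <- sample(nrow(iris), 60)
lda_fit   <- lda(Species ~ ., data = iris[-test_id, ])
iris_pred <- predict(lda_fit, newdata = iris[test_id, ])$class
table(iris[test_id, "Species"], iris_pred)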

Supervised vs Unsupervised Machine Learning

Machine Learning

Machine learning is a set of methods used for predicting and classifying data. Several statistical methods are considered machine learning techniques.

Common Methods

  • Regression

  • Mixed-Effects

  • Nonparametric Regression

  • Neural Networks

  • Tree-based methods

  • Bayesian Methods

Training Data

Training Data is the data set used to construct a model.

Supervised

Supervised Machine Learning techniques are those where the training data contains the outcome variable.

Unsupervised

Unsupervised Machine Learning techniques are those where the training data does not contain the outcome variable.
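To make the distinction concrete, the sketch below contrasts a supervised fit, where the outcome column is part of the training data, with an unsupervised one, where only the predictors are used (iris and k-means are illustrative choices):

# Supervised: the training data includes the outcome (Sepal.Length)
sup_fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
              data = iris)

# Unsupervised: only the predictors are supplied; there is no outcome column
unsup_fit <- kmeans(iris[, 1:4], centers = 3)

# Compare the discovered clusters to the withheld species labels
table(unsup_fit$cluster, iris$Species)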

Model Adequacy

Quality of Fit: Regression

\[ MSE = \frac{1}{n}\sum^n_{i=1}\{y_i - \hat f(\boldsymbol x_i)\}^2 \]
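A quick sketch of this calculation in R, fitting a straight line to simulated data (an assumed setup that reuses the simulation from the code slide below):

set.seed(1)
x <- runif(100, 0, 10)
y <- 3 * x^2 - 5 * x + 3 + rnorm(100, sd = 15)

fit  <- lm(y ~ x)            # a deliberately simple f-hat
yhat <- fitted(fit)
mean((y - yhat)^2)           # training MSE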

Quality of Fit: Classification

\[ ER = \frac{1}{n}\sum^n_{i=1}I(y_i \ne \hat y_i) \]
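The analogous calculation for a classifier, sketched with an LDA fit to iris (an illustrative choice):

library(MASS)   # for lda()

fit  <- lda(Species ~ ., data = iris)
yhat <- predict(fit)$class
mean(iris$Species != yhat)   # proportion of misclassified observations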

Bias-Variance Tradeoff

Bias-Variance Tradeoff

\[ E(MSE) = E\left\{y-\hat f(x)\right\}^2 = Var\left\{\hat f(x)\right\} + Bias\left\{\hat f(x)\right\}^2 + Var(\varepsilon) \]
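One way to see the tradeoff numerically (a sketch under assumed settings, reusing the quadratic f from the code slide below): repeatedly simulate training sets, fit an inflexible and a flexible model, and compare the bias and variance of their predictions at a single point.

set.seed(1)
f  <- function(x) 3 * x^2 - 5 * x + 3   # true systematic component
x0 <- 5                                  # evaluation point

sim_fit <- function(degree) {
  x <- runif(100, 0, 10)
  y <- f(x) + rnorm(100, sd = 15)
  predict(lm(y ~ poly(x, degree)), newdata = data.frame(x = x0))
}

rigid    <- replicate(500, sim_fit(1))    # straight line: higher bias, lower variance
flexible <- replicate(500, sim_fit(15))   # degree-15 polynomial: lower bias, higher variance

c(bias_rigid    = mean(rigid) - f(x0),    var_rigid    = var(rigid),
  bias_flexible = mean(flexible) - f(x0), var_flexible = var(flexible))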

Example

Using a Line

Using a Highly Flexible Model

Semi-Flexible

All

Code
library(ggplot2)

# Simulate data from a quadratic trend with Gaussian noise
set.seed(42)   # for reproducibility
x <- runif(100, 0, 10)
y <- 3 * x^2 - 5 * x + 3 + rnorm(100, sd = 15)

data.frame(x, y) |>
  ggplot(aes(x, y)) +
  geom_point() +
  # Inflexible fit: a straight line
  stat_smooth(method = "lm",
              color = "skyblue4", fill = "skyblue2") +
  # Highly flexible fit: loess with a very small span
  stat_smooth(method = "loess", span = 0.075,
              color = "springgreen4", fill = "springgreen2") +
  # Semi-flexible fit: ggplot2's default smoother
  stat_smooth(color = "violetred", fill = "violet") +
  theme_bw()