Introduction to Statistical Learning

Learning Outcomes

  • Introduction to Statistical Learning

  • Classification vs Regression

  • Supervised vs Unsupervised Machine Learning

  • Model Adequacy

Book

Introduction to Statistical Learning

Introduction to Statistical Learning

What is Statistical Learning?

Statistical learning is the task of predicting an outcome of interest given a set of predictor variables.

Motivating Example

Statistical Learning Model

\[ Y = f(\boldsymbol X) + \varepsilon \]

  • \(Y\): Outcome variable

  • \(f(\cdot)\): systematic component explaining \(Y\)

  • \(\boldsymbol X\): vector of predictor variables

  • \(\varepsilon\): error term
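To make the pieces concrete, here is a minimal simulation of this model in R (a sketch, not from the original slides; the quadratic \(f\) and noise level are borrowed from the code example at the end of this section):

# Simulate Y = f(X) + error with an assumed f and Gaussian noise
set.seed(1)
f   <- function(x) 3 * x^2 - 5 * x + 3   # systematic component
x   <- runif(200, 0, 10)                 # predictor
eps <- rnorm(200, mean = 0, sd = 15)     # error term
y   <- f(x) + eps                        # observed outcome
plot(x, y)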

Modeling \(f(\cdot)\): Parametric

  • Linear Models

  • Generalized Linear Models (GLM)
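As a sketch of what parametric modeling looks like in practice (the mtcars variables below are an illustrative choice, not from the original slides):

# Linear model: f is assumed linear in the predictors
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Generalized linear model: e.g., logistic regression for a binary outcome
glm_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

summary(lm_fit)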

Modeling \(f(\cdot)\): Nonparametric

  • Generalized Additive Models

  • Local-Linear Models

  • Smoothing Splines
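A corresponding sketch of nonparametric fits in R; the mgcv package for GAMs and the base functions loess() and smooth.spline() are one possible toolset, and mtcars is again only an illustrative data set:

library(mgcv)   # one common implementation of generalized additive models

# Generalized additive model: smooth, data-driven functions of the predictors
gam_fit <- gam(mpg ~ s(wt) + s(hp), data = mtcars)

# Local-linear (loess) regression
loess_fit <- loess(mpg ~ wt, data = mtcars, degree = 1)

# Smoothing spline
spline_fit <- smooth.spline(mtcars$wt, mtcars$mpg)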

Prediction

  • With prediction as the goal, statistical learning is only concerned with producing an accurate estimate \(\hat Y\)

  • \(f(\cdot)\) is considered a black box

  • We will not know how \(\boldsymbol X\) explains \(Y\)

  • We choose flexible (nonparametric) models

Model Interpretability

  • With a focus on prediction, model interpretability declines

  • We will not know how changes in \(\boldsymbol X\) will affect \(Y\)

Classification vs Regression

Regression

In statistical learning terms, regression refers to predicting a continuous random variable.

What are the methods that we learned to model continuous random variables?

Example
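The original example is not reproduced here; as a stand-in, a minimal regression sketch on the built-in mtcars data (an assumed choice) predicting a continuous outcome:

# Predict a continuous outcome (miles per gallon) from two predictors
fit <- lm(mpg ~ wt + hp, data = mtcars)
predict(fit, newdata = data.frame(wt = 3, hp = 120))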

Classification

In statistical learning terms, classification refers to predicting a categorical random variable.

What are the methods that we learned to model categorical random variables?

Example

#>             iris_pred
#>              setosa versicolor virginica
#>   setosa         20          0         0
#>   versicolor      0         19         1
#>   virginica       0          0        20
#>             Actual
#> Predicted    setosa versicolor virginica
#>   setosa         50          0         0
#>   versicolor      0         48         2
#>   virginica       0          2        48
#>             
#> predicted    setosa versicolor virginica
#>   setosa         25          0         0
#>   versicolor      0         25         3
#>   virginica       0          2        20
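The code behind the confusion matrices above is not shown; one way such a table could be produced is with linear discriminant analysis on the iris data (an assumed approach, not necessarily what was used originally):

library(MASS)   # for lda()

# Hold out a test set, fit LDA on the rest, then cross-tabulate
set.seed(1)
test_id   <- sample(nrow(iris), 60)
lda_fit   <- lda(Species ~ ., data = iris[-test_id, ])
iris_pred <- predict(lda_fit, newdata = iris[test_id, ])$class
table(iris[test_id, "Species"], iris_pred)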

Supervised vs Unsupervised Machine Learning

Machine Learning

Machine learning is a set of methods used for predicting and classifying data. Several statistical methods are considered machine learning techniques.

Common Methods

  • Regression

  • Mixed-Effects

  • Nonparametric Regression

  • Neural Networks

  • Tree-based methods

  • Bayesian Methods

Training Data

Training Data is the data set used to construct a model.

Supervised

Supervised Machine Learning techniques are those where the training data contains the outcome variable.

Unsupervised

Unsupervised Machine Learning techniques are those where the training data does not contain the outcome variable.
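To make the distinction concrete, the sketch below contrasts a supervised fit, where the outcome column is part of the training data, with an unsupervised one, where only the predictors are used (iris and k-means are illustrative choices):

# Supervised: the training data includes the outcome (Sepal.Length)
sup_fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
              data = iris)

# Unsupervised: only the predictors are supplied; there is no outcome column
unsup_fit <- kmeans(iris[, 1:4], centers = 3)

# Compare the discovered clusters to the withheld species labels
table(unsup_fit$cluster, iris$Species)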

Model Adequacy

Quality of Fit: Regression

\[ MSE = \frac{1}{n}\sum^n_{i=1}\{y_i - \hat f(\boldsymbol x_i)\}^2 \]
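A quick sketch of this calculation in R, fitting a straight line to simulated data (an assumed setup that reuses the simulation from the code slide below):

set.seed(1)
x <- runif(100, 0, 10)
y <- 3 * x^2 - 5 * x + 3 + rnorm(100, sd = 15)

fit  <- lm(y ~ x)            # a deliberately simple f-hat
yhat <- fitted(fit)
mean((y - yhat)^2)           # training MSE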

Quality of Fit: Classification

\[ ER = \frac{1}{n}\sum^n_{i=1}I(y_i \ne \hat y_i) \]
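The analogous calculation for a classifier, sketched with an LDA fit to iris (an illustrative choice):

library(MASS)   # for lda()

fit  <- lda(Species ~ ., data = iris)
yhat <- predict(fit)$class
mean(iris$Species != yhat)   # proportion of misclassified observations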

Bias-Variance Tradeoff

Bias-Variance Tradeoff

\[ E(MSE) = E\left\{y-\hat f(x)\right\}^2 = Var\left\{\hat f(x)\right\} + Bias\left\{\hat f(x)\right\}^2 + Var(\varepsilon) \]
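One way to see the tradeoff numerically (a sketch under assumed settings, reusing the quadratic f from the code slide below): repeatedly simulate training sets, fit an inflexible and a flexible model, and compare the bias and variance of their predictions at a single point.

set.seed(1)
f  <- function(x) 3 * x^2 - 5 * x + 3   # true systematic component
x0 <- 5                                  # evaluation point

sim_fit <- function(degree) {
  x <- runif(100, 0, 10)
  y <- f(x) + rnorm(100, sd = 15)
  predict(lm(y ~ poly(x, degree)), newdata = data.frame(x = x0))
}

rigid    <- replicate(500, sim_fit(1))    # straight line: higher bias, lower variance
flexible <- replicate(500, sim_fit(15))   # degree-15 polynomial: lower bias, higher variance

c(bias_rigid    = mean(rigid) - f(x0),    var_rigid    = var(rigid),
  bias_flexible = mean(flexible) - f(x0), var_flexible = var(flexible))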

Example

Using a Line

Using a Highly Flexible Model

Semi-Flexible

All

Code
library(ggplot2)

# Simulate data from a quadratic trend with Gaussian noise
set.seed(42)   # for reproducibility
x <- runif(100, 0, 10)
y <- 3 * x^2 - 5 * x + 3 + rnorm(100, sd = 15)

data.frame(x, y) |>
  ggplot(aes(x, y)) +
  geom_point() +
  # Inflexible fit: a straight line
  stat_smooth(method = "lm",
              color = "skyblue4", fill = "skyblue2") +
  # Highly flexible fit: loess with a very small span
  stat_smooth(method = "loess", span = 0.075,
              color = "springgreen4", fill = "springgreen2") +
  # Semi-flexible fit: ggplot2's default smoother
  stat_smooth(color = "violetred", fill = "violet") +
  theme_bw()