Introduction to Statistical Learning
Classification vs Regression
Supervised vs Unsupervised Machine Learning
Model Adequacy
What is Statistical Learning?
Statistical learning is the task of predicting an outcome of interest given a set of predictor variables.
\[ Y = f(\boldsymbol X) + \varepsilon \]
\(Y\): Outcome variable
\(f(\cdot)\): systematic component explaining \(Y\)
\(\boldsymbol X\): vector of predictor variables
\(\varepsilon\): error term
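To make the pieces concrete, here is a minimal simulation of the model above; the specific \(f(\cdot)\), sample size, and error distribution are illustrative choices, not part of the definition.

```r
# Simulate Y = f(X) + eps with a hypothetical f(x) = 2 + 3x
set.seed(1)
n   <- 200
x   <- runif(n, 0, 10)         # predictor X
eps <- rnorm(n, sd = 2)        # error term
f   <- function(x) 2 + 3 * x   # systematic component f(.)
y   <- f(x) + eps              # outcome Y
head(data.frame(x, y))
```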
Linear Models
Generalized Linear Models (GLM)
Generalized Additive Models
Local-Linear Models
Smoothing Splines
Statistical learning is often concerned only with accurate prediction of \(Y\)
\(f(\cdot)\) is considered a black box
We will not know how \(\boldsymbol X\) explains \(Y\)
We choose flexible (nonparametric) models
With a focus on prediction, model interpretability declines
We will not know how changes in \(\boldsymbol X\) will affect \(Y\)
Regression in statistical learning terms indicates predicting a continuous random variable.
What are the methods that we learned to model continuous random variables?
Classification in statistical learning terms indicates predicting a categorical random variable.
What are the methods that we learned to model categorical random variables?
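As one possible classification sketch, a k-nearest-neighbors classifier on the `iris` data yields a confusion matrix like those shown below; `class::knn`, the 80/20 split, and `k = 5` are illustrative choices, so the counts will differ from the printed output.

```r
library(class)  # provides knn()

set.seed(1)
# Split iris into training and test sets (80/20)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# Predict the species of each test observation
iris_pred <- knn(train, test, cl = iris$Species[idx], k = 5)

# Confusion matrix: predicted classes (rows) vs actual classes (columns)
table(Predicted = iris_pred, Actual = iris$Species[-idx])
```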
#>             iris_pred
#>              setosa versicolor virginica
#>   setosa         20          0         0
#>   versicolor      0         19         1
#>   virginica       0          0        20
#>             Actual
#> Predicted    setosa versicolor virginica
#>   setosa         50          0         0
#>   versicolor      0         48         2
#>   virginica       0          2        48
#>
#> predicted    setosa versicolor virginica
#>   setosa         25          0         0
#>   versicolor      0         25         3
#>   virginica       0          2        20
Machine learning is a set of methods used for predicting and classifying data. Several statistical methods are considered machine learning techniques.
Regression
Mixed-Effects
Nonparametric Regression
Neural Networks
Tree-based methods
Bayesian Methods
Training Data is the data set used to construct a model.
Supervised Machine Learning techniques are those where the training data contains the outcome.
Unsupervised Machine Learning techniques are those where the training data does not contain the outcome.
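A side-by-side sketch of the two settings on the built-in `iris` data; treating `Sepal.Length` as the outcome and using k-means with 3 clusters are illustrative choices.

```r
# Supervised: the training data includes the outcome (Sepal.Length),
# and the model learns to predict it from a predictor
fit_sup <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# Unsupervised: no outcome is supplied; structure is found
# from the predictors alone (k-means clustering)
set.seed(1)
fit_unsup <- kmeans(iris[, 1:4], centers = 3)

# Compare discovered clusters to the (withheld) species labels
table(Cluster = fit_unsup$cluster, Species = iris$Species)
```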
\[ MSE = \frac{1}{n}\sum^n_{i=1}\{y_i - \hat f(\boldsymbol x_i)\}^2 \]
\[ ER = \frac{1}{n}\sum^n_{i=1}I(y_i \ne \hat y_i) \]
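Both loss measures are direct to compute in R. The simulated regression data and the deliberately naive classification rule below are illustrative assumptions, chosen only to show the arithmetic.

```r
# MSE for a regression fit (continuous outcome)
set.seed(1)
x   <- runif(100, 0, 10)
y   <- 3 * x + rnorm(100)
fit <- lm(y ~ x)
mse <- mean((y - fitted(fit))^2)   # average squared residual

# Error rate for a classification rule (categorical outcome):
# a trivial rule that always predicts "setosa"
y_cat <- iris$Species
y_hat <- factor(rep("setosa", length(y_cat)), levels = levels(y_cat))
er    <- mean(y_cat != y_hat)      # proportion misclassified

c(MSE = mse, ErrorRate = er)
```

Since `iris` has 50 of each of three species, the always-setosa rule misclassifies 100 of 150 observations, an error rate of 2/3.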
\[ E(MSE) = E\left\{y-\hat f(x)\right\}^2 = Var\left\{\hat f(x)\right\} + Bias\left\{\hat f(x)\right\}^2 + Var(\varepsilon) \]
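A Monte Carlo sketch of the decomposition at a fixed point \(x_0\): repeatedly refit a deliberately underfit linear model to data from a quadratic truth, and check that the average squared prediction error matches variance plus squared bias plus the irreducible error \(Var(\varepsilon)\). All specifics (the truth \(f(x)=x^2\), \(x_0=2\), \(\sigma=1\)) are illustrative.

```r
set.seed(1)
f     <- function(x) x^2   # hypothetical true regression function
x0    <- 2                 # fixed evaluation point
sigma <- 1                 # error standard deviation
reps  <- 5000

fhat0  <- numeric(reps)    # fitted values at x0 across training sets
sq_err <- numeric(reps)    # squared errors on fresh test outcomes

for (r in 1:reps) {
  x   <- runif(50, 0, 4)
  y   <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ x)                           # misspecified: no x^2 term
  fhat0[r]  <- predict(fit, data.frame(x = x0))
  y0        <- f(x0) + rnorm(1, sd = sigma)  # fresh test outcome at x0
  sq_err[r] <- (y0 - fhat0[r])^2
}

variance <- var(fhat0)                 # Var{fhat(x0)}
bias2    <- (mean(fhat0) - f(x0))^2    # Bias{fhat(x0)}^2

# The two columns should agree up to Monte Carlo error
c(EMSE = mean(sq_err),
  VarPlusBias2PlusSigma2 = variance + bias2 + sigma^2)
```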
library(ggplot2)

set.seed(1)  # added for reproducibility
x <- runif(100, 0, 10)
y <- 3 * x^2 - 5 * x + 3 + rnorm(100, sd = 15)

data.frame(x, y) |>
  ggplot(aes(x, y)) +
  geom_point() +
  stat_smooth(method = "lm",                   # inflexible fit: high bias
              color = "skyblue4", fill = "skyblue2") +
  stat_smooth(method = "loess", span = 0.075,  # very flexible fit: high variance
              color = "springgreen4", fill = "springgreen2") +
  stat_smooth(color = "violetred", fill = "violet") +  # default smoother
  theme_bw()