Trees
Pruning
Classification Trees
Regression Trees
Bagging
Random Forests
Boosting
R Code
A statistical tree partitions the predictor space into regions and uses those regions to predict an outcome of interest.
Trees split a region based on a predictor's ability to reduce the overall mean squared error.
Trees are sometimes preferred to linear models because the fitted model can be explained visually.
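As a minimal sketch of this splitting rule, the snippet below searches over candidate cut points for a single predictor (flipper_length_mm predicting body_mass_g, borrowed from the R Code section) and keeps the cut point that minimizes the residual sum of squares of the two resulting regions.
library(palmerpenguins)
library(tidyverse)

penguins <- penguins |> drop_na()

x <- penguins$flipper_length_mm   # single predictor
y <- penguins$body_mass_g         # outcome

# RSS of splitting the region at cut point s
split_rss <- function(s) {
  left  <- y[x <  s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Evaluate every candidate cut point and keep the best one
cuts <- sort(unique(x))[-1]
best_cut <- cuts[which.min(sapply(cuts, split_rss))]
best_cut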
Pruning
Pruning is the process of removing branches from a regression tree in order to prevent overfitting.
The result is a subtree that retains most of the predictive power of the full tree while reducing overfitting.
Because evaluating every possible subtree is computationally burdensome, cost complexity pruning is recommended.
Let \(\alpha\) be a nonnegative tuning parameter that indexes a sequence of subtrees. For each value of \(\alpha\), identify the subtree \(T\) that minimizes:
\[ \sum^{|T|}_{m=1}\sum_{i:\ x_i \in R_m}(y_i-\hat y_{R_m})^2 +\alpha|T| \]
\(|T|\): Number of terminal nodes
\(R_m\): Rectangular region (terminal node) containing the data
\(y_i\): Observed value
\(\hat y_{R_m}\): Predicted value (mean of the training observations) in region \(R_m\)
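A minimal sketch of cost complexity pruning, assuming the rpart package: its complexity parameter cp plays the role of \(\alpha\), and we prune back to the value with the smallest cross-validated error.
library(rpart)
library(palmerpenguins)
library(tidyverse)

penguins <- penguins |> drop_na()

# Grow a deliberately large regression tree
full_tree <- rpart(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm,
                   data = penguins, method = "anova", cp = 0.001)

printcp(full_tree)   # cross-validated error for each complexity parameter

# Prune back to the cp value with the smallest cross-validated error
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)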
Classification Trees
Classification trees construct a tree that classifies an observation based on the region (leaf) in which it lands; the predicted class is the majority class of the training observations in that leaf.
The Gini index measures node purity and is used to determine the splits in classification trees:
\[ G = \sum^K_{k=1} \hat p_{mk}(1-\hat p_{mk}) \]
\(\hat p_{mk}\): Proportion of training observations in the \(m\)th region that belong to the \(k\)th class
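A minimal sketch, assuming the rpart package (whose default splitting criterion for classification is the Gini index) and penguin species as an illustrative outcome: a small helper computes the Gini index from class proportions, and rpart fits the classification tree.
library(rpart)
library(palmerpenguins)
library(tidyverse)

penguins <- penguins |> drop_na()

# Gini index for a vector of class proportions p_mk
gini <- function(p) sum(p * (1 - p))

# Gini index of the root node for species
gini(prop.table(table(penguins$species)))

# Classification tree: predicted class is the leaf majority
class_tree <- rpart(species ~ bill_depth_mm + bill_length_mm + flipper_length_mm,
                    data = penguins, method = "class")
predict(class_tree, type = "class") |> head()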
Regression Trees
Regression trees construct a tree and predict the outcome as the average value of the training observations in the region (leaf) where the observation lands.
Trees are constructed by minimizing the residual sum of squares.
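A minimal sketch of a regression tree with the rpart package, using the same penguin predictors as the R Code section; each leaf predicts the mean body mass of its region.
library(rpart)
library(palmerpenguins)
library(tidyverse)

penguins <- penguins |> drop_na()

# Regression tree: each leaf predicts the mean body mass of its region
reg_tree <- rpart(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm,
                  data = penguins, method = "anova")

plot(reg_tree)              # draw the tree structure
text(reg_tree, cex = 0.7)   # label splits and leaf means

predict(reg_tree) |> head()   # fitted values are the leaf averages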
Bagging
When the data are split into training and test sets, the constructed tree suffers from high variance.
This is because the data are split at random: one training data set can lead to very different results from another.
To improve performance, we implement Bootstrap Aggregation (bagging).
Bagging produces a forest of trees whose predictions are combined for each new observation.
Given a single training data set:
Sample from the data with replacement.
Build a tree from the sampled data:
\[ \hat f^{*b}(x) \]
Repeat the process \(B\) times (e.g., \(B = 100\))
Compute the final average for all predictions:
\[ \hat f_{bag}(x)=\frac{1}{B}\sum^B_{b=1}\hat f^{*b}(x) \]
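A minimal sketch of these steps, assuming rpart for the individual trees and \(B = 100\) bootstrap samples; the final prediction averages across the trees as in the formula above.
library(rpart)
library(palmerpenguins)
library(tidyverse)

penguins <- penguins |> drop_na()

set.seed(123)
B <- 100
n <- nrow(penguins)

# Store each bootstrap tree's predictions on the full data set
preds <- matrix(NA, nrow = n, ncol = B)

for (b in 1:B) {
  boot_idx <- sample(1:n, n, replace = TRUE)   # sample with replacement
  tree_b <- rpart(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm,
                  data = penguins[boot_idx, ])
  preds[, b] <- predict(tree_b, newdata = penguins)
}

# Bagged prediction: average over the B trees
f_bag <- rowMeans(preds)
head(f_bag)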
To classify an observation, record the class predicted by each of the \(B\) trees, then classify by majority vote.
With bagging, the interpretability of a single tree is lost because predictions now come from a whole forest.
However, we can record how much each variable reduces the RSS (or Gini index), averaged across all of the trees; the variables with the largest reductions are considered the most important.
Random Forests
Random forests extend bagging: a forest is again grown from bootstrap samples, but each time a split is considered, only a random subset of \(m < p\) predictors is eligible, rather than the full set of \(p\) predictors.
This decorrelates the trees.
It ensures that no single strong predictor dominates every split, which lowers the variance of the averaged predictions.
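A minimal sketch with the randomForest package, setting mtry = 2 so that only a random subset of the \(p = 3\) predictors is considered at each split (the bagging call in the R Code section uses mtry = p instead).
library(randomForest)
library(palmerpenguins)
library(tidyverse)

penguins <- penguins |> drop_na()

set.seed(123)
rf_penguins <- randomForest(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm,
                            data = penguins,
                            mtry = 2,          # m < p predictors considered at each split
                            importance = TRUE)

importance(rf_penguins)   # variable importance across the forest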
Boosting
Boosting is a mechanism in which the final model is built up slowly from small trees fit to the residuals of the current fit.
Growing the model slowly helps prevent overfitting and improves predictive performance.
Set \(\hat f(x) = 0\) and \(r_i = y_i\) for all \(i\) in the training set
For \(b=1, 2, \ldots, B\) repeat:
Fit tree \(\hat f^b\) with \(d\) splits (\(d+1\) terminal nodes) to the training data \((X,r)\)
Update \(\hat f\) by adding a shrunken version of the new tree, where \(\lambda\) is a small shrinkage parameter (learning rate):
\[ \hat f(x) \leftarrow \hat f(x) + \lambda\hat f^b(x) \]
Update the residuals:
\[ r_i \leftarrow r_i - \lambda\hat f^{b}(x_i) \]
Output boosted model:
\[ \hat f(x) = \sum^B_{b=1} \lambda \hat f^b(x) \]
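A minimal sketch, assuming the gbm package (the slides do not show boosting code): n.trees corresponds to \(B\), interaction.depth to \(d\), and shrinkage to \(\lambda\).
library(gbm)
library(palmerpenguins)
library(tidyverse)

penguins <- penguins |> drop_na()

set.seed(123)
boost_penguins <- gbm(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm,
                      data = as.data.frame(penguins),
                      distribution = "gaussian",   # squared-error loss for regression
                      n.trees = 1000,              # B
                      interaction.depth = 2,       # d splits per tree
                      shrinkage = 0.01)            # lambda

summary(boost_penguins)   # relative influence of each predictor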
R Code
library(randomForest)
library(palmerpenguins)
library(tidyverse)

# Drop rows with missing values
penguins <- penguins |> drop_na()

# Split the data: half of the rows form the training set
set.seed(123)
train <- sample(1:nrow(penguins), nrow(penguins) / 2)

# Bagging: mtry = 3 uses all p = 3 predictors at every split
bag_penguins <- penguins |> randomForest(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm,
                                         data = _,
                                         subset = train,
                                         mtry = 3,
                                         importance = TRUE)
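A possible follow-up, not part of the original code: inspect variable importance across the forest and compute the test mean squared error on the held-out rows.
# Variable importance: average reduction in RSS attributable to each predictor
importance(bag_penguins)
varImpPlot(bag_penguins)

# Test-set performance on the held-out half
test <- penguins[-train, ]
pred <- predict(bag_penguins, newdata = test)
mean((pred - test$body_mass_g)^2)   # test MSE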