Motivating Example
Polynomial Functions
Step Functions
Regression Splines
Motivating Example
Polynomial Functions
Step Functions
Regression Splines
Simple Linear Regression models the association between one predictor X
and an outcome Y
:
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
Polynomial Regression models the association between predictor X
and outcome Y
with a polynomial function. For example, X
can be related with Y
with a cubic polynomial:
\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon \]
The estimates of \(\boldsymbol \beta\) can be found by minimizing the following least squares formula for a given data set:
\[ L(\boldsymbol \beta) = \sum^n_{i=1}(Y_i-\hat Y_i)^2 \]
\[ \hat Y_i = \hat \beta_0 + \hat\beta_1 X_i + \hat\beta_2 X_i^2 + \hat\beta_3 X_i^3 \]
Polynomial functions can be utilized in GLMs as well. The model below is for a logistic model:
\[ P(Y=1|X) = \frac{\exp\left\{ \beta_0 + \sum^3_{j=1}\beta_j X^j \right\}}{1 +\exp \left\{ \beta_0 + \sum^3_{j=1}\beta_j X^j\right\}} \]
Simulate the data below a fit a quadratic, cubic, and quartic model. Compute the mean squared error and \(R^2\) for each model.
Motivating Example
Polynomial Functions
Step Functions
Regression Splines
Stepwise Function will add horizontal lines to best explain the data at different ranges of X
.
X
with k
different intervals.k-1
dummy variables indicating if the value X
belongs to the interval or not.Simulate the data below a fit step regression model with 10, 20, and 30 steps. Compute the mean squared error and \(R^2\) for each model.
Motivating Example
Polynomial Functions
Step Functions
Regression Splines
Basis functions model the outcome Y
with a set of predefined functions on X
:
\[ Y = \beta_0 + \sum^p_{j=1}\beta_jb_j(X) + \varepsilon \]
\(b_j(\cdot)\) is allowed to be any function we define
x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 + rnorm(1000, sd = 0.5)
ggplot(tibble(x=x, y=y), aes(x,y)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", formula = y ~ bs(x, knots = c(-0.25, 0, 0.25), degree = 1)) +
geom_vline(xintercept = 0.25, lty = 2, lwd = .2, col = "red") +
geom_vline(xintercept = -0.25, lty = 2, lwd = .2, col = "red") +
geom_vline(xintercept = 0, lty = 2, lwd = .2, col = "red") +
theme_bw()
Once the number of knots and locations are chosen, a common basis function to utilize is the truncated power function:
\[ h(x,\xi_l) = (x-\xi_l)_+^p = \left\{ \begin{array}{cc} (x-\xi_l)^p & x>\xi_l\\ 0 & \mathrm{Otherwise} \end{array} \right. \]
For \(L\) knots:
\[ Y = \beta_0 + \sum^p_{j=1}x^j\beta_j + \sum_{l=1}^Lh(x, \xi_l)\beta_{p+l} + \varepsilon \]
When choosing basis functions, we want maintain the following constraints at the location of the knots:
A common choice is to use a cubic spline function
For \(L\) knots:
\[ Y = \beta_0 + \sum^3_{j=1}x^j\beta_j + \sum_{l=1}^Lh(x, \xi_l)\beta_{3+l} + \varepsilon \]
Natural splines force the boundary knots to be fitted with simple lines instead of spline functions. The interior knots are fitted with spline functions.
The bs()
function in R from the splines
package will construct the basis matrix to be fitted in the lm
function. You will need to specify the locations of the knots. By default, the cubic splines are provided:
The ns()
function in R from the splines
package will construct the basis matrix for natural cubic splines. You will need to specify the locations of the knots.
Simulate the data below a fit regression model using cubic splines and natural cubic splines. Choose what you think is the appropriate number of knots and locations. Compute the mean squared error and \(R^2\) for each model.