Moving Beyond Linear Functions

Motivating Example

  • Motivating Example

  • Polynomial Functions

  • Step Functions

  • Regression Splines

Plot

Code
library(tidyverse)
x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 - 35.5 * x^3 + 60 * x^4 + rnorm(1000, sd = 0.5)
ggplot(tibble(x=x, y=y), aes(x,y)) + geom_point() + theme_bw()

Plot

Code
library(tidyverse)
x <- rnorm(1000, sd = 3.5)
y <- 1 + sinpi(x/8) + rnorm(1000, sd = 0.25)
ggplot(tibble(x=x, y=y), aes(x,y)) + geom_point() + theme_bw()

Polynomial Functions

  • Motivating Example

  • Polynomial Functions

  • Step Functions

  • Regression Splines

Simple Linear Regression

Simple Linear Regression models the association between one predictor X and an outcome Y:

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

Polynomial Regression

Polynomial Regression models the association between predictor X and outcome Y with a polynomial function. For example, X can be related with Y with a cubic polynomial:

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon \]

Finding estimates of \(\boldsymbol \beta\)

The estimates of \(\boldsymbol \beta\) can be found by minimizing the following least squares formula for a given data set:

\[ L(\boldsymbol \beta) = \sum^n_{i=1}(Y_i-\hat Y_i)^2 \]

\[ \hat Y_i = \hat \beta_0 + \hat\beta_1 X_i + \hat\beta_2 X_i^2 + \hat\beta_3 X_i^3 \]

Polynomial Functions in GLMs

Polynomial functions can be utilized in GLMs as well. The model below is for a logistic model:

\[ P(Y=1|X) = \frac{\exp\left\{ \beta_0 + \sum^3_{j=1}\beta_j X^j \right\}}{1 +\exp \left\{ \beta_0 + \sum^3_{j=1}\beta_j X^j\right\}} \]

Fitting a Linear Model

Code
x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 - 35.5 * x^3 + 60 * x^4 + rnorm(1000, sd = 0.5)
ggplot(tibble(x=x, y=y), aes(x,y)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw()

Fitting a Model

Code
x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 - 35.5 * x^3 + 60 * x^4 + rnorm(1000, sd = 0.5)
ggplot(tibble(x=x, y=y), aes(x,y)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 4)) +
  theme_bw()

Polynomial Functions in R

lm(y ~ poly(x, degree = p),
   data)

Individual terms in R

lm(y ~ I(x^3),
   data)

Example

Simulate the data below a fit a quadratic, cubic, and quartic model. Compute the mean squared error and \(R^2\) for each model.

x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 - 35.5 * x^3 + 60 * x^4 + rnorm(1000, sd = 0.5)

Step Functions

  • Motivating Example

  • Polynomial Functions

  • Step Functions

  • Regression Splines

Stepwise Function

Stepwise Function will add horizontal lines to best explain the data at different ranges of X.

Plot

Code
library(tidyverse)
library(splines)
x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 - 35.5 * x^3 + 60 * x^4 + rnorm(1000, sd = 0.5)
ggplot(tibble(x=x, y=y), aes(x,y)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ cut(x, 10)) +
  theme_bw()

Constructing Model

  • Divide the range of X with k different intervals.
  • Create k-1 dummy variables indicating if the value X belongs to the interval or not.
  • Construct a model incorporating all dummy variable and their corresponding coefficient.
  • Find the estimates of the model by minimizing the Least Squares Estimator.

Step Functions in R

lm(y ~ cut(x, 10),
   data)

Example

Simulate the data below a fit step regression model with 10, 20, and 30 steps. Compute the mean squared error and \(R^2\) for each model.

x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 - 35.5 * x^3 + 60 * x^4 + rnorm(1000, sd = 0.5)

Regression Splines

  • Motivating Example

  • Polynomial Functions

  • Step Functions

  • Regression Splines

Basis Functions

Basis functions model the outcome Y with a set of predefined functions on X:

\[ Y = \beta_0 + \sum^p_{j=1}\beta_jb_j(X) + \varepsilon \]

  • \(\boldsymbol \beta\): Regression Coefficients
  • \(b_j(\cdot)\): basis functions

\(b_j(\cdot)\) is allowed to be any function we define

Knots

Code
x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 + rnorm(1000, sd = 0.5)
ggplot(tibble(x=x, y=y), aes(x,y)) + 
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x + I(x * (x>0))) +
  theme_bw()

More Knots

Code
x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 + rnorm(1000, sd = 0.5)
ggplot(tibble(x=x, y=y), aes(x,y)) + 
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", formula = y ~ bs(x, knots = c(-0.25, 0, 0.25), degree = 1)) +
  geom_vline(xintercept = 0.25, lty = 2, lwd = .2, col = "red") +
  geom_vline(xintercept = -0.25, lty = 2, lwd = .2, col = "red") +
  geom_vline(xintercept = 0, lty = 2, lwd = .2, col = "red") +
  theme_bw()

Truncated Power Basis Function

Once the number of knots and locations are chosen, a common basis function to utilize is the truncated power function:

\[ h(x,\xi_l) = (x-\xi_l)_+^p = \left\{ \begin{array}{cc} (x-\xi_l)^p & x>\xi_l\\ 0 & \mathrm{Otherwise} \end{array} \right. \]

Spline Function

For \(L\) knots:

\[ Y = \beta_0 + \sum^p_{j=1}x^j\beta_j + \sum_{l=1}^Lh(x, \xi_l)\beta_{p+l} + \varepsilon \]

Spline Functions Constraints

When choosing basis functions, we want maintain the following constraints at the location of the knots:

  • Continuous
  • First Differentiable
  • Second Differentiable

A common choice is to use a cubic spline function

Cubic Splines

For \(L\) knots:

\[ Y = \beta_0 + \sum^3_{j=1}x^j\beta_j + \sum_{l=1}^Lh(x, \xi_l)\beta_{3+l} + \varepsilon \]

Natural Cubic Splines

Natural splines force the boundary knots to be fitted with simple lines instead of spline functions. The interior knots are fitted with spline functions.

Plot

Code
x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 - 35.5 * x^3 + 60 * x^4 + rnorm(1000, sd = 0.5)
ggplot(tibble(x=x, y=y), aes(x,y)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ bs(x, knots = c(-0.25, 0.25))) +
  theme_bw()

Natural Cubic Splines

Code
x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 - 35.5 * x^3 + 60 * x^4 + rnorm(1000, sd = 0.5)
ggplot(tibble(x=x, y=y), aes(x,y)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ ns(x, knots = c(-0.25, 0.25))) +
  theme_bw()

Cubic Splines in R

The bs() function in R from the splines package will construct the basis matrix to be fitted in the lm function. You will need to specify the locations of the knots. By default, the cubic splines are provided:

lm(y ~ bs(x, knots = c()),
   data)

Natural Cubic Splines in R

The ns() function in R from the splines package will construct the basis matrix for natural cubic splines. You will need to specify the locations of the knots.

lm(y ~ ns(x, knots = c()),
   data)

Example

Simulate the data below a fit regression model using cubic splines and natural cubic splines. Choose what you think is the appropriate number of knots and locations. Compute the mean squared error and \(R^2\) for each model.

x <- rnorm(1000, sd = 0.25)
y <- 1 + 5.3 * x - 45 * x^2 - 35.5 * x^3 + 60 * x^4 + rnorm(1000, sd = 0.5)