Residual Analysis

Learning Objectives

  • Model Assumptions
  • Residual Analysis
  • Multicollinearity

Model Assumptions

Model

\[ y = \beta_0 + \beta_1 x + \epsilon \]

  • \(\epsilon \sim N(0,\sigma^2)\)

Model Scatter Plot

library(tidyverse)

x <- rnorm(1000, 8)                    # predictor values
y <- -4 + 2*x + rnorm(1000, sd = 1)    # linear mean with N(0, 1) errors
df <- tibble(x, y)
ggplot(df, aes(x, y)) + geom_point() +
  theme_bw()

Model Assumptions

  • Errors are normally distributed

  • Constant Variance

  • Linearity

  • Independence

  • No outliers

Errors Normally Distributed
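
One way to picture this assumption: simulate errors from a normal distribution and inspect their histogram. A minimal sketch; the bell shape centered at zero is what the assumption requires.

err <- rnorm(1000)                # errors drawn from N(0, 1)
ggplot(tibble(err), aes(err)) +
  geom_histogram(bins = 30) +
  theme_bw()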

Constant Variance
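
For contrast with the next slide, a minimal sketch of the well-behaved case, where the error standard deviation is fixed and does not depend on x:

x <- rnorm(1000, 8, sd = 0.5)
y <- -4 + 5*x + rnorm(1000, sd = 2)   # constant error sd
df <- tibble(x, y)
ggplot(df, aes(x, y)) + geom_point() +
  theme_bw()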

Constant Variance

x <- rnorm(1000, 8, sd = 0.5)
# error sd grows with x, so the variance is NOT constant
y <- sapply(x, \(xi) -4 + 5*xi + rnorm(1, sd = xi/2))
df <- tibble(x, y)
ggplot(df, aes(x, y)) + geom_point() +
  theme_bw()

Linearity

x <- rnorm(1000, 8)
y <- -4 + 2*x + rnorm(1000, sd = 1)   # straight-line mean: linearity holds
df <- tibble(x, y)
ggplot(df, aes(x, y)) + geom_point() +
  theme_bw()

Linearity

x <- rnorm(1000, 8, sd = 4)
y <- -4 - 1*x - 2*x^2 + rnorm(1000, sd = 16)   # quadratic mean: linearity violated
df <- tibble(x, y)
ggplot(df, aes(x, y)) + geom_point() +
  theme_bw()

No Outliers

x <- rnorm(1000, 8)
y <- -4 + 2*x + rnorm(1000, sd = 1)   # well-behaved data: no extreme points
df <- tibble(x, y)
ggplot(df, aes(x, y)) + geom_point() +
  theme_bw()

Residuals and Influence Measures

Residuals

Residuals are the differences between the observed values and the values fitted by the estimated model. Common residuals include

  • Raw Residual

  • Standardized Residual

  • Jackknife (studentized) Residuals

Influence Measures

Influence measures are statistics that quantify how much a single data point affects the fitted model. Common influence measures are

  • Leverages

  • Cook’s Distance

Raw Residuals

\[ \hat r_i = y_i - \hat y_i \]

Leverages

\[ H = \boldsymbol X (\boldsymbol X^\mathrm T\boldsymbol X)^{-1}\boldsymbol X ^\mathrm T \]

  • \(\boldsymbol X\): design matrix

  • \(h_{ii} = H[i,i]\): leverage for \(i\)th value
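
As a sanity check, the leverages can be computed directly from the design matrix. A minimal sketch using an illustrative model on iris (the same model fitted in the R Code section below):

fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
X <- model.matrix(fit)                  # design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
head(diag(H))                           # leverages h_ii
all.equal(diag(H), hatvalues(fit), check.attributes = FALSE)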

Standardized Residuals

\[ \hat r^*_i = \frac{\hat r_i}{\sqrt{\hat\sigma^2(1-h_{ii})}} \]

  • \(\hat \sigma^2\): Estimated mean square error

  • \(h_{ii}\): Leverage of \(i\)th data point

Jackknife Residuals

\[ \hat r ^\prime_i = \frac{y_i - \hat y_{i(i)}}{\sqrt{\hat \sigma^2_{(i)}/(1-h_{ii})}} \]

  • \(\hat y_{i(i)}\): fitted value for \(i\)th value from model fitted without \(i\)th data point

  • \(\hat\sigma^2_{(i)}\): mean square error from model fitted without \(i\)th data point

Cook’s Distance

\[ \hat d_i = \frac{(y_i - \hat y_{i})^2}{(k+1)\hat \sigma^2}\left\{\frac{h_{ii}}{(1-h_{ii})^2}\right\} \]

  • \(k\): number of predictors

Residual Analysis

Residual Analysis

A residual analysis is used to check whether the assumptions of linear regression hold.

QQ Plot

A QQ (quantile-quantile) plot plots the sample quantiles of the residuals against the theoretical quantiles of a normal distribution. If the points lie close to the \(y=x\) line, the residuals are consistent with a normal distribution.

Residual vs Fitted Plot

This plot lets you assess linearity and constant variance and identify potential outliers. Create a scatter plot of the fitted values (x-axis) against the raw or standardized residuals (y-axis).

Residual vs X Plots

This plot helps identify issues with linearity and can suggest potential solutions, such as transforming a predictor. Create a scatter plot of each predictor variable (x-axis) against the raw or standardized residuals (y-axis).

Outlier Plots

An outlier plot shows whether any observations in the data are outliers. Create a scatter plot of the index number (x-axis) against the standardized or studentized residuals (y-axis).

Influential Observations Plots

These plots identify outliers and observations that have a strong effect on the fitted model. Create a scatter plot of the index number (x-axis) against the leverages or Cook's distance (y-axis).

Multicollinearity

Multicollinearity

Multicollinearity occurs when predictor variables are correlated with each other. Collinearity between predictor variables inflates the standard errors of the coefficient estimates and causes problems with inference.

Variance Inflation Factor

The variance inflation factor (VIF) measures how much the variance of a coefficient estimate is inflated by its collinearity with the other predictors. A value greater than 10 is a common cause for concern, and action should be taken.
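
For predictor \(j\), the VIF is computed from the \(R^2\) of regressing \(X_j\) on all the other predictors:

\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} \]

  • \(R_j^2\): coefficient of determination from regressing \(X_j\) on the remaining predictors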

R Code

Fitting a model

x_lm <- iris |> lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = _) 
x_lm |> summary()
#> 
#> Call:
#> lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.25582 -0.46922 -0.05741  0.45530  1.75599 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -2.52476    0.56344  -4.481 1.48e-05 ***
#> Sepal.Length  1.77559    0.06441  27.569  < 2e-16 ***
#> Sepal.Width  -1.33862    0.12236 -10.940  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.6465 on 147 degrees of freedom
#> Multiple R-squared:  0.8677, Adjusted R-squared:  0.8659 
#> F-statistic:   482 on 2 and 147 DF,  p-value: < 2.2e-16

Data Prep

df_resid <- tibble(obs = 1:nrow(x_lm$model),    # observation index
                   x_lm$model,                  # data used in the fit
                   resid = resid(x_lm),         # raw residuals
                   fitted = fitted(x_lm),       # fitted values
                   sresid = rstandard(x_lm),    # standardized residuals
                   hatvals = hatvalues(x_lm),   # leverages h_ii
                   jackknife = rstudent(x_lm),  # jackknife residuals
                   cooks = cooks.distance(x_lm) # Cook's distance
                   )

Residual vs Fitted

df_resid |> 
  ggplot(aes(fitted, resid)) + geom_point() +
    geom_hline(yintercept = 0) +
    geom_smooth(se = FALSE) +
    theme_bw()

QQ Plot

df_resid |> 
  ggplot(aes(sample = resid)) + 
    stat_qq() +
    stat_qq_line() +
    theme_bw()

Residuals vs X

df_resid |> 
  ggplot(aes(Sepal.Length, resid)) + geom_point() +
    geom_hline(yintercept = 0) +
    stat_smooth(se = FALSE) +
    theme_bw()

df_resid |> 
  ggplot(aes(Sepal.Width, resid)) + geom_point() +
    geom_hline(yintercept = 0) +
    stat_smooth(se = FALSE) +
    theme_bw()

Jackknife Residuals

df_resid |> 
  ggplot(aes(obs, jackknife)) + geom_point() +
    theme_bw()

Leverages

df_resid |> 
  ggplot(aes(obs, hatvals)) + geom_point() +
    theme_bw()

Cook’s Distance

df_resid |> 
  ggplot(aes(obs, cooks)) + 
    geom_point() +
    theme_bw()

Multicollinearity

library(car)   # provides vif()
iris |> with(cor(Sepal.Length, Sepal.Width))   # weak pairwise correlation
#> [1] -0.1175698
vif(x_lm)
#> Sepal.Length  Sepal.Width 
#>     1.014016     1.014016

Issues with Linearity

Simulation Study

Simulate 1000 observations from the following model:

\[ Y = 3 + 2\log(X_1) + \epsilon \]

  • \(X_1\sim N(8,1)\)

  • \(\epsilon\sim N(0, 2)\)

x <- rnorm(1000, 8)
y <- 3 + 2 * log(x) + rnorm(1000, sd = sqrt(2))   # log relationship in the mean
df <- tibble(x, y)

Fit Model

Fit a simple linear model of \(Y\) on \(X_1\).

  • Examine the residual plots (a minimal sketch follows)
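
A minimal sketch, assuming the df simulated above: fit the straight-line model and plot residuals against fitted values; the smoother should show curvature, flagging the missed log relationship.

fit <- lm(y ~ x, data = df)
tibble(fitted = fitted(fit), resid = resid(fit)) |>
  ggplot(aes(fitted, resid)) + geom_point() +
    geom_hline(yintercept = 0) +
    geom_smooth(se = FALSE) +
    theme_bw()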

Issues with Constant Variance

Simulation Study

Simulate 1000 observations from the following model:

\[ Y = 3 + 2X_1 + \epsilon \]

  • \(X_1\sim N(8,4)\)

  • \(\epsilon_i\sim N(0,\sigma_i^2)\) with \(\sigma_i = (3+2x_i)^2\), so the error variance is not constant

n <- 1000
x <- rnorm(n, 8, sd = 2)
yp <- 3 + 2 * x                 # mean function
# error sd depends on the mean, so the variance is not constant
err <- sapply(yp, function(i){rnorm(1, sd = abs(i)^2)})
y <- yp + err
df <- tibble(x, y)

Fit Model

Fit a simple linear model of \(Y\) on \(X_1\).

  • Examine the residual plots (a minimal sketch follows)
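
A minimal sketch with the df simulated above; the residual spread should fan out as the fitted values grow, signalling non-constant variance.

fit <- lm(y ~ x, data = df)
tibble(fitted = fitted(fit), resid = resid(fit)) |>
  ggplot(aes(fitted, resid)) + geom_point() +
    geom_hline(yintercept = 0) +
    theme_bw()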

Issues with Normality

Simulation Study

Simulate 1000 observations from the following model:

\[ Y = 3 + 2X_1 + \epsilon \]

  • \(X_1\sim N(8,4)\)

  • \(\epsilon\sim Gamma(2, 1)\)

n <- 1000
x <- rnorm(n, 8, sd = 2)
y <- 3 + 2 * x + rgamma(n, 2)   # right-skewed Gamma(2, 1) errors
df <- tibble(x, y)

Fit Model

Fit a simple linear model of \(Y\) on \(X_1\).

  • Examine the residual plots (a minimal sketch follows)
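
A minimal sketch with the df simulated above; the gamma errors are right-skewed, so the points in a QQ plot of the residuals should drift away from the line in the tails.

fit <- lm(y ~ x, data = df)
tibble(resid = resid(fit)) |>
  ggplot(aes(sample = resid)) +
    stat_qq() +
    stat_qq_line() +
    theme_bw()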