Multivariable Linear Regression
Simulation Study 1
Simulation Study 2
Matrix Formulation
Multivariable linear regression models are used when more than one explanatory variable is needed to explain the outcome of interest.
To include an additional continuous variable, we simply add another term to the model:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \]
Using the penguins data from palmerpenguins, fit a model with body_mass_g as the outcome variable and flipper_length_mm and bill_length_mm as predictor variables.
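A minimal sketch of the call that produces the output below (assuming the palmerpenguins package is installed):

```r
library(palmerpenguins)

# Regress body mass on flipper length and bill length
fit <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)
fit
```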
#>
#> Call:
#> lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm,
#>     data = penguins)
#>
#> Coefficients:
#>       (Intercept)  flipper_length_mm     bill_length_mm
#>         -5736.897             48.145              6.047
A categorical variable can be included in a model, but a reference category must be specified.
To fit a model with categorical variables, we use dummy (binary) variables that indicate which category each observation belongs to. We need \(C-1\) dummy variables, where \(C\) is the number of categories. When coded correctly, each category is represented by a unique combination of dummy values.
If we have 4 categories, we will need 3 dummy variables:
|         | Cat 1 | Cat 2 | Cat 3 | Cat 4 |
|---------|-------|-------|-------|-------|
| Dummy 1 | 1     | 0     | 0     | 0     |
| Dummy 2 | 0     | 1     | 0     | 0     |
| Dummy 3 | 0     | 0     | 1     | 0     |
Which one is the reference category?
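In R, the dummy coding that lm() will use can be inspected with model.matrix(); a quick sketch using the island factor from palmerpenguins as an example:

```r
library(palmerpenguins)

# One dummy column per non-reference level; rows from the
# reference level (Biscoe) have 0 in every dummy column
head(model.matrix(~ island, data = penguins))
```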
Using the penguins data from palmerpenguins, fit a model with body_mass_g as the outcome variable and flipper_length_mm and island as predictor variables.
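A minimal sketch of the call behind the output below; island is a factor, so lm() builds the dummy variables automatically:

```r
library(palmerpenguins)

fit <- lm(body_mass_g ~ flipper_length_mm + island, data = penguins)
fit
```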
#>
#> Call:
#> lm(formula = body_mass_g ~ flipper_length_mm + island, data = penguins)
#>
#> Coefficients:
#>       (Intercept)  flipper_length_mm        islandDream  islandTorgersen
#>          -4624.98              44.54            -262.18          -185.13
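R treats the first factor level as the reference category. The output below presumably comes from checking the levels directly:

```r
levels(penguins$island)
```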
#> [1] "Biscoe" "Dream" "Torgersen"
Simulate 1000 observations from the following model:
\[ Y = 3 + 2X_1 + 4X_2 + \epsilon \]
\(X_1\sim N(2,1)\)
\(X_2\sim N(-4,1)\)
\(\epsilon\sim N(0, 2)\)
Fit a model between \(Y\) and \(X_1\).
Repeat the process 1000 times and answer the following questions:
On average does \(\beta_1\) get estimated correctly? Why?
What is the average model variance?
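One possible sketch of this study, interpreting the 2 in \(\epsilon\sim N(0,2)\) as the variance (so sd \(=\sqrt 2\)); the seed and object names are arbitrary:

```r
set.seed(1)  # arbitrary seed

n <- 1000
sims <- replicate(1000, {
  x1  <- rnorm(n, mean = 2, sd = 1)
  x2  <- rnorm(n, mean = -4, sd = 1)
  eps <- rnorm(n, mean = 0, sd = sqrt(2))  # assuming N(0, 2) states a variance
  y   <- 3 + 2 * x1 + 4 * x2 + eps

  # Fit the simple model that omits x2
  fit <- lm(y ~ x1)
  c(beta1 = unname(coef(fit)[2]), sigma2 = summary(fit)$sigma^2)
})

rowMeans(sims)  # average beta1 estimate and average model variance
```

Because \(X_1\) and \(X_2\) are generated independently here, omitting \(X_2\) should not bias \(\hat\beta_1\), but the unexplained \(4X_2\) term inflates the model variance.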
Instead of fitting a simple linear regression model, fit a model that includes the predictor \(X_2\). This can be done by adding \(X_2\) to the formula in R:
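For example, reusing the simulated x1, x2, and y from the sketch above:

```r
fit2 <- lm(y ~ x1 + x2)
```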
Modify your simulation study and see what happens to \(\beta_1\) and the model variance.
Simulate 1000 observations from the following model:
\[ Y = 3 + 2\log(X_1) + \epsilon \]
\(X_1\sim N(8,1)\)
\(\epsilon\sim N(0, 2)\)
Fit a model between \(Y\) and \(X_1\).
Repeat the process 1000 times and answer the following questions:
On average does \(\beta_1\) get estimated correctly? Why?
What is the average model variance?
Fit a simple linear regression model using \(\log(X_1)\) instead.
Modify your simulation study and see what happens to \(\beta_1\) and the model variance.
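A minimal sketch of one draw from this study, again reading \(N(0,2)\) as mean 0 and variance 2; wrapping it in replicate() as before gives the averages the questions ask about:

```r
set.seed(2)  # arbitrary seed

n <- 1000
x1  <- rnorm(n, mean = 8, sd = 1)
eps <- rnorm(n, mean = 0, sd = sqrt(2))  # assuming N(0, 2) states a variance
y   <- 3 + 2 * log(x1) + eps

coef(lm(y ~ x1))       # misspecified: linear in x1
coef(lm(y ~ log(x1)))  # matches the generating model
```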
\[ Y_i = \boldsymbol X_i^\mathrm T \boldsymbol \beta + \epsilon_i \]
\(Y_i\): outcome variable
\(\boldsymbol X_i\): predictors
\(\boldsymbol \beta\): coefficients
\(\epsilon_i\): error term
Fit the following models using matrix formulas instead of the lm function.
\[ bmg = \beta_0+\beta_1 flipper\_length\_mm + \beta_2 bill\_length\_mm \]
\[ bmg = \beta_0 +\beta_1 flipper\_length\_mm + \beta_2 dream + \beta_3 biscoe \]
\[ \hat{\boldsymbol\beta} = (\boldsymbol X^\mathrm T \boldsymbol X)^{-1} \boldsymbol X^\mathrm T \boldsymbol Y \]
Define \(\boldsymbol X\) in each case.
\[ bmg = \beta_0 + \beta_1 bill\_length\_mm + \beta_2 dream + \beta_3 biscoe \]
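A sketch for the first model; for the models with island dummies, the extra columns can be built with ifelse() or taken from model.matrix(). The result is \(\hat{\boldsymbol\beta}\) from the formula above:

```r
library(palmerpenguins)

# Drop rows with missing values so X and Y line up
d <- na.omit(penguins[, c("body_mass_g", "flipper_length_mm", "bill_length_mm")])

# Design matrix: a column of 1s for the intercept, then the predictors
X <- cbind(1, d$flipper_length_mm, d$bill_length_mm)
Y <- d$body_mass_g

# beta-hat = (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
beta_hat
```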