Data Manipulation

Announcements

Learning Objectives

  • Directories/R Projects

  • Reading/Writing Data

  • Merging Data

  • dplyr Functions

Directories/R Projects

Scripting Style Guide

Tidyverse Style Guide

Directories

Directories are the folders that make up the file system on your computer.

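A file path indicates the location of a file or folder. It can be absolute (starting from the root of the file system) or relative (starting from a reference folder, such as your home folder or the working directory).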
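As a small illustration of how a path is put together, the sketch below builds the relative path used later in these slides and splits it into its folder and file name. Only base R functions are used; the variable name rel_path is made up for this example.

```r
# Build a relative path from its parts; file.path() joins them with "/"
rel_path <- file.path("files", "data", "data_3_1.csv")
rel_path            # "files/data/data_3_1.csv"

# Split the path into the folder part and the file name
dirname(rel_path)   # "files/data"
basename(rel_path)  # "data_3_1.csv"
```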

Working Directory

This is the folder that R reads files from and saves files to when a full file path is not specified. Relative file paths are resolved from this folder.

To get the current working directory:

getwd()

To set the working directory:

setwd("new_file_path")
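The two functions above can be tried safely with a temporary folder. This is only a sketch: tempdir() stands in for a real project folder, and the file name example.txt and the variable old_wd are made up for illustration.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is used here only as a stand-in for a real project folder
setwd(tempdir())

# A file name without a folder is written to the working directory
writeLines("hello", "example.txt")
file.exists(file.path(tempdir(), "example.txt"))

# Restore the original working directory
setwd(old_wd)
```

Restoring the original working directory at the end keeps the rest of a script from depending on the temporary change.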

R Projects

R Projects are ways for RStudio to organize files together for specific

Reading/Writing Data

Read Data

  • The easiest way is to have RStudio do it for you

  • Use Base R functions

  • Use the readr package for tabular/text files

  • Use the readxl package for Excel files

  • Use the haven package to read SAS, SPSS, or Stata files
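To compare two of the options above, the sketch below reads the same small CSV file with Base R and with readr. The file is created in a temporary location so the example is self-contained; its contents and the names csv_path, base_df, and readr_df are made up for illustration.

```r
library(readr)

# Create a small CSV file in a temporary location (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,90.5", "2,85"), csv_path)

# Base R: returns a data.frame
base_df <- read.csv(csv_path)

# readr: returns a tibble; show_col_types = FALSE silences the
# message about the guessed column types
readr_df <- read_csv(csv_path, show_col_types = FALSE)

class(base_df)
class(readr_df)
```

The other packages follow the same pattern with their own functions, for example readxl::read_excel() for Excel files and haven::read_sas(), haven::read_sav(), and haven::read_dta() for SAS, SPSS, and Stata files.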

Example

library(readr)  # provides read_csv()
data1 <- read_csv("files/data/data_3_1.csv")  # relative path
data2 <- read_csv("/home/inqs/Repos/M408_S23/data/data_3_2.csv")  # absolute path

Example

Download the following zip file: data

Load the data sets data_3_1.csv and data_3_2.csv.

Example

Load the following data: https://m408.inqs.info/files/data/data_3_3.csv

Write Data

The readr package provides several functions for writing data to text files. (The readxl package only reads Excel files.)

I recommend using the write_csv() function and sharing data as CSV files.
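A minimal sketch of writing a CSV file and reading it back is shown below. The data frame scores and the temporary output path are made up for illustration; in practice the path would point into your project folder.

```r
library(readr)

# Hypothetical data frame to save
scores <- data.frame(id = 1:3, score = c(90.5, 85, 77.25))

# Write to a temporary file; replace out_path with a real path in practice
out_path <- tempfile(fileext = ".csv")
write_csv(scores, out_path)

# Read the file back to confirm what was written
scores_back <- read_csv(out_path, show_col_types = FALSE)
scores_back
```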

RData

RData is a file format specific to R. An RData file can store one or more R objects.

Load Data

load("data.RData")

Write Data

save(RObject, file = "data.RData")
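The sketch below shows a full round trip with save() and load(). The objects fit_data and note and the temporary file path are made up for illustration. Note that load() restores the objects under their original names instead of returning them.

```r
# Hypothetical objects to store
fit_data <- data.frame(x = 1:3, y = c(2, 4, 6))
note <- "example objects"

# Save both objects into one RData file (temporary path for the example)
rdata_path <- tempfile(fileext = ".RData")
save(fit_data, note, file = rdata_path)

# Remove the objects, then restore them from the file
rm(fit_data, note)
loaded_names <- load(rdata_path)

# load() returns the names of the objects it restored
loaded_names
```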

Merging Data

*_join()

  • The *_join() functions are used to merge two data frames by one or more shared key variables.
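The members of the *_join() family differ in which rows they keep. The sketch below uses two small made-up data frames, students and grades, that share an id column; these are not the course data sets.

```r
library(dplyr)

# Two hypothetical data frames sharing the key variable id
students <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
grades   <- data.frame(id = c(2, 3, 4), grade = c("A", "B", "C"))

# full_join(): keeps every id found in either data frame
full_df <- full_join(students, grades, by = "id")

# inner_join(): keeps only ids found in both data frames
inner_df <- inner_join(students, grades, by = "id")

# left_join(): keeps every id from the first data frame
left_df <- left_join(students, grades, by = "id")

full_df
```

Rows without a match in the other data frame are filled with NA, so id 1 has a missing grade and id 4 has a missing name in full_df.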

Example

Merge the data sets data_3_1.csv and data_3_2.csv using full_join().

dplyr Functions

mutate()

  • Adds a new variable to a data frame

  • Example:

mtcars %>%
  mutate(log_mpg = log(mpg)) %>%
  head()
#> mutate: new variable 'log_mpg' (double) with 25 unique values and 0% NA
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.063391
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.928524
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 2.895912

mutate()

  • Each argument is a new variable added

  • Example:

mtcars %>%
  mutate(log_mpg = log(mpg), log_hp = log(hp)) %>%
  head()
#> mutate: new variable 'log_mpg' (double) with 25 unique values and 0% NA
#>         new variable 'log_hp' (double) with 22 unique values and 0% NA
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.063391
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.928524
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 2.895912
#>                     log_hp
#> Mazda RX4         4.700480
#> Mazda RX4 Wag     4.700480
#> Datsun 710        4.532599
#> Hornet 4 Drive    4.700480
#> Hornet Sportabout 5.164786
#> Valiant           4.653960

Example

Using the penguins dataset from palmerpenguins, create a new variable that is the natural log (ln) of flipper_length_mm.

select()

  • Selects the variables to keep in the data frame

  • Example:

mtcars %>%
  mutate(log_mpg = log(mpg), log_hp = log(hp)) %>%
  select(mpg, log_mpg, hp, log_hp) %>%
  head()
#> mutate: new variable 'log_mpg' (double) with 25 unique values and 0% NA
#>         new variable 'log_hp' (double) with 22 unique values and 0% NA
#> select: dropped 9 variables (cyl, disp, drat, wt, qsec, …)
#>                    mpg  log_mpg  hp   log_hp
#> Mazda RX4         21.0 3.044522 110 4.700480
#> Mazda RX4 Wag     21.0 3.044522 110 4.700480
#> Datsun 710        22.8 3.126761  93 4.532599
#> Hornet 4 Drive    21.4 3.063391 110 4.700480
#> Hornet Sportabout 18.7 2.928524 175 5.164786
#> Valiant           18.1 2.895912 105 4.653960

Example

Using the penguins dataset from palmerpenguins, select only the variables that are continuous.

filter()

  • Selects observations that satisfy a condition

  • Example:

mtcars %>%
  mutate(log_mpg = log(mpg), log_hp = log(hp)) %>%
  select(mpg, log_mpg, hp, log_hp) %>%
  filter(log_hp < 5) %>%
  head()
#> mutate: new variable 'log_mpg' (double) with 25 unique values and 0% NA
#>         new variable 'log_hp' (double) with 22 unique values and 0% NA
#> select: dropped 9 variables (cyl, disp, drat, wt, qsec, …)
#> filter: removed 15 rows (47%), 17 rows remaining
#>                 mpg  log_mpg  hp   log_hp
#> Mazda RX4      21.0 3.044522 110 4.700480
#> Mazda RX4 Wag  21.0 3.044522 110 4.700480
#> Datsun 710     22.8 3.126761  93 4.532599
#> Hornet 4 Drive 21.4 3.063391 110 4.700480
#> Valiant        18.1 2.895912 105 4.653960
#> Merc 240D      24.4 3.194583  62 4.127134

Example

Using the penguins dataset from palmerpenguins, filter the data set to keep only penguins of the Gentoo species.

if_else()

  • Returns one value where the condition is TRUE and another value where it is FALSE (1 and 0 in the example below)

  • Example:

mtcars %>%
  mutate(log_mpg = log(mpg), log_hp = log(hp)) %>%
  select(mpg, log_mpg, hp, log_hp) %>%
  filter(log_hp < 5) %>%
  mutate(hilhp = if_else(log_hp > mean(log_hp), 1, 0)) %>%
  head()
#> mutate: new variable 'log_mpg' (double) with 25 unique values and 0% NA
#>         new variable 'log_hp' (double) with 22 unique values and 0% NA
#> select: dropped 9 variables (cyl, disp, drat, wt, qsec, …)
#> filter: removed 15 rows (47%), 17 rows remaining
#> mutate: new variable 'hilhp' (double) with 2 unique values and 0% NA
#>                 mpg  log_mpg  hp   log_hp hilhp
#> Mazda RX4      21.0 3.044522 110 4.700480     1
#> Mazda RX4 Wag  21.0 3.044522 110 4.700480     1
#> Datsun 710     22.8 3.126761  93 4.532599     1
#> Hornet 4 Drive 21.4 3.063391 110 4.700480     1
#> Valiant        18.1 2.895912 105 4.653960     1
#> Merc 240D      24.4 3.194583  62 4.127134     0
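The two return values of if_else() do not have to be 1 and 0; they only need to be of the same type. As a sketch, the variant below labels each car with text instead. The variable name hp_group and the label wording are made up for illustration.

```r
library(dplyr)

# Label each car by comparing its horsepower with the overall mean
hp_labels <- mtcars %>%
  mutate(hp_group = if_else(hp > mean(hp),
                            "above average",
                            "at or below average")) %>%
  select(hp, hp_group)

head(hp_labels)
```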

Example

Using the penguins dataset from palmerpenguins, create a new variable that indicates whether a penguin's bill is longer than the average bill_length_mm.

group_by()

  • Groups the data frame by one or more variables, so that later operations are carried out within each group

  • Example:

mtcars %>%
  mutate(log_mpg = log(mpg), log_hp = log(hp)) %>%
  select(mpg, log_mpg, hp, log_hp) %>%
  filter(log_hp < 5) %>%
  mutate(hilhp = if_else(log_hp > mean(log_hp), 1, 0)) %>%
  group_by(hilhp) %>%
  head()
#> mutate: new variable 'log_mpg' (double) with 25 unique values and 0% NA
#>         new variable 'log_hp' (double) with 22 unique values and 0% NA
#> select: dropped 9 variables (cyl, disp, drat, wt, qsec, …)
#> filter: removed 15 rows (47%), 17 rows remaining
#> mutate: new variable 'hilhp' (double) with 2 unique values and 0% NA
#> group_by: one grouping variable (hilhp)
#> # A tibble: 6 × 5
#> # Groups:   hilhp [2]
#>     mpg log_mpg    hp log_hp hilhp
#>   <dbl>   <dbl> <dbl>  <dbl> <dbl>
#> 1  21      3.04   110   4.70     1
#> 2  21      3.04   110   4.70     1
#> 3  22.8    3.13    93   4.53     1
#> 4  21.4    3.06   110   4.70     1
#> 5  18.1    2.90   105   4.65     1
#> 6  24.4    3.19    62   4.13     0
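Grouping by itself does not change the values in the data frame; it becomes useful when a later step computes something per group. As a sketch, the example below pairs group_by() with dplyr's summarise(), which is not shown in the slides above, to get one row per group. The names mpg_by_cyl, mean_mpg, and n_cars are made up for illustration.

```r
library(dplyr)

# One row per number of cylinders, with the mean mpg and the group size
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

mpg_by_cyl
```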

Example

Using the penguins dataset from palmerpenguins, group by species and find the average of the natural log (ln) of flipper_length_mm.