Intro to Data Manipulation

Learning Objectives

  • Anonymous Functions

  • Pipes

  • Scripting

  • Data Manipulation

Anonymous Functions

An anonymous function is a function that is not stored as an R object in the global environment. It can be thought of as a temporary function created to complete a single task. A common way to use an anonymous function is inside an *apply() function. It can be written with the function keyword or, as of R 4.1, with the backslash shorthand \(x):

x <- 1:10
sapply(x, function(x) rnorm(1,x))
#>  [1]  0.9645593  2.4985323  3.6130830  3.4038356  6.6371189  3.9822256
#>  [7]  7.6365972  8.9805052 12.6891921  7.6383656
x <- 1:10
sapply(x, \(x) rnorm(1,1,x))
#>  [1]   1.1161187   2.8188918  -1.8337787   3.9964652  -3.8217755  -1.3573966
#>  [7]  -1.4840088  -0.1488438  19.3957272 -12.3938028
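
Because the calls above draw random numbers, their output changes on every run. The sketch below uses a deterministic function instead, to check that the two ways of writing an anonymous function give the same result. The vector name vals and the function body are hypothetical, chosen only for illustration.

```r
# Hypothetical input vector, used only for this illustration
vals <- 1:5

# Long form: an unnamed function defined inline with the function keyword
long_form <- sapply(vals, function(v) v * 2 + 1)

# Short form: the backslash shorthand (requires R 4.1 or later)
short_form <- sapply(vals, \(v) v * 2 + 1)

# Both forms create the same function, so the results match
identical(long_form, short_form)
```

Neither function is assigned a name, so nothing is added to the global environment apart from the two result vectors.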

Example

Use an anonymous function to square all the values in the following vector:

# Use an anonymous function to calculate the square of each element in a vector
numbers <- 1:40

Example

Use an anonymous function to convert the vector from Fahrenheit to Celsius:

# Create a vector of temperatures in Fahrenheit
temperatures_f <- c(32, 68, 104, 50)

\[ C = \frac{5(F-32)}{9} \]
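
One possible way to apply the conversion formula is sketched below. The result name temperatures_c is an assumption added for illustration; other solutions are equally valid.

```r
# Vector of temperatures in Fahrenheit, as given in the exercise
temperatures_f <- c(32, 68, 104, 50)

# Apply C = 5(F - 32) / 9 to each element with an anonymous function
temperatures_c <- sapply(temperatures_f, \(f) 5 * (f - 32) / 9)

temperatures_c
```

Since arithmetic in R is vectorised, 5 * (temperatures_f - 32) / 9 would give the same values; sapply() is used here only to practise anonymous functions.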

Pipes

Pipes pass the output of one function to another function as its input. The output is piped into the first argument of the next function. There are two main kinds of pipe: R's base pipe and the magrittr pipes. To use the magrittr pipes, you must install the magrittr package once and load it every time you start a new R session:

library(magrittr)

Additionally, pipes can be used to chain functions together.

|>

Before version 4.1, base R did not include a pipe. The base pipe, |>, takes the output of the operation on its left and passes it as the first argument of the function on its right.

x <- 1:40
x |> mean()
#> [1] 20.5
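
As a rough illustration of chaining, the sketch below compares a nested call with the same steps written using the base pipe. The choice of sqrt() and round() is arbitrary and not part of the original example.

```r
x <- 1:40

# Nested form: must be read from the inside out
nested <- round(sqrt(mean(x)), 2)

# Piped form: reads left to right, one step at a time (requires R 4.1+)
piped <- x |> mean() |> sqrt() |> round(2)

# Both forms perform the same computation
identical(nested, piped)
```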

%>%

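The magrittr pipe, %>%, works in much the same way as |>. It also lets you drop the parentheses when the next function takes no other arguments, and use a dot (.) to place the piped value somewhere other than the first argument. Below are a few examples: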

x <- 1:10
x %>%  mean()
#> [1] 5.5
x %>% sd
#> [1] 3.02765
x %>% rnorm(1, .)
#> [1] 2.281058
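
The dot placeholder in the last call can be hard to see with random output. A minimal deterministic sketch, assuming magrittr is installed, is shown below; the use of seq() is an illustrative choice only.

```r
library(magrittr)

# Default behaviour: the left-hand side becomes the first argument,
# so this is seq(4, 10)
first_arg <- 4 %>% seq(10)

# With the dot placeholder, the left-hand side goes where the dot is,
# so this is seq(1, 10, by = 4)
by_arg <- 4 %>% seq(1, 10, by = .)

first_arg
by_arg
```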

%$%

The exposition pipe, %$%, exposes the named elements of a list or data frame to the next function, so they can be referred to directly by name.

mtcars %$% plot(mpg, hp)
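
A plot is hard to compare across approaches, so the sketch below uses cor() instead to show what the exposition pipe does. It assumes magrittr is installed; the comparison with $ and with() is added for illustration.

```r
library(magrittr)

# Exposition pipe: the columns mpg and hp are visible by name
piped <- mtcars %$% cor(mpg, hp)

# Equivalent calls without the pipe
direct <- cor(mtcars$mpg, mtcars$hp)
with_base <- with(mtcars, cor(mpg, hp))

# All three give the same correlation
all.equal(piped, direct)
all.equal(piped, with_base)
```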

%T>%

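The tee pipe, %T>%, passes its left-hand side to the next function for its side effect (such as printing or plotting) and then forwards the original left-hand value, not that function's result, to the rest of the chain.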

sin_40 <- 1:40 %>% mean %T>% print %>% sin
#> [1] 20.5
print(sin_40)
#> [1] 0.9968298

%T>%

rnorm(100) %>% 
  matrix(ncol=2) %>% 
  sin() %T>% 
  plot() %>% 
  colSums()
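
To see why the tee pipe matters in a chain like the one above, the sketch below uses str(), which (like plot()) is called for its side effect and returns NULL. The small matrix m is a hypothetical stand-in for the random data, and magrittr is assumed to be installed.

```r
library(magrittr)

# Small hypothetical matrix so the result is reproducible
m <- matrix(1:6, ncol = 2)

# Tee pipe: str() runs for its side effect, then colSums() receives m
tee_sums <- m %T>% str() %>% colSums()

# Regular pipe: colSums() receives the NULL returned by str() and fails;
# tryCatch() turns that error into NULL so the script keeps running
plain_sums <- tryCatch(m %>% str() %>% colSums(), error = function(e) NULL)

tee_sums
is.null(plain_sums)
```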

Examples

Using the vector below, find the standard deviation using a pipe:

x <- rgamma(100, 1)
sd(x)

Examples

Using a pipe, chain the previous result into \(\sin(x)\).

Examples

Using a pipe, chain the previous result into \(e^x\).

Examples

Using a pipe, chain the previous result into \(x^2+5x+4\).
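
One possible solution to this sequence of exercises is sketched below with the base pipe. The call to set.seed() and the step_* names are assumptions added so the sketch is reproducible; they are not part of the exercise.

```r
set.seed(1)  # assumption: fixed seed so the random vector is reproducible
x <- rgamma(100, 1)

# Each step extends the previous chain by one function
step_sd  <- x |> sd()
step_sin <- x |> sd() |> sin()
step_exp <- x |> sd() |> sin() |> exp()

# The polynomial has no built-in function, so an anonymous function is
# wrapped in parentheses and called with () on the right of the pipe
step_poly <- x |> sd() |> sin() |> exp() |> (\(v) v^2 + 5 * v + 4)()

step_poly
```

The last step ties the two topics together: an anonymous function is a convenient way to pipe into an expression that is not already a named function.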

Data Projects

Scripting

The structure of a script is important for ensuring that every step is executed properly and in the right order.

Beginning of Script

## Today's date
analysis_data <- format(Sys.time(),"%Y-%m-%d-%H-%M")

## R Packages
library(tidyverse)
library(magrittr)

## Functions
source("fxs.R")
Rcpp::sourceCpp("fxs.cpp")

## Data
df1 <- read_csv("file.csv")
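# load() returns the names of the loaded objects; get() retrieves the object
# itself (this assumes file.RData holds a single object)
df2 <- load("file.RData") %>% get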

Middle of Script

## Pre Analysis
df1_prep <- Prep_data(df1)
df2_prep <- Prep_data(df2)

## Analysis
df1_analysis <- analyze(df1_prep)
df2_analysis <- analyze(df2_prep)

## Post Analysis
df1_post <- Prep_post(df1_analysis)
df2_post <- Prep_post(df2_analysis)

End of Script

## Save Results
res <- list(df1 = list(pre = df1_prep,
                       analysis = df1_analysis,
                       post = df1_post),
            df2 = list(pre = df2_prep,
                       analysis = df2_analysis,
                       post = df2_post))
file_name <- paste0("results_", analysis_data, ".RData")
save(res, file = file_name)
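
The save step can be checked by loading the file back in. The sketch below is a self-contained illustration only: the results list holds placeholder values, the file is written to a temporary directory, and the names stamp, demo_res, demo_file, and restored are hypothetical.

```r
# Time stamp in the same format as the script above
stamp <- format(Sys.time(), "%Y-%m-%d-%H-%M")

# Placeholder results; a real script would store the analysis output here
demo_res <- list(df1 = list(pre = 1:3, analysis = "placeholder"))

# Write to a temporary directory so the working directory is left untouched
demo_file <- file.path(tempdir(), paste0("results_", stamp, ".RData"))
save(demo_res, file = demo_file)

# load() restores objects under their original names and returns those names;
# loading into a fresh environment avoids overwriting existing objects
restored <- new.env()
loaded_names <- load(demo_file, envir = restored)

loaded_names
identical(restored$demo_res, demo_res)
```

Including the time stamp in the file name means that rerunning the script at a later time does not overwrite earlier results.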

Keyboard Shortcuts

Below is a list of recommended keyboard shortcuts:

Action              Windows/Linux       Mac
------------------  ------------------  ------------------
%>%                 Ctrl+Shift+M        Cmd+Shift+M
Run Current Line    Ctrl+Enter          Cmd+Return
Run Current Chunk   Ctrl+Shift+Enter    Cmd+Shift+Enter
Knit Document       Ctrl+Shift+K        Cmd+Shift+K
Add Cursor Below    Ctrl+Alt+Down       Cmd+Alt+Down
Comment Line        Ctrl+Shift+C        Cmd+Shift+C

I recommend modifying the keyboard shortcuts in RStudio to add the following:

Insert              Windows/Linux       Mac
------------------  ------------------  ------------------
%in%                Ctrl+Shift+I        Cmd+Shift+I
%$%                 Ctrl+Shift+D        Cmd+Shift+D
%T>%                Ctrl+Shift+T        Cmd+Shift+T

Note that you will need to install the extraInserts package first:

remotes::install_github('konradzdeb/extraInserts')

Data Manipulation

Tidyverse

Tidyverse is a collection of R packages used for data manipulation. The dplyr package is known as a grammar of data manipulation, built around a set of verbs for the most common tasks.

Verbs

  • mutate() adds new variables
  • select() selects variables
  • filter() filters data
  • if_else() returns one of two values depending on a condition
  • group_by() groups a dataset by one or more variables
  • summarise() provides summaries of data
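
A minimal sketch that uses each of the verbs above is shown below. It assumes the dplyr package is installed and uses the built-in mtcars data; the 3 (1000 lbs) weight cutoff and the column name weight_class are arbitrary choices made for illustration.

```r
library(dplyr)

cars_summary <- mtcars %>%
  select(mpg, cyl, wt) %>%                    # keep three variables
  filter(cyl != 6) %>%                        # drop the 6-cylinder cars
  mutate(weight_class = if_else(wt > 3,       # add a new variable based
                                "heavy",      # on a condition
                                "light")) %>%
  group_by(weight_class) %>%                  # one group per weight class
  summarise(mean_mpg = mean(mpg),             # one summary row per group
            n = n())

cars_summary
```

Each verb takes a data frame as its first argument and returns a data frame, which is what allows the steps to be chained with a pipe.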

Example

library(palmerpenguins)
sum_stats <- penguins %>% 
  drop_na %>% 
  filter(year==2007) %>% 
  group_by(island) %>% 
  summarise(mean = mean(bill_length_mm),
            sd = sd(bill_length_mm),
            median = median(bill_length_mm),
            n = length(bill_length_mm)) %>% 
  print
#> # A tibble: 3 × 5
#>   island     mean    sd median     n
#>   <fct>     <dbl> <dbl>  <dbl> <int>
#> 1 Biscoe     45.1  4.80   46.1    43
#> 2 Dream      44.7  5.64   45.4    45
#> 3 Torgersen  39.0  2.92   39.1    15