Data Manipulation

Learning Objectives

  • tidyr Functions

  • Wide to Long Example

  • Plotting with ggplot2

tidyr

tidyr Functions

A set of functions that will tidy up a data set such that:

  • Every column is a variable

  • Every row is an observation

  • Every cell is a single value

pivot_longer()

  • The pivot_longer() function gathers a set of columns that hold repeated measurements for each observation and stacks them into two columns: one holding the former column names and one holding the values.
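
As a minimal sketch of what pivot_longer() does, the toy data frame below (hypothetical names id, visit1, visit2, and score, not taken from the course data) has one row per subject and one column per visit. After pivoting, each subject-visit pair becomes its own row.

```r
library(tidyr)

# Hypothetical wide data: one row per subject, one column per visit
wide <- data.frame(
  id     = 1:2,
  visit1 = c(5.1, 6.3),
  visit2 = c(5.4, 6.0)
)

# Stack visit1 and visit2 into a name column and a value column
long <- wide |>
  pivot_longer(cols = visit1:visit2,
               names_to = "visit",
               values_to = "score")

long  # 4 rows: one per subject-visit combination
```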

pivot_wider()

  • The pivot_wider() function converts long data to wide data by spreading the values of one column into several new columns.
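
The reverse direction can be sketched the same way. The toy long data below is hypothetical (the columns id, stat, and value are made up for illustration): each distinct entry in stat becomes a new column filled from value.

```r
library(tidyr)

# Hypothetical long data: one row per subject-statistic pair
long <- data.frame(
  id    = c(1, 1, 2, 2),
  stat  = c("mean", "sd", "mean", "sd"),
  value = c(10, 2, 12, 3)
)

# Spread the entries of stat into their own columns
wide <- long |>
  pivot_wider(names_from = stat,
              values_from = value)

wide  # 2 rows with columns id, mean, sd
```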

separate()

  • The separate() function splits one variable into multiple variables at a given separator.
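
A small sketch of separate() on hypothetical data: the measurement column packs two pieces of information into one string, and splitting at "/" gives each piece its own column. Recent tidyr releases mark separate() as superseded in favor of separate_wider_delim(), but it still works as shown.

```r
library(tidyr)

# Hypothetical data: time point and statistic packed into one string
df <- data.frame(
  measurement = c("v1/mean", "v1/sd", "v2/mean"),
  value       = c(1.2, 0.4, 1.5)
)

# Split measurement at "/" into two new columns
out <- df |>
  separate(col = measurement,
           into = c("time", "stat"),
           sep = "/")

out  # columns: time, stat, value
```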

Example

Wide to Long Data Example

We will convert data from wide to long using the functions in the tidyr package. Many statistical analyses require long data.

Load Data

Use read_csv() to read data_3_4.csv into an object called data1:

library(tidyverse)  # assumed setup: provides read_csv(), the tidyr functions, and %>%
data1 <- read_csv(file="http://www.inqs.info/files/hiss_3/data_3_4.csv")

Wide Data

Long Data

pivot_longer()

  • The pivot_longer() function gathers the repeated measurement columns (v1/mean through v4/median) into a pair of columns: measurement, which holds the former column names, and value, which holds the numbers:
df1 <- data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,
               names_to = "measurement",
               values_to = "value")
#> pivot_longer: reorganized (v1/mean, v1/sd, v1/median, v2/mean, v2/sd, …) into (measurement, value) [was 1000x13, now 12000x3]

separate()

  • The separate() function splits one variable into multiple variables; here measurement is split at "/" into time and stat:
df2 <- data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,
               names_to = "measurement",
               values_to = "value") %>% 
  separate(col=measurement,into=c("time","stat"),sep="/")
#> pivot_longer: reorganized (v1/mean, v1/sd, v1/median, v2/mean, v2/sd, …) into (measurement, value) [was 1000x13, now 12000x3]

pivot_wider()

  • The pivot_wider() function converts long data back to wide data; here each statistic in stat (mean, sd, median) becomes its own column:
df3 <- data1 %>% 
  pivot_longer(`v1/mean`:`v4/median`,
               names_to = "measurement", 
               values_to = "value") %>% 
  separate(measurement,c("time","stat"),sep="/") %>% 
  pivot_wider(names_from = stat,
              values_from = value)
#> pivot_longer: reorganized (v1/mean, v1/sd, v1/median, v2/mean, v2/sd, …) into (measurement, value) [was 1000x13, now 12000x3]
#> pivot_wider: reorganized (stat, value) into (mean, sd, median) [was 12000x4, now 4000x5]
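
The three-step pipeline above can also be collapsed into a single pivot_longer() call. This is a sketch on hypothetical toy data shaped like data1 (the id column name and all values are assumptions, since the real file is not shown here). The special ".value" entry in names_to tells pivot_longer() that the part of each column name after "/" should become a column name rather than a value.

```r
library(tidyr)

# Hypothetical stand-in for data1 (real column names other than v*/stat are unknown)
toy <- data.frame(
  id        = 1:2,
  `v1/mean` = c(10, 11),
  `v1/sd`   = c(2, 3),
  `v2/mean` = c(12, 13),
  `v2/sd`   = c(4, 5),
  check.names = FALSE  # keep the "/" in the column names
)

# Split names at "/": first part goes to time, second part names the output columns
tidy <- toy |>
  pivot_longer(cols = -id,
               names_to = c("time", ".value"),
               names_sep = "/")

tidy  # columns: id, time, mean, sd
```

Whether to prefer this over the explicit pivot_longer(), separate(), pivot_wider() chain is a matter of taste; the longer chain makes each intermediate shape visible, which can be easier to follow when learning.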

ggplot2

ggplot2

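ggplot2 is an R package for creating plots. The main idea is to start from a data frame and a set of aesthetics (mappings from variables in the data frame to visual properties such as x position, y position, or color) to create a base plot. ggplot2 then layers geometries (the shapes that draw the data, such as histograms or points) onto the base plot to build a data visualization.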

Each new change to the plot is layered on with the + operator.
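
To make the layering idea concrete, here is a minimal sketch using the built-in mtcars data. The object names base and p are just illustrative. The base plot is stored first, and each + adds one more layer on top of it.

```r
library(ggplot2)

# Base plot: data and aesthetics only, nothing drawn yet
base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Add a layer of points, then a fitted straight line on top
p <- base +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p  # printing the object draws the plot
```

Because a plot is an ordinary R object, it can be saved, extended later with more layers, or reused as the starting point for several variations.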

Base Plot

mtcars |> ggplot(aes(x = mpg)) 

Histogram

mtcars |> ggplot(aes(x = mpg)) +
  geom_histogram()

Box Plot

mtcars |> ggplot(aes(x = mpg)) +
  geom_boxplot()

Density Plot

mtcars |> ggplot(aes(x = mpg)) +
  geom_density()

Box Plot By Category

mtcars |> ggplot(aes(x = mpg, y = as.factor(cyl))) +
  geom_boxplot()

Density Plot By Category

mtcars |> ggplot(aes(x = mpg, color = as.factor(cyl))) +
  geom_density()

Scatter Plot

mtcars |> ggplot(aes(x = wt, y = mpg)) +
  geom_point()

Scatter Plot by Group

mtcars |> ggplot(aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point()

Add Regression Line

mtcars |> ggplot(aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Smooth Line

mtcars |> ggplot(aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(se = FALSE)

Regression Lines by Group

mtcars |> ggplot(aes(x = wt, y = mpg,
                  color = as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Example

Using the penguins data set from the palmerpenguins package, create any plot and make it publication-ready. Use the following resources to customize the plot: R Graphics Cookbook, R Graph Gallery, R Charts, and ggplot2.
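
As a starting point, the sketch below shows a few common customizations on the mtcars scatter plot from earlier rather than on penguins, so the exercise itself is left open. The label text and theme choice are illustrative assumptions, not requirements.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point(size = 2) +
  # Readable titles and axis labels instead of raw variable names
  labs(title = "Fuel efficiency by car weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Cylinders") +
  # A cleaner built-in theme with a larger base font
  theme_bw(base_size = 14) +
  theme(legend.position = "bottom")

p
```

The same pattern (labs() for text, a theme_*() function for overall look, theme() for fine adjustments) carries over to a plot built from the penguins data.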