Monday, April 6, 2020

E is for Exposition Pipe

For the letter E, I want to talk about a set of operators provided by tidyverse (specifically the magrittr package) that makes for much prettier, easier-to-read code: pipes. The main pipe %>% pushes the object to the left of it forward into functions on the right, so that instead of coding f(x), it would be x %>% f(). This lets you chain together multiple functions to apply to a single object.
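
To make the f(x) versus x %>% f() idea concrete, here is a minimal sketch using a made-up numeric vector (x is just an illustration, not part of the reading dataset). Both versions should compute the same thing; the piped one simply reads left to right.

```r
library(magrittr)

# Hypothetical toy vector, used only for illustration
x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: x is passed to sqrt(), and that result is passed to sum()
piped <- x %>% sqrt() %>% sum()

# Check that the two styles agree
identical(nested, piped)
```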

One of these pipes is known as the exposition pipe, and it looks like this: %$%. Many base R functions have no data argument (i.e., data = ...), so when you specify one or more variables, you also have to specify the data frame each time (formatted as data$variable). For instance, to run a correlation between two variables, you'd need to write the code as cor(data$variable1, data$variable2).
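
Here is a small sketch of that difference, assuming R's built-in mtcars data as a stand-in for the generic data frame above. lm() accepts a data argument, while cor() does not, so the column names have to be spelled out with $.

```r
# lm() has a data argument, so bare column names are enough
fit <- lm(mpg ~ wt, data = mtcars)

# cor() has no data argument, so this call fails;
# tryCatch() captures the error message instead of stopping the script
res <- tryCatch(
  cor(mpg, wt, data = mtcars),
  error = function(e) conditionMessage(e)
)

# The base R way: name the data frame for every variable
r <- cor(mtcars$mpg, mtcars$wt)
```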

The exposition pipe exposes the columns of a data frame to the function, so you only need to specify the data frame once. This cleans up your code, while also allowing these functions to be embedded in a larger chain of data wrangling and manipulation. Note that library(tidyverse) only attaches the main pipe, so you need to load magrittr itself to get %$%. Here's the exposition pipe in action (but first, I've created a new version of the dataset that has a single Rating variable, which we created previously with the coalesce function, so you'll want to download and use that file going forward):
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
reads2019 <- read_csv("~/Downloads/Blogging A to Z/SarasReads2019_allrated.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )
reads2019 %$%
  cor(MyRating, read_time)
## [1] -0.03762191
This is the correlation between my rating of the book and the number of days it took to read it. It's a negative correlation, meaning the more I liked a book, the faster I got through it, which makes sense, but at about -0.04 it's also very weak. But what if I wanted to create a variable and then perform a correlation with that new variable? I can do that by combining the main pipe and the exposition pipe:
reads2019 %>%
  mutate(DifRating = MyRating - AverageRating) %$%
  cor(DifRating, read_time)
## [1] 0.007745212
The variable I created is the difference between my rating and the average rating from all Goodreads users. The correlation is basically 0, so there seems to be no relationship between how much more highly I rate a book (compared to others) and how quickly I finish it.
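
If you don't have the reading file handy, the same pattern can be sketched on R's built-in mtcars data (a stand-in, not the dataset from this post). As far as I know, %$% behaves like base R's with() for data frames, so the calls below should all agree. The wt_kg column is a hypothetical variable created only to show the main pipe feeding into the exposition pipe.

```r
library(magrittr)

# Exposition pipe: mpg and wt are looked up inside mtcars
r_expo <- mtcars %$% cor(mpg, wt)

# Base R equivalents: with(), and the fully spelled-out $ form
r_with <- with(mtcars, cor(mpg, wt))
r_dollar <- cor(mtcars$mpg, mtcars$wt)

# Main pipe plus exposition pipe: create a column, then correlate with it.
# wt is in 1000s of pounds; wt_kg is a hypothetical rescaled copy.
r_chain <- mtcars %>%
  transform(wt_kg = wt * 453.6) %$%
  cor(mpg, wt_kg)

# Rescaling a variable does not change its correlation,
# so all four values should match
all.equal(r_expo, r_with)
all.equal(r_expo, r_dollar)
all.equal(r_expo, r_chain)
```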

Tomorrow, we'll talk about filters!


2 comments:

  1. or you could use %>% with(cor(MyRating, read_time))

    base R has many undiscovered functions.

    1. My favorite base R function is library(tidyverse)
