Saturday, April 25, 2020

V is for Verbs

In this series, I've covered five terms for data manipulation:
  • arrange
  • filter
  • mutate
  • select
  • summarise
These verbs make up the grammar of data manipulation. Each one works with group_by, which makes the verb operate within each group instead of on the whole dataset.
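As a quick refresher on what groupwise means, here is a minimal sketch using a made-up tibble (the books data and its genre and pages columns are hypothetical, not the reading dataset from this series). The same summarise call returns one row overall, or one row per group once group_by comes first.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
books <- tibble(
  genre = c("fantasy", "fantasy", "mystery", "mystery"),
  pages = c(300, 500, 200, 400)
)

# Without group_by: a single row summarising the whole table
overall <- books %>%
  summarise(mean_pages = mean(pages))

# With group_by: one row per genre
by_genre <- books %>%
  group_by(genre) %>%
  summarise(mean_pages = mean(pages))
```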

There are also scoped versions of these verbs, which add the suffix _all, _if, or _at and let you apply a verb to multiple variables at once: _all covers every variable, _if covers variables that meet a condition, and _at covers variables you name. For instance, I could get means for all of my numeric variables like this. (Quick note: I created an updated reading dataset that has all publication years filled in. You can download it here.)
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )
reads2019 %>%
  summarise_if(is.numeric, list(mean))
## # A tibble: 1 x 13
##   Pages Book.ID AverageRating OriginalPublica… read_time MyRating Gender Fiction
##   <dbl>   <dbl>         <dbl>            <dbl>     <dbl>    <dbl>  <dbl>   <dbl>
## 1  341.  1.36e7          3.94            1989.      3.92     4.14  0.310   0.931
## # … with 5 more variables: Childrens <dbl>, Fantasy <dbl>, SciFi <dbl>,
## #   Mystery <dbl>, SelfHelp <dbl>
This function generated the mean for every numeric variable in my dataset. But even though they're all numeric, the mean isn't a meaningful statistic for many of them, such as book ID or publication year. To generate means only for specific variables, use summarise_at.
reads2019 %>%
  summarise_at(vars(Pages, AverageRating, read_time, MyRating), list(mean))
## # A tibble: 1 x 4
##   Pages AverageRating read_time MyRating
##   <dbl>         <dbl>     <dbl>    <dbl>
## 1  341.          3.94      3.92     4.14
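The remaining suffix, _all, needs no condition and no vars() call, because it applies the function to every column. Here is a minimal sketch on a made-up tibble (ratings and its columns are hypothetical). I kept every column numeric, since taking the mean of a character column would only give NA with a warning.

```r
library(dplyr)

# Hypothetical toy data with only numeric columns
ratings <- tibble(
  pages = c(300, 500, 200, 400),
  my_rating = c(4, 5, 3, 4)
)

# summarise_all applies mean to every column in the table
all_means <- ratings %>%
  summarise_all(mean)
```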
You can also request more than one statistic in your list and give each one a name. R appends that name to the variable name as a suffix, so the result has columns such as Pages_mean and Pages_median.
numeric_summary <- reads2019 %>%
  summarise_at(vars(Pages, AverageRating, read_time, MyRating), list("mean" = mean, "median" = median))
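To see what a named list produces without loading the reading dataset, here is a small sketch on a made-up tibble (toy and its columns are hypothetical). Each output column is named by joining the variable name and the name given in the list.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  pages = c(100, 200, 600),
  my_rating = c(3, 4, 5)
)

# Two named functions applied to two variables gives four columns
toy_summary <- toy %>%
  summarise_at(vars(pages, my_rating), list(mean = mean, median = median))

# Column names combine variable and function name, e.g. pages_mean, pages_median
names(toy_summary)
```

Comparing pages_mean with pages_median in this toy data shows why asking for both can be useful: the single 600-page book pulls the mean well above the median.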
I use the basic verbs anytime I use R. I only learned about scoped verbs recently, and I'm sure I'll add them to my toolkit over time.

Next week is the last week of Blogging A to Z! See you then!
