Thursday, April 2, 2020

B is for bind_rows

Moving on to the letter B, today we'll talk about merging datasets that contain the same variables but add new cases. This is easily done with bind_rows. Let's say I realized I forgot to log some of the books I read last year, and I wanted to merge those into my existing dataset. I selected a handful of books from my to-read list, generated some read time and rating data, and saved the results in a csv file (which you can find here). Now I want to load my existing dataset and the new one:

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SarasReads2019.csv", col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

addreads <- read_csv("~/Downloads/Blogging A to Z/SarasAdds.csv")

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )
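
Both files parse to the same 18 columns with the same types, which is what lets the two datasets stack cleanly. If you aren't sure your own files line up, a quick check before binding can save some confusion. Here's a minimal sketch using two made-up tibbles (a and b are hypothetical stand-ins, not the real book files), where one file has Pages read in as text:

```r
library(dplyr)

# Hypothetical toy data: b has Pages stored as character by mistake
a <- tibble(Title = "Book A", Pages = 320, date_read = "6/18/2019")
b <- tibble(Title = "Book B", Pages = "480", date_read = "8/21/2019")

# Column names that appear in one dataset but not the other
only_in_a <- setdiff(names(a), names(b))
only_in_b <- setdiff(names(b), names(a))

# Columns that share a name but differ in type
classes_a <- sapply(a, class)
classes_b <- sapply(b, class)
mismatched <- names(classes_a)[classes_a != classes_b[names(classes_a)]]
mismatched
```

Here mismatched flags Pages, since it's numeric in one tibble and character in the other. Binding a numeric column to a character column generally stops with an error, so it's easier to fix the type first (for example with as.numeric) than to untangle it afterward.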

Now we just bind the two datasets together:

reads2019 <- reads2019 %>%
  bind_rows(addreads)
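
To see what bind_rows is doing under the hood, here's a small sketch with toy data (logged and forgotten are made-up tibbles, not my actual files). It shows two behaviors worth knowing: columns are matched by name rather than position, and a column that exists in only one dataset gets filled with NA for the other's rows. The optional .id argument adds a column recording which dataset each row came from.

```r
library(dplyr)

# Hypothetical toy data standing in for the two book files
logged <- tibble(Title = c("Book A", "Book B"), Pages = c(320, 150))
# Same variables in a different order, plus one extra column
forgotten <- tibble(Pages = 480, Title = "Book C", MyRating = 4)

# Columns are matched by name; MyRating is NA for the rows from logged
combined <- bind_rows(logged, forgotten)
combined

# .id labels each row with the name of the dataset it came from
labeled <- bind_rows(list(logged = logged, forgotten = forgotten), .id = "source")
labeled
```

In my case the two files have identical columns, so nothing gets filled in, but the source label can be handy if you later want to pick out the rows you added.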

Did these additions change the ordering by page length?

reads2019 <- reads2019 %>%
  arrange(desc(Pages), Author)

head(reads2019)

## # A tibble: 6 x 18
##   Title Pages date_started date_read Book.ID Author AdditionalAutho…
##   <chr> <dbl> <chr>        <chr>       <dbl> <chr>  <chr>           
## 1 The …  1216 6/12/2019    6/18/2019  3.30e1 Tolki… <NA>            
## 2 The …  1181 6/12/2019    6/17/2019  1.86e7 Atwoo… <NA>            
## 3 It     1156 8/14/2019    8/21/2019  2.79e7 King,… <NA>            
## 4 1Q84    925 9/3/2019     9/10/2019  1.04e7 Murak… Jay Rubin, Phil…
## 5 Inso…   890 8/10/2019    8/13/2019  1.06e4 King,… Bettina Blanch …
## 6 The …   592 8/18/2019    8/23/2019  1.16e4 King,… <NA>            
## # … with 11 more variables: AverageRating <dbl>, OriginalPublicationYear <dbl>,
## #   read_time <dbl>, MyRating <dbl>, Gender <dbl>, Fiction <dbl>,
## #   Childrens <dbl>, Fantasy <dbl>, SciFi <dbl>, Mystery <dbl>, SelfHelp <dbl>

It did! The longest book is now The Lord of the Rings, at 1216 pages, and number two is The MaddAddam Trilogy, 1181 pages.
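
The sort above uses two keys: desc(Pages) puts the longest books first, and Author breaks ties alphabetically when two books have the same page count. A quick sketch with made-up books (the titles and authors here are hypothetical) shows the tie-breaking:

```r
library(dplyr)

# Hypothetical toy data: two books tie at 500 pages
books <- tibble(
  Title  = c("Short", "Long B", "Long A"),
  Author = c("Zed", "Young", "Abel"),
  Pages  = c(100, 500, 500)
)

# Longest first; among ties, authors in ascending alphabetical order
sorted <- books %>%
  arrange(desc(Pages), Author)

sorted
```

Both 500-page books land at the top, with Abel's ahead of Young's.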

This is a pretty easy trick. Later on in this series, we'll talk about joins, which combine datasets that share cases but add new variables. That's one of the times the tidy data mindset becomes very important.
