Deeply Trivial: J is for Join

Today, we'll start digging into the wonderful world of joins! The tidyverse offers several different types of joins between two datasets, X and Y:

left_join - keeps all rows from X and adds columns from Y to any that match cases in X; if there is no matching record from Y, the Y columns will be NA for those cases
right_join - keeps all rows from Y and adds columns from Y; if there is no matching record in X, the new row from Y will have NAs for all X variables
full_join - keeps all rows from both X and Y, matching where it can and filling in NAs where data are missing
inner_join - only keeps rows in common between X and Y
semi_join - does not add any columns from Y, but only keeps rows in X that have a matching record in Y
anti_join - removes rows from X with a matching case in Y

Joins make it essential to follow tidy data principles. You should always have a variable or set of variables that serve as identifiers, so that you can use those to match up data. And, in fact, you may have multiple identifiers depending on what kind of data you're merging together. For instance, at work, I have many identifers I use for different data problems. For item characteristics (such as item difficulty), we can match with unique item identifier. Multiple people may receive the same item, but the characteristics of each item should be unique. For matching users with their certification records, we have their unique board ID. Data should be clear of any unnecessary duplicates, because they can either cause the merge to stop and throw out error messages or create an incredibly large dataset that contains all possible combinations of merges. Always know what your unit of measurement is (People? Books? Experimental conditions?), and figure out what variable or variables allow you to uniquely identify cases by that unit.

For instance, with my reading dataset, I have multiple variables I could use as identifiers. Book.ID uniquely identifies each book in the set, and I could use that variable to match up additional information about the book. Author is another identifier, and I could use that to match up with a dataset that includes more information about the authors. It's okay that my book dataset includes multiple books by the same author, as long as the author information set I merge with only has one record for each author. I could even use the dates in the dataset to match up with a dataset providing information on what I was doing on those days, such as whether I was traveling, working, or at home. Any of those variables could be used as an identifier for a merge, depending on what research question I want to answer.

In the dummy_cols post, I used the left_join to merge the book genres with my reading data (download the genre flag file and reading set to test out this code):

library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --

## <U+2713> ggplot2 3.2.1     <U+2713> purrr   0.3.3
## <U+2713> tibble  2.1.3     <U+2713> dplyr   0.8.3
## <U+2713> tidyr   1.0.0     <U+2713> stringr 1.4.0
## <U+2713> readr   1.3.1     <U+2713> forcats 0.4.0

## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/ReadsNoGenre.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   NewRating = col_double(),
##   FinalRating = col_double()
## )

genres <- read_csv("~/Downloads/Blogging A to Z/genres.csv",
                   col_names = TRUE)

## Parsed with column specification:
## cols(
##   Book.ID = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

reads2019 <- reads2019 %>%
  left_join(genres, by = "Book.ID")

For joins, variable names have to be formatted as strings (meaning, in quotes). We're joining on a single variable (Book.ID), which has the same name in both files, so we only need one string after by = . If we were joining on two variables, we'd need a list of two string. To demonstrate, I created a file by Title and Author that gives the main reason why I read the book: 1) it was recommended to me (either single book or a whole series), 2) it was a personal goal to read/re-read this book, 3) I randomly found the book at a shop, or 4) I planned to buy that specific book (usually because I follow the author's work). Let's merge that file in to reads2019 by Title and Author:

reason <- read_csv("~/Downloads/Blogging A to Z/reason.csv",
                   col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Author = col_character(),
##   Reason = col_character()
## )

reads2019 <- reads2019 %>%
  left_join(reason, by = c("Title", "Author"))

reads2019 %>%
  ggplot(aes(Reason)) +
  geom_bar() +
  coord_flip()

Based on this figure, the #1 reason I read a book or series of books is because of a recommendation from someone. Random finds and personal goals are tied for 2nd.

Let's try a full_join. I created a dataset based on a friend's reading from 2019. A full join will combine all rows from both datasets and merge any matching rows.

friendreads <- read_csv("~/Downloads/Blogging A to Z/friendreads.csv",
                        col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Author = col_character()
## )

reads2019 <- reads2019 %>%
  full_join(friendreads, by = c("Title"))

Since I joined only by title, and both datasets include an Author variable, the dataset was changed to have Author.x and Author.y. So every row that has NA for Author.x is in my friend's dataset and not mine, and every row that has NA for Author.y is in my dataset but not my friend's. Rows that contain values for both are our areas of overlap.

The nice thing about join is that the variables you join on don't need to have the same name (they do, however, need to be the same format, such as numeric, character, and so on). You'd just add an additional equal sign to tell which variables match up. Let's demonstrate a new join type with mismatching variable names while using one of my favorite R packages, tidytext.

library(tidytext)

titlewords <- reads2019 %>%
  unnest_tokens(titleword, Title)

titlewords <- titlewords %>%
  anti_join(stop_words, by = c("titleword" = "word"))

First, I tokenized the book titles, which expanded the dataset so that each word in the title received its own row. Then, I used anti_join to remove stop words (common words, such as "a", "the", and "of"). Because my tokenized dataset used the variable name titleword, I needed to tell R that it should match with the variable called word to do the anti-join. Both variables are characters, so the types match and can easily be joined. I could then do further analysis to see what the most common words are among my book titles.

That's all for now. Next time, we'll talk about keeping and dropping variables from a dataset!

Deeply Trivial

Saturday, April 11, 2020

J is for Join

No comments:

Post a Comment