- left_join - keeps all rows from X and adds columns from Y to any that match cases in X; if there is no matching record from Y, the Y columns will be NA for those cases
- right_join - keeps all rows from Y and adds columns from Y; if there is no matching record in X, the new row from Y will have NAs for all X variables
- full_join - keeps all rows from both X and Y, matching where it can and filling in NAs where data are missing
- inner_join - only keeps rows in common between X and Y
- semi_join - does not add any columns from Y, but only keeps rows in X that have a matching record in Y
- anti_join - removes rows from X with a matching case in Y
For instance, with my reading dataset, I have multiple variables I could use as identifiers. Book.ID uniquely identifies each book in the set, and I could use that variable to match up additional information about the book. Author is another identifier, and I could use that to match up with a dataset that includes more information about the authors. It's okay that my book dataset includes multiple books by the same author, as long as the author information set I merge with only has one record for each author. I could even use the dates in the dataset to match up with a dataset providing information on what I was doing on those days, such as whether I was traveling, working, or at home. Any of those variables could be used as an identifier for a merge, depending on what research question I want to answer.
In the dummy_cols post, I used the left_join to merge the book genres with my reading data (download the genre flag file and reading set to test out this code):
reads2019 <- read_csv("~/Downloads/Blogging A to Z/ReadsNoGenre.csv", col_names = TRUE)
genres <- read_csv("~/Downloads/Blogging A to Z/genres.csv", col_names = TRUE)
reads2019 <- reads2019 %>% left_join(genres, by = "Book.ID")
reason <- read_csv("~/Downloads/Blogging A to Z/reason.csv", col_names = TRUE)
reads2019 <- reads2019 %>% left_join(reason, by = c("Title", "Author")) reads2019 %>% ggplot(aes(Reason)) + geom_bar() + coord_flip()
Let's try a full_join. I created a dataset based on a friend's reading from 2019. A full join will combine all rows from both datasets and merge any matching rows.
friendreads <- read_csv("~/Downloads/Blogging A to Z/friendreads.csv", col_names = TRUE)
reads2019 <- reads2019 %>% full_join(friendreads, by = c("Title"))
The nice thing about join is that the variables you join on don't need to have the same name (they do, however, need to be the same format, such as numeric, character, and so on). You'd just add an additional equal sign to tell which variables match up. Let's demonstrate a new join type with mismatching variable names while using one of my favorite R packages, tidytext.
library(tidytext) titlewords <- reads2019 %>% unnest_tokens(titleword, Title) titlewords <- titlewords %>% anti_join(stop_words, by = c("titleword" = "word"))
That's all for now. Next time, we'll talk about keeping and dropping variables from a dataset!