There are two packages - geniusR and geniusr - which will do this. I played with both and found geniusR easier to use. Neither is perfect, but what is perfect, anyway?
To install geniusR, you'll use a different method than usual - you'll need to install the package devtools, then call the install_github function to download the R package directly from GitHub.
Now you'll want to load geniusR and tidyverse so we can work with our data.
library(geniusR)
library(tidyverse)
For today's demonstration, I'll be working with data from two artists I love: Taylor Swift and Lorde. Both dropped new albums last year, Reputation and Melodrama, respectively, and both, though similar in age and friends with each other, have very different writing and musical styles.
geniusR has a function, genius_album, that downloads the lyrics for an entire album and labels each line by track.
swift_lyrics <- genius_album(artist="Taylor Swift", album="Reputation")
lorde_lyrics <- genius_album(artist="Lorde", album="Melodrama")
Now we want to tokenize our datasets, remove stop words, and count word frequency. This code should look familiar, except this time I'm chaining the steps together with the pipe operator (%>%), which is loaded with the tidyverse and lets you string together multiple functions without having to nest them.
library(tidytext)
tidy_swift <- swift_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)
head(tidy_swift)
## # A tibble: 6 x 2
##   word      n
##   <chr> <int>
## 1 call     46
## 2 wanna    37
## 3 ooh      35
## 4 ha       34
## 5 ah       33
## 6 time     32
tidy_lorde <- lorde_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)
head(tidy_lorde)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 boom        40
## 2 love        26
## 3 shit        24
## 4 dynamite    22
## 5 homemade    22
## 6 light       22
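If the piping is hard to picture, here is a small sketch of the same tokenize, remove stop words, and count steps on a toy dataset, written once with the pipe and once nested. The lines in toy_lyrics are made up for illustration (they are not real lyrics), and the track_title column is my assumption; only the lyric column name comes from the code above.

```r
library(dplyr)
library(tidytext)

# Made-up toy lines, not real lyrics. `track_title` is a hypothetical column;
# only `lyric` mirrors the column used in the post.
toy_lyrics <- tibble(
  track_title = c("Track A", "Track A", "Track B"),
  lyric = c("Call the light", "Love and the call", "A call and a light")
)

# Piped version: each step feeds its result into the next one.
toy_counts <- toy_lyrics %>%
  unnest_tokens(word, lyric) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop words like "the", "and", "a"
  count(word, sort = TRUE)                # frequency table, most common first

# Nested version: the same three steps, read from the inside out.
toy_counts_nested <- count(
  anti_join(unnest_tokens(toy_lyrics, word, lyric), stop_words, by = "word"),
  word,
  sort = TRUE
)

toy_counts
```

Both versions should give the same table; the piped one just reads in the order the steps happen.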
Looking at the top 6 words for each, it doesn't look like there will be a lot of overlap. But let's explore that, shall we? Lorde's album is 3 tracks shorter than Taylor Swift's. To make sure our word comparisons are meaningful, I'll create new variables that take into account total number of words, so each word metric will be a proportion, allowing for direct comparisons. And because I'll be joining the datasets, I'll be sure to label these new columns by artist name.
tidy_swift <- tidy_swift %>%
  rename(swift_n = n) %>%
  mutate(swift_prop = swift_n/sum(swift_n))
tidy_lorde <- tidy_lorde %>%
  rename(lorde_n = n) %>%
  mutate(lorde_prop = lorde_n/sum(lorde_n))
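To see what the proportion step does, here is a minimal sketch on an invented three-word count table. The names toy_counts, toy_n, and toy_prop are mine, chosen only for illustration. Each proportion is the word's count divided by the total count, so the column should sum to 1.

```r
library(dplyr)

# Invented counts for illustration only.
toy_counts <- tibble(
  word = c("call", "light", "love"),
  n    = c(3L, 2L, 1L)
)

toy_props <- toy_counts %>%
  rename(toy_n = n) %>%                   # label the count column
  mutate(toy_prop = toy_n / sum(toy_n))   # share of all counted words

toy_props

# Sanity check: the proportions should add up to 1.
total_prop <- sum(toy_props$toy_prop)
```

Because each artist's column is scaled by her own total, a longer album doesn't automatically win every comparison.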
There are multiple types of joins available in the tidyverse. I used an anti_join to remove stop words. Today, I want to use a full_join, because I want my final dataset to retain all words from both artists. When one dataset contributes a word not found in the other artist's set, it will fill those variables in with missing values.
compare_words <- tidy_swift %>%
  full_join(tidy_lorde, by = "word")
summary(compare_words)
##      word              swift_n         swift_prop         lorde_n    
##  Length:957         Min.   : 1.000   Min.   :0.00050   Min.   : 1.0  
##  Class :character   1st Qu.: 1.000   1st Qu.:0.00050   1st Qu.: 1.0  
##  Mode  :character   Median : 1.000   Median :0.00050   Median : 1.0  
##                     Mean   : 3.021   Mean   :0.00152   Mean   : 2.9  
##                     3rd Qu.: 3.000   3rd Qu.:0.00151   3rd Qu.: 3.0  
##                     Max.   :46.000   Max.   :0.02321   Max.   :40.0  
##                     NA's   :301      NA's   :301       NA's   :508   
##    lorde_prop    
##  Min.   :0.0008  
##  1st Qu.:0.0008  
##  Median :0.0008  
##  Mean   :0.0022  
##  3rd Qu.:0.0023  
##  Max.   :0.0307  
##  NA's   :508
The final dataset contains 957 tokens - unique words - and the NAs tell how many words are only present in one artist's corpus. Lorde uses 301 words Taylor Swift does not, and Taylor Swift uses 508 words that Lorde does not. That leaves 148 words on which they overlap.
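Here is a small sketch of how the full join produces those missing values, and how you could count the unique and shared words directly instead of reading them off summary(). The two toy tables are invented; only the column names swift_n and lorde_n follow the code above.

```r
library(dplyr)

# Invented word counts for illustration only.
toy_swift <- tibble(word = c("call", "love", "time"), swift_n = c(3L, 1L, 2L))
toy_lorde <- tibble(word = c("love", "boom"),         lorde_n = c(2L, 4L))

# full_join keeps every word from both tables and fills the gaps with NA.
toy_compare <- toy_swift %>%
  full_join(toy_lorde, by = "word")

toy_compare

# NA in swift_n means the word appears only in the Lorde table, and vice versa.
only_lorde <- sum(is.na(toy_compare$swift_n))
only_swift <- sum(is.na(toy_compare$lorde_n))
shared     <- sum(!is.na(toy_compare$swift_n) & !is.na(toy_compare$lorde_n))
```

The three counts always add up to the number of rows in the joined table, which is the same arithmetic as 301 + 508 + 148 = 957 above.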
There are many things we could do with these data, but let's visualize words and proportions, with one artist on the x-axis and the other on the y-axis.
ggplot(compare_words, aes(x=swift_prop, y=lorde_prop)) +
  geom_abline() +
  geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
  labs(y="Lorde", x="Taylor Swift") +
  theme_classic()
## Warning: Removed 809 rows containing missing values (geom_text).
The warning lets me know there are 809 rows with missing values - those are the words only present in one artist's corpus (301 + 508). Words that fall on or near the line are used at similar rates by both artists. Words above the line are used more by Lorde than Taylor Swift, and words below the line are used more by Taylor Swift than Lorde. This tells us that, for instance, Lorde uses "love," "light," and, yes, "shit," more than Swift, while Swift uses "call," "wanna," and "hands" more than Lorde. They use words like "waiting," "heart," and "dreams" at similar rates. Rates are low overall: if you look at the max values for the proportion variables, Swift's most common word accounts for only about 2.3% of her total words, and Lorde's most common word accounts for only about 3.1% of hers.
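If you'd rather not eyeball the chart, one way to get at the same information is to take the difference between the two proportions for the shared words. This is just a sketch on invented numbers: the proportions below are made up, and the diff and used_more_by columns are names I picked for illustration.

```r
library(dplyr)

# Invented proportions for illustration only; NA marks a word one artist never uses.
toy_compare <- tibble(
  word       = c("love", "call", "heart", "boom"),
  swift_prop = c(0.004, 0.023, 0.003, NA),
  lorde_prop = c(0.020, 0.002, 0.003, 0.031)
)

shared_words <- toy_compare %>%
  filter(!is.na(swift_prop), !is.na(lorde_prop)) %>%  # the rows the plot can draw
  mutate(
    diff = lorde_prop - swift_prop,                   # > 0 sits above the line
    used_more_by = case_when(
      diff > 0 ~ "Lorde",
      diff < 0 ~ "Taylor Swift",
      TRUE     ~ "about equal"
    )
  ) %>%
  arrange(desc(diff))

shared_words
```

Words at the top of this table sit furthest above the line, words at the bottom sit furthest below it, and a difference near zero means the word falls on or near the line.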
This highlights why it's important to remove stop words for these types of analyses; otherwise, our datasets and chart would be full of words like "the," "a," and "and."
Next Statistics Sunday, we'll take a look at sentiment analysis!
