Tuesday, May 22, 2018

How Has Taylor Swift's Word Choice Changed Over Time?

Sunday night was a big night for Taylor Swift - not only was she nominated for multiple Billboard Music Awards, she also took home Top Female Artist and Top Selling Album. So I thought it was a good time for some more Taylor Swift-themed statistical analysis.

When I started this blog back in 2011, my goal was to write deep thoughts on trivial topics - specifically, to overthink and overanalyze pop culture and related topics that appear fluffy until you really dig into them. Recently, I've been blogging more about statistics, research, R, and data science, and I've loved getting to teach and share.

But sometimes, you just want to overthink and overanalyze pop culture.

So in a similar vein to the text analysis I've been demonstrating on my blog, I decided to answer a question I'm sure we all have - as Taylor Swift moved from country sweetheart to mega pop star, how have the words she uses in her songs changed?

I've used the geniusR package in a couple of posts, and I'll be using it again today to answer this question. I'll also be pulling in some additional code: some based on code from Text Mining with R: A Tidy Approach, a book I recently devoured, and some written to tackle this problem I've created for myself. I've shared all my code and tried to credit those who helped me write it where I can.

First, we want to pull in the names of Taylor Swift's 6 studio albums. I found these and their release dates on Wikipedia. While there are only 6 and I could easily copy and paste them to create my data frame, I wanted to pull that data directly from Wikipedia, to write code that could be used on a larger set in the future. Thanks to this post, I could, with a couple small tweaks.

library(rvest)
## Loading required package: xml2
TSdisc <- 'https://en.wikipedia.org/wiki/Taylor_Swift_discography'

disc <- TSdisc %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>%
  html_table(fill = TRUE)

Since html() is deprecated, I replaced it with read_html(), and I got errors if I didn't add fill = TRUE. The result is a list of 1, with an 8 by 14 data frame within that single list object. I can pull that out as a separate data frame.

TS_albums <- disc[[1]]

The data frame requires a little cleaning. First, there are 8 rows but only 6 albums. Because the Wikipedia table had a double header, the second header was read in as a row of data, and the last row contains a footnote that was included with the table. So I removed those two rows, first and last, and kept only the first two columns, which are the only ones I care about. Second, the release date was in a table cell along with record label and formats (e.g., CD, vinyl). I don't need those for my purposes, so I'll pull out only the release year and drop the rest. Finally, I converted year from character to numeric - this becomes important later on.

library(tidyverse)
TS_albums <- TS_albums[2:7,1:2] %>%
  separate(`Album details`, c("Released","Month","Day","Year"),
           extra='drop') %>%
  select(c("Title","Year"))

TS_albums$Year<-as.numeric(TS_albums$Year)
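To see what separate() is doing to that crowded table cell, here's a minimal sketch on a single made-up cell. The text in toy_albums is my stand-in for what the scraped "Album details" column might look like, so the real Wikipedia string may differ.

```r
library(tidyr)

# Made-up stand-in for one scraped table cell; the real Wikipedia text may differ
toy_albums <- data.frame(
  Title = "Taylor Swift",
  `Album details` = "Released: October 24, 2006 Label: Big Machine Formats: CD, digital download",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# separate() splits on runs of non-alphanumeric characters by default,
# so the first four pieces are "Released", "October", "24", "2006";
# extra = "drop" discards everything after the fourth piece without a warning
toy_clean <- separate(toy_albums, `Album details`,
                      c("Released", "Month", "Day", "Year"),
                      extra = "drop")

# Year comes back as character, so convert it before using it on a numeric axis
toy_clean$Year <- as.numeric(toy_clean$Year)
```

This only works because the year happens to be the fourth piece of the cell. If a cell started with different text, the column names would no longer line up with the pieces.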

I asked geniusR to download lyrics for all 6 albums. (Note: this code may take a couple minutes to run.) It nests all of the individual album data, including lyrics, into a single column, so I just need to unnest that to create a long file, with album title and release year applied to each unnested line.

library(geniusR)

TS_lyrics <- TS_albums %>%
  mutate(tracks = map2("Taylor Swift", Title, genius_album))
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
TS_lyrics <- TS_lyrics %>%
  unnest(tracks)

Now we'll tokenize our lyrics data frame, and start doing our word analysis.

library(tidytext)

tidy_TS <- TS_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_TS %>%
  count(word, sort = TRUE)
## # A tibble: 2,024 x 2
##    word      n
##    <chr> <int>
##  1 time    198
##  2 love    180
##  3 baby    118
##  4 ooh     104
##  5 stay     89
##  6 night    85
##  7 wanna    84
##  8 yeah     83
##  9 shake    80
## 10 ey       72
## # ... with 2,014 more rows
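If the tokenizing step feels like a black box, here's a small sketch of the same two calls on a toy data frame. The two lines in toy_lyrics are made up for illustration (they aren't real lyrics), and "Toy Song" is a hypothetical track title.

```r
library(dplyr)
library(tidytext)

# Made-up lines standing in for the lyric column (not real lyrics)
toy_lyrics <- tibble(
  track_title = "Toy Song",
  lyric = c("Stay with me all night", "Love me all the time")
)

toy_tidy <- toy_lyrics %>%
  unnest_tokens(word, lyric) %>%        # one lowercased word per row
  anti_join(stop_words, by = "word")    # drop common words like "the" and "me"

# Count what is left, most frequent first
toy_tidy %>% count(word, sort = TRUE)
```

Note that unnest_tokens() lowercases everything and strips punctuation by default, so "Love" and "love" end up counted as the same word.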

There are a little over 2,000 unique words across TS's 6 albums. But how have they changed over time? To examine this, I'll create a dataset that counts each word by year (or album, really). Then I'll use a binomial regression model to look at changes over time, one model per word. In their book, Julia Silge and David Robinson demonstrated how to use binomial regression to examine word use on their own Twitter accounts over time, including an adjustment to the p-values to correct for multiple comparisons. So I based my code on that.

words_by_year <- tidy_TS %>%
  count(Year, word) %>%
  group_by(Year) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 50)
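Here's a quick sketch of what the two totals in that chunk mean, using a tiny made-up set of tokens (not real lyric counts). I've left out the final filter so every row stays visible.

```r
library(dplyr)

# Toy tokens (made-up data, not real lyric counts)
toy_tokens <- tibble(
  Year = c(2006, 2006, 2006, 2017, 2017, 2017, 2017),
  word = c("eyes", "eyes", "call", "call", "call", "call", "eyes")
)

toy_by_year <- toy_tokens %>%
  count(Year, word) %>%
  group_by(Year) %>%
  mutate(time_total = sum(n)) %>%   # all counted words on that album
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%   # that word across all albums
  ungroup() %>%
  rename(count = n)
```

One thing worth keeping in mind: because stop words were removed before counting, time_total is the number of non-stop words on an album, not the album's full word count.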

nested_words <- words_by_year %>%
  nest(-word)

word_models <- nested_words %>%
  mutate(models = map(data, ~glm(cbind(count, time_total) ~ Year, .,
                                 family = "binomial")))
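To make the per-word model concrete, here's a sketch of a single fit on made-up counts (the numbers in toy_word are invented, not real results). glm() reads a two-column response as successes and failures, so I've also fit the textbook form with time_total - count in the second column for comparison. That second model is my addition, not something from the post or the book.

```r
# Made-up counts for one word across the six album years (not real data)
toy_word <- data.frame(
  Year       = c(2006, 2008, 2010, 2012, 2014, 2017),
  count      = c(0, 2, 3, 5, 12, 20),
  time_total = c(1500, 1600, 1700, 1800, 1900, 2000)
)

# Same formula as the post: count in the first column, time_total in the second
fit_post <- glm(cbind(count, time_total) ~ Year, data = toy_word,
                family = "binomial")

# Textbook successes/failures form: the second column is "not this word"
fit_alt <- glm(cbind(count, time_total - count) ~ Year, data = toy_word,
               family = "binomial")

# The Year row of the coefficient table holds the slope and its p-value
year_row <- coef(summary(fit_post))["Year", c("Estimate", "Pr(>|z|)")]

# How far apart the two Year slopes are
slope_gap <- abs(coef(fit_post)["Year"] - coef(fit_alt)["Year"])
```

When a word makes up only a tiny share of an album, the two forms should give nearly the same slope, which is what slope_gap lets you check on your own data.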

This stores one fitted model per word in a list column of a data frame called word_models. While I could unnest and keep everything, I don't care about every value the GLM gives me. What I care about is the slope for Year, so the first filter keeps only the row for that term, which holds the slope and its p-value. I then adjust the p-values for multiple comparisons and filter again to select the significant/marginally significant slopes for plotting (adjusted p < 0.1).

library(broom)

slopes <- word_models %>%
  unnest(map(models, tidy)) %>%
  filter(term == "Year") %>%
  mutate(adjusted.p.value = p.adjust(p.value))

top_slopes <- slopes%>%
  filter(adjusted.p.value < 0.1) %>%
  select(-statistic, -p.value)
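For a feel of what the adjustment does, here's a small sketch with made-up p-values (w1 to w5 are hypothetical words, not the post's results). With no method argument, p.adjust() uses the Holm correction.

```r
# Made-up raw p-values, one per word model (not the post's real results)
raw_p <- c(w1 = 0.001, w2 = 0.01, w3 = 0.02, w4 = 0.04, w5 = 0.5)

# p.adjust() defaults to method = "holm": the smallest p-value is multiplied
# by the number of tests, the next smallest by one fewer, and so on,
# with each adjusted value forced to be at least as large as the one before it
adj_p <- p.adjust(raw_p)

# Words that survive the post's p < 0.1 cut-off
kept <- names(adj_p)[adj_p < 0.1]
```

Because the multiplier depends on how many models were fit, changing the word_total filter changes the adjusted p-values as well as the set of words tested.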

This gives me five words that show changes in usage over time: bad, call, dancing, eyes, and yeah. We can plot those five words to see how they've changed in usage over her 6 albums. And because I still have my TS_albums data frame, I can use that information to label the axis of my plot (which is why I needed year to be numeric). I also added a vertical line and annotations to note where TS believes she shifted from country to pop.

library(scales)
words_by_year %>%
  inner_join(top_slopes, by = "word") %>%
  ggplot(aes(Year, count/time_total, color = word, lty = word)) +
  geom_line(size = 1.3) +
  labs(x = NULL, y = "Word Frequency") +
  scale_x_continuous(breaks=TS_albums$Year,
                     labels=TS_albums$Title) +
  scale_y_continuous(labels=scales::percent) +
  geom_vline(xintercept = 2014) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank()) +
  annotate("text", x = c(2009.5,2015.5), y = c(0.025,0.025),
           label = c("Country", "Pop") , size=5)


The biggest change appears to be in the word "call," which she didn't use at all in her self-titled album, and used at low rates until "1989" and, especially, "Reputation." I can ask for a few examples of "call" in her song lyrics, with grep.

library(expss)
callsubset <- TS_lyrics[grep("call", TS_lyrics$lyric),]
callsubset <- callsubset %>%
  select(Title, Year, track_title, lyric)
set.seed(2012)
callsubset<-callsubset[sample(nrow(callsubset), 3), ]
callsubset<-callsubset[order(callsubset$Year),]
as.etable(callsubset, rownames_as_row_labels = FALSE)
Title      | Year | track_title                 | lyric
Speak Now  | 2010 | Back to December (Acoustic) | When your birthday passed, and I didn't call
Red        | 2012 | All Too Well                | And you call me up again just to break me like a promise
Reputation | 2017 | Call It What You Want       | Call it what you want, call it what you want, call it

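One caveat on pulling examples this way: grep("call", ...) is a case-sensitive substring match, while the word counts came from whole, lowercased tokens. Here's a sketch of the difference on made-up lines (not real lyrics). The whole-word pattern is my suggestion, not what the post uses.

```r
# Made-up lines (not real lyrics) to compare the two matching rules
toy_lines <- c("Call me when you land",
               "I recalled that summer",
               "you never call",
               "calling out your name")

# Plain substring match, as in the post: case-sensitive and matches inside words
substring_hit <- grepl("call", toy_lines)

# Whole-word, case-insensitive match, closer to how the tokens were counted
word_hit <- grepl("\\bcall\\b", toy_lines, ignore.case = TRUE, perl = TRUE)
```

The substring version misses the capitalized "Call" and picks up "recalled" and "calling", so the sampled examples may include lines that never contributed to the count for "call".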
On the other hand, she doesn't sing about "eyes" as much now that she's moved from country to pop.

eyessubset <- TS_lyrics[grep("eyes", TS_lyrics$lyric),]
eyessubset <- eyessubset %>%
  select(Title, Year, track_title, lyric)
set.seed(415)
eyessubset<-eyessubset[sample(nrow(eyessubset), 3), ]
eyessubset<-eyessubset[order(eyessubset$Year),]
as.etable(eyessubset, rownames_as_row_labels = FALSE)
Title        | Year | track_title            | lyric
Taylor Swift | 2006 | A Perfectly Good Heart | And realized by the distance in your eyes that I would be the one to fall
Speak Now    | 2010 | Better Than Revenge    | I'm just another thing for you to roll your eyes at, honey
Red          | 2012 | State of Grace         | Just twin fire signs, four blue eyes

Bet you'll never listen to Taylor Swift the same way again.

A few notes: I opted to examine any slopes with p < 0.10, which is greater than conventional levels of significance; if you look at the adjusted p-value column, though, you'll see that 4 of the 5 are < 0.05 and one is only slightly greater than 0.05. I also made the somewhat arbitrary choice to include only words used more than 50 times across her 6 albums, so I could get different results by changing that filtering value when I create the words_by_year data frame. Feel free to play around and see what you get by using different values!

1 comment:

  1. You might want to think a bit about the meaning of the p-value. Are the lyrics a sample from a true population? I don't think so; they are the entire population. So perhaps it doesn't even make sense to do a regression analysis. Simply find a reasonable computation for change over time for each word, and that is your variable of interest.
