Tuesday, May 29, 2018

Stress and Its Effect on the Body

After a stressful winter and spring, I'm finally taking a break from work. So of course, what better time to get sick? After a 4-day migraine (started on my first day of vacation - Friday) with a tension headache and neck spasm so bad I couldn't look left, I ended up in urgent care yesterday afternoon. One injection of muscle relaxer, plus prescriptions for more muscle relaxers and migraine meds, and I'm finally feeling better.

Why does this happen? Why is it that after weeks or months of stress, we get sick when we finally get to "come down"?

I've blogged a bit about stress before. Stress causes your body to release certain hormones: adrenaline and norepinephrine, which produce an immediate physiological response, and cortisol, which takes a bit longer to make itself felt in your body. Cortisol is also involved in many of the negative consequences of chronic stress. Over time, it can do things like increase blood sugar, suppress the immune system, and contribute to acne breakouts.

You're probably aware that the symptoms of sickness are generally caused by your body reacting to and fighting the infection or virus. So the reason you suddenly get sick when the stressor goes away is that your immune system ramps back up, realizes there's a foreign body that doesn't belong, and starts fighting it. You had probably already caught the virus or infection, but you didn't yet have the symptoms - like fever (your body's attempt to "cook" it out) or a runny nose (your body increasing mucus production to push out the bug) - that let you know you were sick.

And in my case in particular, a study published in Neurology found that migraine sufferers were at increased risk of an attack after the stress "let-down." According to the researchers, this effect is even stronger when there is a huge build-up of stress and a sudden, large let-down; it's better to have mini let-downs throughout the stressful experience.

And here I thought I was engaging in a good amount of self-care throughout my stressful February-May.

Sunday, May 27, 2018

Statistics Sunday: Two R Packages to Check Out

I'm currently out of town and not spending as much time on my computer as I have over the last couple of months. (It's what happens when you're the only one in your department at work and most of your hobbies involve a computer.) But I wanted to write up something for Statistics Sunday, and I recently discovered two R packages I need to check out in the near future.

The first is called echor, which allows you to search and download data directly from the US Environmental Protection Agency (EPA) Enforcement and Compliance History Online (ECHO) database, using the ECHO API. According to the vignette, linked above, "ECHO provides data for:
  • Stationary sources permitted under the Clean Air Act, including data from the National Emissions Inventory, Greenhouse Gas Reporting Program, Toxics Release Inventory, and Clean Air Markets Division Acid Rain Program and Clean Air Interstate Rule. 
  • Public drinking water systems permitted under the Safe Drinking Water Act, including data from the Safe Drinking Water Information System. 
  • Hazardous Waste Handlers permitted under the Resource Conservation and Recovery Act, with data drawn from the RCRAInfo data system. 
  • Facilities permitted under the Clean Water Act and the National Pollutant Discharge Elimination Systems (NPDES) program, including data from EPA’s ICIS-NPDES system and possibly waterbody information from EPA’s ATTAINS data system."
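Before moving on to the second package, here's what a first query might look like. I haven't tested this yet - the function name echoAirGetFacilityInfo and its p_st and output arguments are my reading of the echor vignette, so treat this as a hypothetical sketch rather than working code.

library(echor)

# Hypothetical sketch: pull Clean Air Act facility records for Illinois as a
# data frame (function name and arguments assumed from the vignette)
il_air <- echoAirGetFacilityInfo(p_st = "IL", output = "df")
head(il_air)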
The second package is papaja, or Preparing APA Journal Articles, which uses RStudio and R Markdown to create APA-formatted papers. Back when I wrote APA style papers regularly, I had Word styles set up to automatically format headers and subheaders, but properly formatting tables and charts was another story. This package promises to do all of that. It's still in development, but you can find out more about it here and here.
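To give a flavor of the papaja workflow - this is a minimal sketch based on my skim of the documentation, not something I've tried yet - a manuscript is an R Markdown file whose output format is papaja::apa6_pdf, and helper functions like apa_table() handle the table formatting:

# In the YAML header of an R Markdown document created from the papaja template
# (assumed setup):
#
# ---
# title  : "My Manuscript"
# output : papaja::apa6_pdf
# ---

library(papaja)

# apa_table() renders a data frame as an APA-formatted table in the manuscript
apa_table(head(mtcars[, 1:4]),
          caption = "Example descriptives table")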

I have some fun analysis projects in the works! Stay tuned.

Wednesday, May 23, 2018

Bad Lip Reading of the Royal Wedding

The more I learn about Meghan Markle, the more I love her - especially when I read about her rescue Beagle, Guy. (Thanks to the lovely weather in Chicago and the many people out walking their dogs, I've gotten to pet many puppies over the last couple of days, including an incredibly sweet 12-week-old German Shepherd mix this morning.)

And on the subject of Meghan Markle: whether you watched the Royal Wedding or not (no judgment either way), I highly recommend this hilarious Bad Lip Reading of the event:

Statistics in the News

It's been a long road to our new database management system at work, and while we're still working through issues with vendors, I think we're finally going to be able to publish exam scores to our new system today. (Wish me luck!) In the meantime, here are some statistically-themed news stories I'll have to read later:
  • 99% - that's how many requests for access to experimental drugs and treatments are approved by the FDA under the "compassionate use" program; even so, Congress passed a bill providing increased access to experimental treatments
  • 5 - the number of eviction notices these parents sent to their 30-year-old son still living at home; a judge agreed it's time for him to move out
  • 46% and 50% - the percentage of urban and rural residents, respectively, who report drug addiction as one of the biggest problems in their community
  • Less than 20 minutes - how long it will take you to listen to Dr. Frank Newport's 5 key polling insights in this Gallup podcast
  • June 27 - the date of the grand opening of the National Museum of Psychology at the University of Akron

Tuesday, May 22, 2018

How Has Taylor Swift's Word Choice Changed Over Time?

Sunday night was a big night for Taylor Swift - not only was she nominated for multiple Billboard Music Awards, she also took home Top Female Artist and Top Selling Album. So I thought it was a good time for some more Taylor Swift-themed statistical analysis.

When I started this blog back in 2011, my goal was to write deep thoughts on trivial topics - specifically, to overthink and overanalyze pop culture and related topics that appear fluffy until you really dig into them. Recently, I've been blogging more about statistics, research, R, and data science, and I've loved getting to teach and share.

But sometimes, you just want to overthink and overanalyze pop culture.

So in a similar vein to the text analysis I've been demonstrating on my blog, I decided to answer a question I'm sure we all have - as Taylor Swift moved from country sweetheart to mega pop star, how have the words she uses in her songs changed?

I've used the geniusR package in a couple of posts, and I'll be using it again today to answer this question. I'll also be pulling in some additional code: some of it based on code from Text Mining with R: A Tidy Approach, a book I recently devoured, and some written to tackle the problem I've set for myself. I've shared all my code and tried to credit those who helped me write it where I can.

First, we want to pull in the names of Taylor Swift's 6 studio albums. I found these and their release dates on Wikipedia. While there are only 6 and I could easily copy and paste them to create my data frame, I wanted to pull that data directly from Wikipedia, to write code that could be used on a larger set in the future. Thanks to this post, I could, with a couple small tweaks.

library(rvest)
## Loading required package: xml2
TSdisc <- 'https://en.wikipedia.org/wiki/Taylor_Swift_discography'

disc <- TSdisc %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>%
  html_table(fill = TRUE)

Since html() is deprecated, I replaced it with read_html(), and I got errors if I didn't add fill = TRUE. The result is a list of 1, with an 8 by 14 data frame within that single list object. I can pull that out as a separate data frame.

TS_albums <- disc[[1]]

The data frame requires a little cleaning. First up, there are 8 rows but only 6 albums. Because the Wikipedia table had a double header, the second header was read in as a row of data; I want to delete that, since I only care about the first two columns anyway. The last row contains a footnote that was included with the table. So I removed those two rows, first and last, and dropped the columns I don't need. Second, the release date I want was in a table cell along with record label and formats (e.g., CD, vinyl). I don't need those for my purposes, so I'll pull out only the information I want and drop the rest. Finally, I converted year from character to numeric - this becomes important later on.

library(tidyverse)
TS_albums <- TS_albums[2:7,1:2] %>%
  separate(`Album details`, c("Released","Month","Day","Year"),
           extra='drop') %>%
  select(c("Title","Year"))

TS_albums$Year<-as.numeric(TS_albums$Year)

I asked geniusR to download lyrics for all 6 albums. (Note: this code may take a couple minutes to run.) It nests all of the individual album data, including lyrics, into a single column, so I just need to unnest that to create a long file, with album title and release year applied to each unnested line.

library(geniusR)

TS_lyrics <- TS_albums %>%
  mutate(tracks = map2("Taylor Swift", Title, genius_album))
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
TS_lyrics <- TS_lyrics %>%
  unnest(tracks)

Now we'll tokenize our lyrics data frame, and start doing our word analysis.

library(tidytext)

tidy_TS <- TS_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_TS %>%
  count(word, sort = TRUE)
## # A tibble: 2,024 x 2
##    word      n
##    <chr> <int>
##  1 time    198
##  2 love    180
##  3 baby    118
##  4 ooh     104
##  5 stay     89
##  6 night    85
##  7 wanna    84
##  8 yeah     83
##  9 shake    80
## 10 ey       72
## # ... with 2,014 more rows

There are a little over 2,000 unique words across TS's 6 albums. But how have they changed over time? To examine this, I'll create a dataset that counts each word by year (or album, really). Then I'll use a binomial regression model to look at changes over time, one model per word. In their book, Julia Silge and David Robinson demonstrated how to use binomial regression to examine word use on the authors' Twitter accounts over time, including an adjustment to the p-values to correct for multiple comparisons. So I based my code off that.

words_by_year <- tidy_TS %>%
  count(Year, word) %>%
  group_by(Year) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 50)

nested_words <- words_by_year %>%
  nest(-word)

word_models <- nested_words %>%
  mutate(models = map(data, ~glm(cbind(count, time_total) ~ Year, .,
                                 family = "binomial")))

This nests our regression results in a data frame called word_models. While I could unnest and keep everything, I don't care about every value the GLM gives me. What I care about is the slope for Year, so the filter keeps only that term and its associated p-value. I can then filter again to select the significant or marginally significant slopes for plotting (p < 0.1).

library(broom)

slopes <- word_models %>%
  unnest(map(models, tidy)) %>%
  filter(term == "Year") %>%
  mutate(adjusted.p.value = p.adjust(p.value))

top_slopes <- slopes %>%
  filter(adjusted.p.value < 0.1) %>%
  select(-statistic, -p.value)

This gives me five words that show changes in usage over time: bad, call, dancing, eyes, and yeah. We can plot those five words to see how they've changed in usage over her 6 albums. And because I still have my TS_albums data frame, I can use that information to label the x-axis of my plot (which is why I needed year to be numeric). I also added a vertical line and annotations to note where TS believes she shifted from country to pop.

library(scales)
words_by_year %>%
  inner_join(top_slopes, by = "word") %>%
  ggplot(aes(Year, count/time_total, color = word, lty = word)) +
  geom_line(size = 1.3) +
  labs(x = NULL, y = "Word Frequency") +
  scale_x_continuous(breaks=TS_albums$Year,
                     labels=TS_albums$Title) +
  scale_y_continuous(labels=scales::percent) +
  geom_vline(xintercept = 2014) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank()) +
  annotate("text", x = c(2009.5,2015.5), y = c(0.025,0.025),
           label = c("Country", "Pop") , size=5)


The biggest change appears to be in the word "call," which she didn't use at all in her self-titled album, and used at low rates until "1989" and, especially, "Reputation." I can ask for a few examples of "call" in her song lyrics, with grep.

library(expss)
callsubset <- TS_lyrics[grep("call", TS_lyrics$lyric),]
callsubset <- callsubset %>%
  select(Title, Year, track_title, lyric)
set.seed(2012)
callsubset<-callsubset[sample(nrow(callsubset), 3), ]
callsubset<-callsubset[order(callsubset$Year),]
as.etable(callsubset, rownames_as_row_labels = FALSE)
Title       Year  track_title                  lyric
Speak Now   2010  Back to December (Acoustic)  When your birthday passed, and I didn't call
Red         2012  All Too Well                 And you call me up again just to break me like a promise
Reputation  2017  Call It What You Want        Call it what you want, call it what you want, call it

On the other hand, she doesn't sing about "eyes" as much now that she's moved from country to pop.

eyessubset <- TS_lyrics[grep("eyes", TS_lyrics$lyric),]
eyessubset <- eyessubset %>%
  select(Title, Year, track_title, lyric)
set.seed(415)
eyessubset<-eyessubset[sample(nrow(eyessubset), 3), ]
eyessubset<-eyessubset[order(eyessubset$Year),]
as.etable(eyessubset, rownames_as_row_labels = FALSE)
Title         Year  track_title             lyric
Taylor Swift  2006  A Perfectly Good Heart  And realized by the distance in your eyes that I would be the one to fall
Speak Now     2010  Better Than Revenge     I'm just another thing for you to roll your eyes at, honey
Red           2012  State of Grace          Just twin fire signs, four blue eyes

Bet you'll never listen to Taylor Swift the same way again.

A few notes: I opted to examine any slopes with p < 0.10, which is greater than conventional levels of significance; if you look at the adjusted p-value column, though, you'll see that 4 of the 5 are < 0.05 and one is only slightly greater than 0.05. I also made the somewhat arbitrary choice to include only words used more than 50 times across her 6 albums, so I could get different results by changing that filtering value when I create the words_by_year data frame. Feel free to play around and see what you get by using different values!
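For example, here's a quick way to try a looser cutoff - 25 occurrences instead of 50, a value I picked purely for illustration - by rerunning the same pipeline with a different filter before refitting the models:

words_by_year_25 <- tidy_TS %>%
  count(Year, word) %>%
  group_by(Year) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 25) # looser cutoff than the 50 used above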

Sunday, May 20, 2018

Statistics Sunday: Welcome to Sentiment Analysis with "Hotel California"

Welcome to the Hotel California! As promised in last week's post, this week: sentiment analysis, also with song lyrics.

Sentiment analysis is a method of natural language processing that involves classifying words in a document based on whether they are positive or negative, or whether they relate to a set of basic human emotions; the exact results differ based on the sentiment lexicon selected. The tidytext R package gives access to 4 different sentiment lexicons:
  • "AFINN" for Finn Årup Nielsen - which classifies words from -5 to +5 in terms of negative or positive valence
  • "bing" for Bing Liu and colleagues - which classifies words as either positive or negative
  • "loughran" for Loughran-McDonald - mostly for financial and nonfiction works, which classifies as positive or negative, as well as topics of uncertainty, litigious, modal, and constraining
  • "nrc" for the NRC lexicon - which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment
Sentiment analysis works on unigrams - single words - but you can aggregate across multiple words to look at sentiment across a text.
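If you'd like to see how the lexicons differ before we apply one, here's a quick optional peek using tidytext's get_sentiments() function - just a sketch to show the shape of each lexicon (output omitted):

library(tidytext)
library(dplyr)

# Each lexicon is a data frame of words plus its coding scheme
get_sentiments("afinn") %>% head()         # word + numeric score (-5 to +5)
get_sentiments("bing") %>% head()          # word + positive/negative
get_sentiments("nrc") %>% count(sentiment) # words tagged with emotions and valence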

To demonstrate sentiment analysis, I'll use one of my favorite songs: "Hotel California" by the Eagles.

I know, I know.

The Dude has had a rough night and he hates the f'ing Eagles.

Using code similar to last week's, let's pull in the lyrics of the song.

library(geniusR)
library(tidyverse)
hotel_calif <- genius_lyrics(artist = "Eagles", song = "Hotel California") %>%
  mutate(line = row_number())

First, we'll chop up these 43 lines into individual words, using the tidytext package and unnest_tokens function.

library(tidytext)
tidy_hc <- hotel_calif %>%
  unnest_tokens(word,lyric)

This is also probably the point at which I would remove stop words with anti_join. But these common words are very unlikely to have a sentiment attached to them, so I'll leave them in, knowing they'll be filtered out anyway by this analysis. We have 4 lexicons to choose from. Loughran is geared toward financial and nonfiction texts, but we'll still see how well it can classify these words. First, let's create a data frame of our 4 sentiment lexicons.

new_sentiments <- sentiments %>%
  mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
                             ifelse(lexicon == "AFINN" & score < 0,
                                    "negative", sentiment))) %>%
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()

Now, we'll see how well the 4 lexicons match up with the words in the lyrics. Big thanks to Debbie Liske at DataCamp for this piece of code (and several other pieces used in this post):

my_kable_styling <- function(dat, caption) {
  kable(dat, "html", escape = FALSE, caption = caption) %>%
    kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
                  full_width = FALSE)
}


library(knitr)
library(kableExtra)
library(formattable)
library(yarrr)
tidy_hc %>%
  mutate(words_in_lyrics = n_distinct(word)) %>%
  inner_join(new_sentiments) %>%
  group_by(lexicon, words_in_lyrics, words_in_lexicon) %>%
  summarise(lex_match_words = n_distinct(word)) %>%
  ungroup() %>%
  mutate(total_match_words = sum(lex_match_words),
         match_ratio = lex_match_words/words_in_lyrics) %>%
  select(lexicon, lex_match_words, words_in_lyrics, match_ratio) %>%
  mutate(lex_match_words = color_bar("lightblue")(lex_match_words),
         lexicon = color_tile("lightgreen","lightgreen")(lexicon)) %>%
  my_kable_styling(caption = "Lyrics Found In Lexicons")
## Joining, by = "word"
Lyrics Found In Lexicons
lexicon lex_match_words words_in_lyrics match_ratio
AFINN 18 175 0.1028571
bing 18 175 0.1028571
loughran 1 175 0.0057143
nrc 23 175 0.1314286

NRC offers the best match, classifying about 13% of the words in the lyrics. (It's not unusual to have such a low percentage. Not all words have a sentiment.)

hcsentiment <- tidy_hc %>%
  inner_join(get_sentiments("nrc"), by = "word")

hcsentiment
## # A tibble: 103 x 4
##    track_title       line word   sentiment
##    <chr>            <int> <chr>  <chr>    
##  1 Hotel California     1 dark   sadness  
##  2 Hotel California     1 desert anger    
##  3 Hotel California     1 desert disgust  
##  4 Hotel California     1 desert fear     
##  5 Hotel California     1 desert negative 
##  6 Hotel California     1 desert sadness  
##  7 Hotel California     1 cool   positive 
##  8 Hotel California     2 smell  anger    
##  9 Hotel California     2 smell  disgust  
## 10 Hotel California     2 smell  negative 
## # ... with 93 more rows

Let's visualize the counts of different emotions and sentiments in the NRC lexicon.

theme_lyrics <- function(aticks = element_blank(),
                         pgminor = element_blank(),
                         lt = element_blank(),
                         lp = "none")
{
  theme(plot.title = element_text(hjust = 0.5), #Center the title
        axis.ticks = aticks, #Set axis ticks to on or off
        panel.grid.minor = pgminor, #Turn the minor grid lines on or off
        legend.title = lt, #Turn the legend title on or off
        legend.position = lp) #Turn the legend on or off
}

hcsentiment %>%
  group_by(sentiment) %>%
  summarise(word_count = n()) %>%
  ungroup() %>%
  mutate(sentiment = reorder(sentiment, word_count)) %>%
  ggplot(aes(sentiment, word_count, fill = -word_count)) +
  geom_col() +
  guides(fill = FALSE) +
  theme_minimal() + theme_lyrics() +
  labs(x = NULL, y = "Word Count") +
  ggtitle("Hotel California NRC Sentiment Totals") +
  coord_flip()


Most of the words appear to be positively-valenced. How do the individual words match up?

library(ggrepel)

plot_words <- hcsentiment %>%
  group_by(sentiment) %>%
  count(word, sort = TRUE) %>%
  arrange(desc(n)) %>%
  ungroup()

plot_words %>%
  ggplot(aes(word, 1, label = word, fill = sentiment)) +
  geom_point(color = "white") +
  geom_label_repel(force = 1, nudge_y = 0.5,
                   direction = "y",
                   box.padding = 0.04,
                   segment.color = "white",
                   size = 3) +
  facet_grid(~sentiment) +
  theme_lyrics() +
  theme(axis.text.y = element_blank(), axis.line.x = element_blank(),
        axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect("lightgray", fill = NA),
        strip.text.x = element_text(size = 9)) +
  xlab(NULL) + ylab(NULL) +
  ggtitle("Hotel California Words by NRC Sentiment") +
  coord_flip()


It looks like some words are being misclassified. For instance, "smell" as in "warm smell of colitas" is being classified as anger, disgust, and negative. But that doesn't explain the overall positive bent being applied to the song. If you listen to the song, you know it's not really a happy song. It starts off somewhat negative - or at least, ambiguous - as the narrator is driving on a dark desert highway. He's tired and having trouble seeing, and notices the Hotel California, a shimmering oasis on the horizon. He stops in and is greeted by a "lovely face" in a "lovely place." At the hotel, everyone seems happy: they dance and drink, they have fancy cars, they have pretty "friends."

But the song is in a minor key. Though not always a sign that a song is sad, it is, at the very least, a hint of something ominous, lurking below the surface. Soon, things turn bad for the narrator. The lovely-faced woman tells him they are "just prisoners here of our own device." He tries to run away, but the night man tells him, "You can check out anytime you like, but you can never leave."

The song seems to be a metaphor for something, perhaps fame and excess, which was also the subject of another song on the same album, "Life in the Fast Lane." To someone seeking fame, life is dreary, dark, and deserted. Fame is like an oasis - beautiful and shimmering, an escape. But it isn't all it appears to be. You may be surrounded by beautiful people, but you can only call them "friends." You trust no one. And once you join that lifestyle, you might be able to check out, perhaps through farewell tour(s), but you can never leave that life - people know who you are (or were) and there's no disappearing. Or it could be about something even darker that's hard to escape, like substance abuse. Whatever meaning you ascribe to the song, the overall message seems to be that things are not as wonderful as they appear on the surface.

So if we follow our own understanding of the song's trajectory, we'd say it starts off somewhat negatively, becomes positive in the middle, then dips back into the negative at the end, when the narrator tries to escape and finds he cannot.

We can chart this, using the line number, which coincides with the location of the word in the song. We'll stick with NRC since it offered the best match, but for simplicity, we'll only pay attention to the positive and negative sentiment codes.

hcsentiment_index <- tidy_hc %>%
  inner_join(get_sentiments("nrc")%>%
               filter(sentiment %in% c("positive",
                                       "negative"))) %>%
  count(index = line, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"

This gives us a data frame that aggregates sentiment by line. If a line contains more positive than negative words, its overall sentiment is positive, and vice versa. Because not every word in the lyrics has a sentiment, not every line has an associated aggregate sentiment. But it gives us a sort of trajectory over the course of the song. We can visualize this trajectory like this:

hcsentiment_index %>%
  ggplot(aes(index, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE)


As the chart shows, the song starts somewhat positive, with a dip soon after into the negative. The middle of the song is positive, as the narrator describes the decadence of the Hotel California. But it turns dark at the end, and stays that way as the guitar solo soars in.

Sources

This awesome post by Debbie Liske, mentioned earlier, for her code and custom functions to make my charts pretty.

Text Mining with R: A Tidy Approach by Julia Silge and David Robinson

Friday, May 18, 2018

What Makes a Song (More) Popular

Earlier this week, the Association for Psychological Science sent out a press release about a study examining what makes a song popular:
Researchers Jonah Berger of the University of Pennsylvania and Grant Packard of Wilfrid Laurier University were interested in understanding the relationship between similarity and success. In a recent study published in Psychological Science, the authors describe how a person’s drive for stimulation can be satisfied by novelty. Cultural items that are atypical, therefore, may be more liked and become more popular.

“Although some researchers have argued that cultural success is impossible to predict,” they explain, “textual analysis of thousands of songs suggests that those whose lyrics are more differentiated from their genres are more popular.”
The study, which was published online ahead of print, used a method of topic modeling called latent Dirichlet allocation. (Side note: this analysis is available in the R topicmodels package as the function LDA. It requires a document-term matrix, which can be created in R - see the sketch after the list below. Perhaps a future post!) The LDA extracted 10 topics from the lyrics of songs spanning seven genres (Christian, country, dance, pop, rap, rock, and rhythm and blues):
  • Anger and violence
  • Body movement
  • Dance moves
  • Family
  • Fiery love
  • Girls and cars
  • Positivity
  • Spiritual
  • Street cred
  • Uncertain love
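Here's the sketch I promised above - a minimal, generic illustration of fitting a 10-topic LDA with topicmodels, not the authors' actual analysis. Since I don't have their lyrics corpus, I'm using the AssociatedPress document-term matrix that ships with the package as a stand-in; for lyrics, you'd first build a document-term matrix with something like tm::DocumentTermMatrix() or tidytext::cast_dtm().

library(topicmodels)

# Stand-in corpus: a document-term matrix bundled with the package
data("AssociatedPress", package = "topicmodels")

# Fit a 10-topic LDA, mirroring the 10 topics reported in the study
# (only the first 100 documents, to keep the example fast)
example_lda <- LDA(AssociatedPress[1:100, ], k = 10)

terms(example_lda, 5) # top 5 terms for each topic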
Overall, they found that songs with lyrics that differentiated them from other songs in their genre were more popular. However, this wasn't the case for the specific genres of pop and dance, where lyrical differentiation appeared to be harmful to popularity. Finally, being lyrically different by being more similar to a different genre (one to which the song wasn't assigned) had no impact. So it isn't about writing a rock song that sounds like a rap song to gain popularity; it's about writing a rock song that sounds different from other rock songs.

I love this study idea, especially since I've started doing some text and lyric analysis on my own. (Look for another one Sunday, tackling the concept of sentiment analysis!) But I do have a criticism. This research used songs listed in the Billboard Top 50 by genre. While it would be impossible to analyze every single song that comes out at a given time, this study doesn't really answer the question of what makes a song popular, but rather what determines how popular an already popular song is. The advice in the press release (To Climb the Charts, Write Lyrics That Stand Out) may be true for established artists who are already popular, but it doesn't help the young artist trying to break onto the scene. They're probably already writing lyrics to try to stand out. They just haven't been noticed yet.

Tuesday, May 15, 2018

Happy Birthday, L. Frank Baum!

Today is the 162nd birthday of L. Frank Baum who, as author of The Wonderful Wizard of Oz and 13 other "Oz" books, had a profound effect on my childhood and may even be responsible for my love of writing.

Photo by George Steckel - Los Angeles Times photographic archive, UCLA Library, Public Domain
And two days from now, on May 17, it will be 118 years since the first book of the Oz series was published. I was obsessed with the book series as a kid, and still collect antique copies of the books (and a few other Oz collectibles).

Monday, May 14, 2018

The Odds Were Never in His Favor

Riddler and chess lover Oliver Roeder blogged today about a once-in-a-lifetime opportunity: playing a game of chess against world champion Magnus Carlsen. He lost, of course. But he obviously had a great time:
To my nervously trembling chagrin, they’d set up my chess board correctly and in the traditional fashion: I had only the one queen and the two rooks and so forth, and somehow it was deemed appropriate that Carlsen start with the identical and equal number of pieces. A grandmaster buddy of mine texted me before the game, “Make sure your pieces are defended.” It certainly sounded simple enough. My aunt wrote on Facebook, “I hope Oliver wins!” Other well-wishers wished me “good luck.”

Thanks, but what luck? Chess is stripped of that frivolity; it’s the canonical no-chance, perfect information game. That nakedness is why boxing is a good analogy to chess: two people battling in a confined space with nothing, not a shroud of randomness or the fog of war, to hide behind. I once beat the Scrabble national champion (in Scrabble, not chess), but that was only because a) I sort of knew what I was doing and b) there is luck in that game that I could hide behind. I got lucky. Awaiting the world chess champion, I harbored no such idiotic delusions as I sat at an enormous horseshoe table, fretting and adjusting the pieces. Carlsen was about to do to my psyche what Mike Tyson would’ve done to my face. There was no escape.
Roeder ran his and Carlsen's moves through a chess engine called Stockfish, which estimates who is likely to win based on every position in the game.


And you can watch a move-by-move replay of the game here.

Chess is such a fascinating game. I learned to play many years ago, but I've forgotten how over the last couple of decades. One of my goals this year is to relearn.

Sunday, May 13, 2018

Statistics Sunday: Taylor Swift vs. Lorde - Analyzing Song Lyrics

Last week, I showed how to tokenize text. Today I'll use those functions to do some text analysis of one of my favorite types of text: song lyrics. Plus, this is a great opportunity to demonstrate a new R package I discovered: geniusR, which will download lyrics from Genius.

There are two packages - geniusR and geniusr - which will do this. I played with both and found geniusR easier to use. Neither is perfect, but what is perfect, anyway?

To install geniusR, you'll use a different method than usual - you'll need to install the package devtools, then call the install_github function to download the R package directly from GitHub.

install.packages("devtools")
devtools::install_github("josiahparry/geniusR")
## Downloading GitHub repo josiahparry/geniusR@master
## from URL https://api.github.com/repos/josiahparry/geniusR/zipball/master
## Installing geniusR
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
##   --no-environ --no-save --no-restore --quiet CMD INSTALL  \
##   '/private/var/folders/85/9ygtlz0s4nxbmx3kgkvbs5g80000gn/T/Rtmpl3bwRx/devtools33c73e3f989/JosiahParry-geniusR-5907d82'  \
##   --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library'  \
##   --install-tests
## 

Now you'll want to load geniusR and tidyverse so we can work with our data.

library(geniusR)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

For today's demonstration, I'll be working with data from two artists I love: Taylor Swift and Lorde. Both dropped new albums last year, Reputation and Melodrama, respectively, and both, though similar in age and friends with each other, have very different writing and musical styles.

geniusR has a function, genius_album, that will download the lyrics of an entire album, labeling them by track.

swift_lyrics <- genius_album(artist="Taylor Swift", album="Reputation")
## Joining, by = c("track_title", "track_n", "track_url")
lorde_lyrics <- genius_album(artist="Lorde", album="Melodrama")
## Joining, by = c("track_title", "track_n", "track_url")

Now we want to tokenize our datasets, remove stop words, and count word frequency - this code should look familiar, except this time I'm combining the steps using the pipe operator (%>%) from the tidyverse, which allows you to string together multiple functions without having to nest them.

library(tidytext)
tidy_swift <- swift_lyrics %>%
  unnest_tokens(word,lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
head(tidy_swift)
## # A tibble: 6 x 2
##   word      n
##   <chr> <int>
## 1 call     46
## 2 wanna    37
## 3 ooh      35
## 4 ha       34
## 5 ah       33
## 6 time     32
tidy_lorde <- lorde_lyrics %>%
  unnest_tokens(word,lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
head(tidy_lorde)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 boom        40
## 2 love        26
## 3 shit        24
## 4 dynamite    22
## 5 homemade    22
## 6 light       22

Looking at the top 6 words for each, it doesn't look like there will be a lot of overlap. But let's explore that, shall we? Lorde's album is 3 tracks shorter than Taylor Swift's. To make sure our word comparisons are meaningful, I'll create new variables that take into account total number of words, so each word metric will be a proportion, allowing for direct comparisons. And because I'll be joining the datasets, I'll be sure to label these new columns by artist name.

tidy_swift <- tidy_swift %>%
  rename(swift_n = n) %>%
  mutate(swift_prop = swift_n/sum(swift_n))

tidy_lorde <- tidy_lorde %>%
  rename(lorde_n = n) %>%
  mutate(lorde_prop = lorde_n/sum(lorde_n))

There are multiple types of joins available in the tidyverse. I used an anti_join to remove stop words. Today, I want to use a full_join, because I want my final dataset to retain all words from both artists. When one dataset contributes a word not found in the other artist's set, it will fill those variables in with missing values.

compare_words <- tidy_swift %>%
  full_join(tidy_lorde, by = "word")

summary(compare_words)
##      word              swift_n         swift_prop         lorde_n    
##  Length:957         Min.   : 1.000   Min.   :0.00050   Min.   : 1.0  
##  Class :character   1st Qu.: 1.000   1st Qu.:0.00050   1st Qu.: 1.0  
##  Mode  :character   Median : 1.000   Median :0.00050   Median : 1.0  
##                     Mean   : 3.021   Mean   :0.00152   Mean   : 2.9  
##                     3rd Qu.: 3.000   3rd Qu.:0.00151   3rd Qu.: 3.0  
##                     Max.   :46.000   Max.   :0.02321   Max.   :40.0  
##                     NA's   :301      NA's   :301       NA's   :508   
##    lorde_prop    
##  Min.   :0.0008  
##  1st Qu.:0.0008  
##  Median :0.0008  
##  Mean   :0.0022  
##  3rd Qu.:0.0023  
##  Max.   :0.0307  
##  NA's   :508

The final dataset contains 957 tokens - unique words - and the NAs tell how many words are only present in one artist's corpus. Lorde uses 301 words Taylor Swift does not, and Taylor Swift uses 508 words that Lorde does not. That leaves 148 words on which they overlap.

There are many things we could do with these data, but let's visualize words and proportions, with one artist on the x-axis and the other on the y-axis.

ggplot(compare_words, aes(x=swift_prop, y=lorde_prop)) +
  geom_abline() +
  geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
  labs(y="Lorde", x="Taylor Swift") + theme_classic()
## Warning: Removed 809 rows containing missing values (geom_text).

The warning lets me know there are 809 rows with missing values - those are the words only present in one artist's corpus. Words that fall on or near the line are used at similar rates between artists. Words above the line are used more by Lorde than Taylor Swift, and words below the line are used more by Taylor Swift than Lorde. This tells us that, for instance, Lorde uses "love," "light," and, yes, "shit," more than Swift, while Swift uses "call," "wanna," and "hands" more than Lorde. They use words like "waiting," "heart," and "dreams" at similar rates. Rates are low overall, but if you look at the max values for the proportion variables, Swift's most common word only accounts for about 2.3% of her total words; Lorde's most common word only accounts for about 3.1% of her total words.

This highlights why it's important to remove stop words for these types of analyses; otherwise, our datasets and chart would be full of words like "the," "a", and "and."
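If you're curious, a quick way to see this for yourself is to rerun the counting step on one album without the anti_join - just a throwaway check, not part of the analysis above:

# Without anti_join(stop_words), the top "words" are function words
swift_lyrics %>%
  unnest_tokens(word, lyric) %>%
  count(word, sort = TRUE) %>%
  head()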

Next Statistics Sunday, we'll take a look at sentiment analysis!

Friday, May 11, 2018

Neuroscience, Dopamine, and Why We Struggle to Read

I'm a proud bookworm. Each year I challenge myself to read a certain number of books, and do so publicly, thanks to Goodreads. Last year, I read 53 books. This year, I challenged myself to read 60. I was doing really well. Then April and May happened, and with them Blogging A to Z, multiple events and performances, work insanity, and some major life stuff. I found it harder to make time for and concentrate on reading.

I got off track, and was disheartened when I logged into Goodreads and saw that I was behind schedule.


I know I should be proud that I've read 19 books already this year, but that "2 books behind schedule" keeps drawing my attention away from the thing I should be proud of.

And I'm not alone. A lot of people are having difficulty concentrating on and enjoying their time with books. We get distracted by a variety of things, including phone and email. So it's good timing that someone shared with me this article by Hugh McGuire, who built his life on books and reading, and discusses his own difficulty with getting through his ever-growing reading list:
This sickness is not limited to when I am trying to read, or once-in-a-lifetime events with my daughter.

At work, my concentration is constantly broken: finishing writing an article (this one, actually), answering that client’s request, reviewing and commenting on the new designs, cleaning up the copy on the About page. Contacting so and so. Taxes.

It turns out that digital devices and software are finely tuned to train us to pay attention to them, no matter what else we should be doing. The mechanism, borne out by recent neuroscience studies, is something like this:
  • New information creates a rush of dopamine to the brain, a neurotransmitter that makes you feel good.
  • The promise of new information compels your brain to seek out that dopamine rush.
  • With fMRIs, you can see the brain’s pleasure centres light up with activity when new emails arrive.
So, every new email you get gives you a little flood of dopamine. Every little flood of dopamine reinforces your brain’s memory that checking email gives a flood of dopamine. And our brains are programmed to seek out things that will give us little floods of dopamine. Further, these patterns of behaviour start creating neural pathways, so that they become unconscious habits: Work on something important, brain itch, check email, dopamine, refresh, dopamine, check Twitter, dopamine, back to work. Over and over, and each time the habit becomes more ingrained in the actual structures of our brains.

There is a famous study of rats, wired up with electrodes on their brains. When the rats press a lever, a little charge gets released in part of their brain that stimulates dopamine release. A pleasure lever.

Given a choice between food and dopamine, they’ll take the dopamine, often up to the point of exhaustion and starvation. They’ll take the dopamine over sex. Some studies see the rats pressing the dopamine lever 700 times in an hour.

We do the same things with our email. Refresh. Refresh.
And it's true - sometimes when I've made time to read, I find myself distracted by the digital world: Are there new posts on the blogs I follow? What's going on on Facebook? Hey, Postmodern Jukebox has a new video!

After I put my phone or computer down and pick up the book again, I sometimes have to reread a bit to remind myself where I was or because I wasn't really paying attention the first time I read a paragraph, distracted by what else might be going on in the world.

What can we do to change this? Hugh McGuire decided to set some rules for himself, such as keeping himself from checking Twitter and Facebook during certain times. What about you, readers? Any rules you make for yourself to keep your mind on the task at hand?