Showing posts with label music. Show all posts

Saturday, December 19, 2020

Some Music for Your Holidays

Hey everyone,

One thing I've been doing during the pandemic is making music on my own. For the holiday season, I dropped my very first album: Winter Delights. You can read about the album and download tracks here, or stream it on SoundCloud. I'm working on more arrangements (and have upgraded my audio recording equipment), so I'm hoping to drop a full album early in the New Year!

And to give you a little extra something, here's a selection of performances from my choir's annual cabaret benefit, Apollo After Hours:

Wednesday, August 12, 2020

Creating Things

Normally, this time of year, we'd be getting excited for my choir's new season and the start of rehearsals in early September. Sadly, with the pandemic, it's unlikely we'll be getting together then, and I'm not sure how long it will be before it's safe and people feel comfortable gathering that way again. So I've been seeking out ways to keep some creativity in my life.

I've started drawing again, something I haven't done in years. I'm a bit rusty, but hey - practice, practice, right? I started with some pretty flowers from my parents' backyard, in a combination of soft chalk pastels (my favorite medium) and colored pencil:

And my next project is going to be a self-portrait, something I've never done before. Here's some early pencil work that I'll fill in soon (again thinking a combination of colored pencil and chalk pastels):

I also had some fun putting together a Lego Architecture set of Paris:




What mainly sparked this round of creativity was writing and recording an arrangement for my choir's virtual benefit. I had so much fun with that, I'm going to keep doing it! I'm planning to share that video soon, and have also started recording some other a cappella arrangements I plan on sharing. 

And lastly, because I needed to bring Zep into the fun too, I've finally set up an Instagram for him. If you're on the 'gram, you can follow him here: https://www.instagram.com/zeppelinblackdog/

Tuesday, July 7, 2020

Free Virtual Concert!

One of my hobbies is singing, and for the last 15 years I've been a member of the Apollo Chorus of Chicago. As with many musical arts organizations, we canceled our Spring concerts, including our annual Apollo After Hours benefit, due to COVID-19. It's unclear when music organizations will be able to hold in-person concerts again - possibly not for years.

But that doesn't mean we can't make - and share - beautiful music with you. On Friday, July 17 at 7PM, we'll be broadcasting our annual benefit as a free, virtual performance. Lots of singers in my choir have created videos to be included in the broadcast, including me! Here's a photo preview:


I'll be performing an a cappella arrangement I wrote of a Sara Bareilles song, "Breathe Again." If you want to hear it, you'll have to tune in! Find out more and sign up to get the link once it goes live here.

Thursday, September 5, 2019

Mad Tangerine-Colored Commissar

If you haven't already, you must check out Randy Rainbow's brilliant showtunes medley/political commentary:

Tuesday, September 11, 2018

No Take Bachs

About a week ago, Boing Boing published a story with a shocking claim: you can't post performance videos of Bach's music because Sony owns the compositions.

Wait, what?
James Rhodes, a pianist, performed a Bach composition for his Facebook account, but it didn't go up -- Facebook's copyright filtering system pulled it down and accused him of copyright infringement because Sony Music Global had claimed that they owned 47 seconds' worth of his personal performance of a song whose composer has been dead for 300 years.
You don't need to be good at math to know that this claim must be false. Sony can't possibly own compositions that are clearly in the public domain. What this highlights, though, is that something can be untrue in theory yet true in practice. Free Beacon explains:
As it happens, the company genuinely does hold the copyright for several major Bach recordings, a collection crowned by Glenn Gould's performances. The YouTube claim was not that Sony owned Bach's music in itself. Rather, YouTube conveyed Sony's claim that Rhodes had recycled portions of a particular performance of Bach from a Sony recording.

The fact that James Rhodes was actually playing should have been enough to halt any sane person from filing the complaint. But that's the real point of the story. No sane person was involved, because no actual person was involved. It all happened mechanically, from the application of the algorithms in YouTube's Content ID system. A crawling bot obtained a complex digital signature for the sound in Rhodes's YouTube posting. The system compared that signature to its database of registered recordings and found a partial match of 47 seconds. The system then automatically deleted the video and sent a "dispute claim" to Rhodes's YouTube channel. It was a process entirely lacking in human eyes or human ears. Human sanity, for that matter.
Does Sony own the copyright on Bach in theory? No, absolutely not. But this system, which scans for similarity in the audio, is making the claim true in practice: performers of Bach's music will be flagged automatically as using copyrighted content, then hit with takedown notices or have their videos deleted altogether. There's only so much one can do with interpretation and tempo to change the sound, and while the performer's skill also affects the audio, to a computer, the same notes at the same tempo will sound the same.

Automation is being applied to this and many related processes to remove the bias of human judgment. This leads to a variety of problems, with technology running rampant and affecting lives, as highlighted in recent books like Weapons of Math Destruction and Technically Wrong.

A human being watching Rhodes's video would be able to tell right away that no copyright infringement took place. Rhodes was playing the same composition heard in a recording that Sony owns - the same source material, which is clearly in the public domain, not the same recording, which is not.

This situation is also being twisted into a way to make money:
[T]he German music professor Ulrich Kaiser wanted to develop a YouTube channel with free performances for teaching classical music. The first posting "explained my project, while examples of the music played in the background. Less than three minutes after uploading, I received a notification that there was a Content ID claim against my video." So he opened a different YouTube account called "Labeltest" to explore why he was receiving claims against public-domain music. Notices from YouTube quickly arrived for works by Bartok, Schubert, Puccini, Wagner, and Beethoven. Typically, they read, "Copyrighted content was found in your video. The claimant allows its content to be used in your YouTube video. However, advertisements may be displayed."

And that "advertisements may be displayed" is the key. Professor Kaiser wanted an ad-free channel, but his attempts to take advantage of copyright-free music quickly found someone trying to impose advertising on him—and thereby to claim some of the small sums that advertising on a minor YouTube channel would generate.

Last January, an Australian music teacher named Sebastian Tomczak had a similar experience. He posted on YouTube a 10-hour recording of white noise as an experiment. "I was interested in listening to continuous sounds of various types, and how our perception of these kinds of sounds and our attention changes over longer periods," he wrote of his project. Most listeners would probably wonder how white noise, chaotic and random by its nature, could qualify as a copyrightable composition (and wonder as well how anyone could get through 10 hours of it). But within days, the upload had five different copyright claims filed against it. All five would allow continued use of the material, the notices explained, if Tomczak allowed the upload to be "monetized," meaning accompanied by advertisements from which the claimants would get a share.
 

Sunday, September 9, 2018

150 Years in Chicago

This morning, I joined some of my musician friends to sing for a mass celebrating the 150th anniversary of St. Thomas the Apostle in Hyde Park, Chicago. In true music-lover fashion, the mass was also part concert, featuring some gorgeous choral works, including:
  • Kyrie eleison from Vierne's Messe Solennelle: my choir is performing this work in our upcoming season, and I give 4 to 1 odds that this movement, the Kyrie, ends up in our Fall Preview concert (which I unfortunately won't be singing, due to a work commitment)
  • Locus iste by Bruckner
  • Thou knowest, Lord by Purcell
  • Ave Verum Corpus by Mozart
  • Amen from Handel's Messiah (which my choir performs every December)
The choir was small but mighty - 3 sopranos, 4 altos, 4 tenors, and 5 basses - and in addition to voices, we had violin, trumpet, cello, and organ. The music was gorgeous. Afterward, I ate way too much food at the church picnic.

I'm tempted to take an afternoon nap. Instead, I'll be packing up my apartment.

Saturday, August 25, 2018

Happy Birthday, Leonard Bernstein

Today's Google Doodle celebrates the 100th birthday of Leonard Bernstein:


This year, multiple performing arts organizations have joined in the celebration by performing some of Bernstein's works. But, given that today is the big day, I feel like I should spend my time sitting and listening to some of my favorites.

If you're not sure where to begin, I highly recommend this list from Vox - 5 of Bernstein's musical masterpieces.

Wednesday, August 8, 2018

Cowboy Bebop 20 Years Later

I've been asked before what made me fall in love with corgis. Did I idolize British royalty? Did I have one or know someone who had one as a child? Did I just think their stumpy legs and enormous ears were adorable? (Well, yeah.)

But the real reason? I started watching Cowboy Bebop in college and fell in love with Ein, the adorable corgi on the show. Yes, I love corgis because I loved this anime.


For anyone who has seen this anime, all I have to do is say I love Cowboy Bebop and they furiously nod in agreement, perhaps singing some of their favorite songs from the show or doing their best Spike Spiegel impression. For anyone who hasn't seen it, it's difficult to describe the show because it defies categorization. But this article does a wonderful job of summing it up:
Cowboy Bebop is its own thing. You could call it a space western. That’s, however, like saying Beyoncé is a good dancer. It’s true. But it’s not the whole picture. Not even close. Cowboy Bebop is a mashup of noir films, spaghetti westerns, urban thrillers, Kurosawa samurai films, classic westerns and sci-fi space adventures.

Again, though, all that genre bending is just part of the equation. The show resonates so deeply because it’s a mirror in which you can see yourself, and how we all wrestle with life. This is what makes Cowboy Bebop great art. It’s a beautifully complex, aesthetically striking meditation on how we deal with love, loss, luck and that inescapable question: Why should I give a fuck?
It's been 20 years since Cowboy Bebop premiered, and those 26 episodes and 1 movie remain relevant and influential today. Time to rewatch the show, I think.

Sunday, July 29, 2018

Statistics Sunday: More Text Analysis - Term Frequency and Inverse Document Frequency

As a mixed methods researcher, I love working with qualitative data, but I also love the idea of using quantitative methods to add some meaning and context to the words. This is the main reason I've started digging into R for text mining, and those skills have paid off not only in fun blog posts about Taylor Swift, Lorde, and "Hotel California", but also in analyzing data for my job (blog post about that coming soon). So today, I thought I'd move on to two more tools you can use in text analysis: term frequency and inverse document frequency.

These tools are useful when you have multiple documents you're analyzing, such as interview text from different people or books by the same author. For my demonstration today, I'll be using (what else?) song lyrics, this time from Florence + the Machine (one of my all-time favorites), who just dropped a new album, High as Hope. So let's get started by pulling in those lyrics.

library(geniusR)

high_as_hope <- genius_album(artist = "Florence the Machine", album = "High as Hope")
## Joining, by = c("track_title", "track_n", "track_url")
library(tidyverse)
library(tidytext)
tidy_hope <- high_as_hope %>%
  unnest_tokens(word,lyric) %>%
  anti_join(stop_words)
## Joining, by = "word"
head(tidy_hope)
## # A tibble: 6 x 4
##   track_title track_n  line word   
##   <chr>         <int> <int> <chr>  
## 1 June              1     1 started
## 2 June              1     1 crack  
## 3 June              1     2 woke   
## 4 June              1     2 chicago
## 5 June              1     2 sky    
## 6 June              1     2 black

Now we have a tidy dataset with stop words removed. Before we go any further, let's talk about the tools we're going to apply. Often, when we analyze text, we want to discover what different documents are about - what are their topics or themes? One way to do that is to look at the common words used in a document, which can tell us something about its theme. An overall measure of how often a term comes up in a particular document is term frequency (TF).

Removing stop words is an important step before looking at TF, because otherwise the highest-frequency words wouldn't be very meaningful - they'd be the words that fill every sentence, like "the" or "a." But there might still be many common words that don't get weeded out by our stop word anti-join, and it's often the less frequently used words that tell us something about a document's meaning. This is where inverse document frequency (IDF) comes in: it takes into account how common a word is across the full set of documents, giving higher weight to words that appear in only a few documents and lower weight to words that appear in many. This means a word used a great deal in one song but very little in the other songs will have a higher IDF.
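
To make the arithmetic concrete, here's a minimal sketch that computes those same quantities by hand with dplyr, using the tidy_hope data frame we just built (the bind_tf_idf function used below does this calculation for us):

n_tracks <- n_distinct(tidy_hope$track_title)

manual_tf_idf <- tidy_hope %>%
  count(track_title, word) %>%
  group_by(track_title) %>%
  mutate(tf = n / sum(n)) %>%            # term frequency: share of this track's words
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(n_tracks / n())) %>%  # n() = number of tracks containing the word
  ungroup() %>%
  mutate(tf_idf = tf * idf) %>%
  arrange(desc(tf_idf))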

We can use these two values at the same time, by multiplying them together to form TF-IDF, which tells us the frequency of the term in a document adjusted for how common it is across a set of documents. And thanks to the tidytext package, these values can be automatically calculated for us with the bind_tf_idf function. First, we need to reformat our data a bit, by counting use of each word by song. We do this by referencing the track_title variable in our count function, which tells R to group by this variable, followed by what we want R to count (the variable called word).

song_words <- tidy_hope %>%
  count(track_title, word, sort = TRUE) %>%
  ungroup()

The bind_tf_idf function needs 3 arguments: word (or whatever we called the variable containing our words), the document indicator (in this case, track_title), and the word counts by document (n).

song_words <- song_words %>%
  bind_tf_idf(word, track_title, n) %>%
  arrange(desc(tf_idf))

head(song_words)
## # A tibble: 6 x 6
##   track_title     word          n    tf   idf tf_idf
##   <chr>           <chr>     <int> <dbl> <dbl>  <dbl>
## 1 Hunger          hunger       25 0.236  2.30  0.543
## 2 Grace           grace        16 0.216  2.30  0.498
## 3 The End of Love wash         18 0.209  2.30  0.482
## 4 Hunger          ooh          20 0.189  2.30  0.434
## 5 Patricia        wonderful    10 0.125  2.30  0.288
## 6 100 Years       hundred      12 0.106  2.30  0.245

Some of the results are unsurprising - "hunger" is far more common in the track called "Hunger" than any other track, "grace" is more common in "Grace", and "hundred" is more common in "100 Years". But let's explore the different words by plotting the highest tf-idf for each track. To keep the plot from getting ridiculously large, I'll just ask for the top 5 for each of the 10 tracks.

song_words %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(track_title) %>%
  top_n(5) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = track_title)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~track_title, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf


Some tracks have more than 5 words listed, because of ties, but this plot helps us to look for commonalities and differences across the tracks. There is a strong religious theme across many of the tracks, with concepts like "pray", "god", "grace", and "angel" coming up in many tracks. The song "Patricia" uses many positively-valenced words like "wonderful" and "believer". "No Choir" references music-themed words. And "Sky Full of Song" references things that fly (like "arrow") and things in the sky (like "thunder").

What does Florence Welch have to say about the meaning behind this album?
There is loneliness in this record, and there's issues, and pain, and things that I struggled with, but the overriding feeling is that I have hope about them, and that's what kinda brought me to this title; I was gonna call it The End of Love, which I actually saw as a positive thing cause it was the end of a needy kind of love, it was the end of a love that comes from a place of lack, it's about a love that's bigger and broader, that takes so much explaining. It could sound a bit negative but I didn't really think of it that way.

She's also mentioned that High as Hope is the first album she made sober, so her past struggles with addiction are certainly also a theme of the album. And many reviews of the album (like this one) talk about the meaning and stories behind the music. This information can provide some context to the TF-IDF results.

Friday, June 22, 2018

Thursday, June 14, 2018

Working with Your Facebook Data in R

I recently learned that you can download all of your Facebook data, so I decided to check it out and bring it into R. To access your data, go to Facebook, and click on the white down arrow in the upper-right corner. From there, select Settings, then, from the column on the left, "Your Facebook Information." When you get the Facebook Information screen, select "View" next to "Download Your Information." On this screen, you'll be able to select the kind of data you want, a date range, and format. I only wanted my posts, so under "Your Information," I deselected everything but the first item on the list, "Posts." (Note that this will still download all photos and videos you posted, so it will be a large file.) To make it easy to bring into R, I selected JSON under Format (the other option is HTML).


After you click "Create File," it will take a while to compile - you'll get an email when it's ready. You'll need to reenter your password when you go to download the file.

The result is a Zip file, which contains folders for Posts, Photos, and Videos. Posts includes your own posts (on your and others' timelines) as well as posts from others on your timeline. And, of course, the file needed a bit of cleaning. Here's what I did.
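
As a quick aside, you can peek at what's inside the archive before extracting it. Here's a minimal sketch using base R, assuming the download kept a name like facebook-saralocatelli35.zip (yours will match your own username):

# List the folders and files in the archive without extracting it
unzip("facebook-saralocatelli35.zip", list = TRUE)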

Since the post data is a JSON file, I need the jsonlite package to read it.

setwd("C:/Users/slocatelli/Downloads/facebook-saralocatelli35/posts")
library(jsonlite)

FBposts <- fromJSON("your_posts.json")

This creates a large list object, with my data in a data frame. So as I did with the Taylor Swift albums, I can pull out that data frame.

myposts <- FBposts$status_updates

The resulting data frame has 5 columns: timestamp, which is in UNIX format; attachments, any photos, videos, URLs, or Facebook events attached to the post; title, which always starts with the author of the post (you or your friend who posted on your timeline) followed by the type of post; data, the text of the post; and tags, the people you tagged in the post.
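
Before cleaning, it's worth taking a quick look at the structure yourself. A minimal check with base R (the exact columns and their contents will vary depending on what you included in your download):

# Inspect the columns Facebook provided; nested columns like attachments stay collapsed
str(myposts, max.level = 1)
names(myposts)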

First, I converted the timestamp to datetime, using the anytime package.

library(anytime)

myposts$timestamp <- anytime(myposts$timestamp)

Next, I wanted to pull out post author, so that I could easily filter the data frame to only use my own posts.

library(tidyverse)
myposts$author <- word(string = myposts$title, start = 1, end = 2, sep = fixed(" "))

Finally, I was interested in extracting URLs I shared (mostly from YouTube or my own blog) and the text of my posts, which I did with some regular expression functions and some help from Stack Overflow (here and here).

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

myposts$links <- str_extract(myposts$attachments, url_pattern)

library(qdapRegex)
myposts$posttext <- myposts$data %>%
  rm_between('"','"',extract = TRUE)

There's more cleaning I could do, but this gets me a data frame I could use for some text analysis. Let's look at my most frequent words.

myposts$posttext <- as.character(myposts$posttext)
library(tidytext)
mypost_text <- myposts %>%
  unnest_tokens(word, posttext) %>%
  anti_join(stop_words)
## Joining, by = "word"
counts <- mypost_text %>%
  filter(author == "Sara Locatelli") %>%
  drop_na(word) %>%
  count(word, sort = TRUE)

counts
## # A tibble: 9,753 x 2
##    word         n
##    <chr>    <int>
##  1 happy     4702
##  2 birthday  4643
##  3 today's    666
##  4 song       648
##  5 head       636
##  6 day        337
##  7 post       321
##  8 009f       287
##  9 ð          287
## 10 008e       266
## # ... with 9,743 more rows

These data include all my posts, including writing "Happy birthday" on others' timelines. I also frequently post the song in my head when I wake up in the morning (over 600 times, it seems). If I wanted to remove those, and only count the times I said "happy" or "song" outside of those posts, I'd need to apply a filter at an earlier step (sketched below). There are also some strange characters that I want to clean from the data before I do anything else with them. I can easily remove pure numbers and stray characters with str_detect, but cells that contain both numbers and letters, such as "008e", won't be caught that way. So I'll just filter those out separately.

drop_nums <- c("008a","008e","009a","009c","009f")

counts <- counts %>%
  filter(str_detect(word, "[a-z]+"),   # keep only words that contain letters (drops pure numbers)
         !word %in% drop_nums)         # drop the leftover letter-and-number codes
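
And as mentioned above, if I wanted to drop the birthday and head-song posts entirely rather than just noting them, I'd filter before tokenizing. A minimal sketch, with hypothetical patterns you'd want to tune to your own posting habits:

filtered_posts <- myposts %>%
  filter(author == "Sara Locatelli",
         !str_detect(tolower(posttext), "happy birthday"),   # hypothetical pattern: drop birthday wishes
         !str_detect(tolower(posttext), "song in my head"))  # hypothetical pattern: drop morning head-song posts

filtered_posts could then go through the same unnest_tokens and anti_join steps as before.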

Now I could, for instance, create a word cloud.

library(wordcloud)
counts %>%
  with(wordcloud(word, n, max.words = 50))

In addition to posting for birthdays and head songs, I talk a lot about statistics, data, analysis, and my blog. I also post about beer, concerts, friends, books, and Chicago. Let's see what happens if I mix some sentiment analysis into my word cloud.

library(reshape2)
## 
## Attaching package: 'reshape2'
counts %>%
  inner_join(get_sentiments("bing")) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red","blue"), max.words = 100)
## Joining, by = "word"

Once again, a few words are likely being misclassified - regression and plot are both negatively-valenced, but I imagine I'm using them in the statistical sense instead of the negative sense. I also apparently use "died" or "die" but I suspect in the context of, "I died laughing at this." And "happy" is huge, because it includes birthday wishes as well as instances where I talk about happiness. Some additional cleaning and exploration of the data is certainly needed. But that's enough to get started with this huge example of "me-search."

Tuesday, May 22, 2018

How Has Taylor Swift's Word Choice Changed Over Time?

Sunday night was a big night for Taylor Swift - not only was she nominated for multiple Billboard Music Awards, she also took home Top Female Artist and Top Selling Album. So I thought it was a good time for some more Taylor Swift-themed statistical analysis.

When I started this blog back in 2011, my goal was to write deep thoughts on trivial topics - specifically, to overthink and overanalyze pop culture and related topics that appear fluffy until you really dig into them. Recently, I've been blogging more about statistics, research, R, and data science, and I've loved getting to teach and share.

But sometimes, you just want to overthink and overanalyze pop culture.

So in a similar vein to the text analysis I've been demonstrating on my blog, I decided to answer a question I'm sure we all have - as Taylor Swift moved from country sweetheart to mega pop star, how have the words she uses in her songs changed?

I've used the geniusR package in a couple of posts, and I'll be using it again today to answer this question. I'll also be pulling in some additional code: some based on the Text Mining with R: A Tidy Approach book I recently devoured, and some written to tackle the problem I've set for myself. I've shared all my code and tried to credit those who helped me write it wherever I can.

First, we want to pull in the names of Taylor Swift's 6 studio albums. I found these and their release dates on Wikipedia. While there are only 6 and I could easily copy and paste them to create my data frame, I wanted to pull that data directly from Wikipedia, to write code that could be used on a larger set in the future. Thanks to this post, I could, with a couple small tweaks.

library(rvest)
## Loading required package: xml2
TSdisc <- 'https://en.wikipedia.org/wiki/Taylor_Swift_discography'

disc <- TSdisc %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>%
  html_table(fill = TRUE)

Since html() is deprecated, I replaced it with read_html(), and I got errors if I didn't add fill = TRUE. The result is a list of 1, with an 8 by 14 data frame within that single list object. I can pull that out as a separate data frame.

TS_albums <- disc[[1]]

The data frame requires a little cleaning. First up, there are 8 rows, but only 6 albums. Because the Wikipedia table had a double header, the second header was read in as a row of data, so I want to delete that, because I only care about the first two columns anyway. The last row contains a footnote that was included with the table. So I removed those two rows, first and last, and dropped the columns I don't need. Second, the information I want with release date was in a table cell along with record label and formats (e.g., CD, vinyl). I don't need those for my purposes, so I'll only pull out the information I want and drop the rest. Finally, I converted year from character to numeric - this becomes important later on.

library(tidyverse)
TS_albums <- TS_albums[2:7,1:2] %>%
  separate(`Album details`, c("Released","Month","Day","Year"),
           extra='drop') %>%
  select(c("Title","Year"))

TS_albums$Year<-as.numeric(TS_albums$Year)

I asked geniusR to download lyrics for all 6 albums. (Note: this code may take a couple minutes to run.) It nests all of the individual album data, including lyrics, into a single column, so I just need to unnest that to create a long file, with album title and release year applied to each unnested line.

library(geniusR)

TS_lyrics <- TS_albums %>%
  mutate(tracks = map2("Taylor Swift", Title, genius_album))
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
## Joining, by = c("track_title", "track_n", "track_url")
TS_lyrics <- TS_lyrics %>%
  unnest(tracks)

Now we'll tokenize our lyrics data frame, and start doing our word analysis.

library(tidytext)

tidy_TS <- TS_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_TS %>%
  count(word, sort = TRUE)
## # A tibble: 2,024 x 2
##    word      n
##    <chr> <int>
##  1 time    198
##  2 love    180
##  3 baby    118
##  4 ooh     104
##  5 stay     89
##  6 night    85
##  7 wanna    84
##  8 yeah     83
##  9 shake    80
## 10 ey       72
## # ... with 2,014 more rows

There are a little over 2,000 unique words across TS's 6 albums. But how have they changed over time? To examine this, I'll create a dataset that counts each word by year (or album, really). Then I'll use a binomial regression model to look at changes over time, one model per word. In their book, Julia Silge and David Robinson demonstrated how to use binomial regression to examine word use on the authors' Twitter accounts over time, including an adjustment to the p-values to correct for multiple comparisons. So I based my code on theirs.

words_by_year <- tidy_TS %>%
  count(Year, word) %>%
  group_by(Year) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 50)

nested_words <- words_by_year %>%
  nest(-word)

word_models <- nested_words %>%
  mutate(models = map(data, ~glm(cbind(count, time_total) ~ Year, .,
                                 family = "binomial")))

This nests our regression results in a data frame called word_models. While I could unnest and keep everything, I don't care about every value the GLM gives me. What I care about is the slope for Year, so the filter below selects only that slope and its associated p-value. I can then filter again to keep the significant or marginally significant slopes for plotting (p < 0.1).

library(broom)

slopes <- word_models %>%
  unnest(map(models, tidy)) %>%
  filter(term == "Year") %>%
  mutate(adjusted.p.value = p.adjust(p.value))

top_slopes <- slopes %>%
  filter(adjusted.p.value < 0.1) %>%
  select(-statistic, -p.value)

This gives me five words that show changes in usage over time: bad, call, dancing, eyes, and yeah. We can plot those five words to see how they've changed in usage over her 6 albums. And because I still have my TS_albums data frame, I can use that information to label the axis of my plot (which is why I needed year to be numeric). I also added a vertical line and annotations to note where TS believes she shifted from country to pop.

library(scales)
words_by_year %>%
  inner_join(top_slopes, by = "word") %>%
  ggplot(aes(Year, count/time_total, color = word, lty = word)) +
  geom_line(size = 1.3) +
  labs(x = NULL, y = "Word Frequency") +
  scale_x_continuous(breaks=TS_albums$Year,
                     labels=TS_albums$Title) +
  scale_y_continuous(labels=scales::percent) +
  geom_vline(xintercept = 2014) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank()) +
  annotate("text", x = c(2009.5,2015.5), y = c(0.025,0.025),
           label = c("Country", "Pop") , size=5)


The biggest change appears to be in the word "call," which she didn't use at all in her self-titled album, and used at low rates until "1989" and, especially, "Reputation." I can ask for a few examples of "call" in her song lyrics, with grep.

library(expss)
callsubset <- TS_lyrics[grep("call", TS_lyrics$lyric),]
callsubset <- callsubset %>%
  select(Title, Year, track_title, lyric)
set.seed(2012)
callsubset<-callsubset[sample(nrow(callsubset), 3), ]
callsubset<-callsubset[order(callsubset$Year),]
as.etable(callsubset, rownames_as_row_labels = FALSE)
Title       Year  track_title                  lyric
Speak Now   2010  Back to December (Acoustic)  When your birthday passed, and I didn't call
Red         2012  All Too Well                 And you call me up again just to break me like a promise
Reputation  2017  Call It What You Want        Call it what you want, call it what you want, call it
On the other hand, she doesn't sing about "eyes" as much now that she's moved from country to pop.

eyessubset <- TS_lyrics[grep("eyes", TS_lyrics$lyric),]
eyessubset <- eyessubset %>%
  select(Title, Year, track_title, lyric)
set.seed(415)
eyessubset<-eyessubset[sample(nrow(eyessubset), 3), ]
eyessubset<-eyessubset[order(eyessubset$Year),]
as.etable(eyessubset, rownames_as_row_labels = FALSE)
Title         Year  track_title             lyric
Taylor Swift  2006  A Perfectly Good Heart  And realized by the distance in your eyes that I would be the one to fall
Speak Now     2010  Better Than Revenge     I'm just another thing for you to roll your eyes at, honey
Red           2012  State of Grace          Just twin fire signs, four blue eyes
Bet you'll never listen to Taylor Swift the same way again.

A few notes: I opted to examine any slopes with p < 0.10, which is more lenient than conventional levels of significance; if you look at the adjusted p-value column, though, you'll see that 4 of the 5 are < 0.05 and one is only slightly greater than 0.05. But I made the somewhat arbitrary choice to include only words used more than 50 times across her 6 albums, so I could get different results by changing that filtering value when I create the words_by_year data frame. Feel free to play around and see what you get by using different values!
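
For example, loosening that cutoff is a small change to the same pipeline (just a sketch - keeping rarer words gives the models more candidates, but also more noise):

# Same counting steps as above, but keep any word used more than 25 times
words_by_year_25 <- tidy_TS %>%
  count(Year, word) %>%
  group_by(Year) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 25)   # was 50 above; this feeds into the same nest/glm steps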

Sunday, May 20, 2018

Statistics Sunday: Welcome to Sentiment Analysis with "Hotel California"

As promised in last week's post, this week: sentiment analysis, also with song lyrics.

Sentiment analysis is a method of natural language processing that involves classifying words in a document based on whether a word is positive or negative, or whether it is related to a set of basic human emotions; the exact results differ based on the sentiment analysis method selected. The tidytext R package has 4 different sentiment analysis methods:
  • "AFINN" for Finn Ã…rup Nielsen - which classifies words from -5 to +5 in terms of negative or positive valence
  • "bing" for Bing Liu and colleagues - which classifies words as either positive or negative
  • "loughran" for Loughran-McDonald - mostly for financial and nonfiction works, which classifies as positive or negative, as well as topics of uncertainty, litigious, modal, and constraining
  • "nrc" for the NRC lexicon - which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment
Sentiment analysis works on unigrams - single words - but you can aggregate across multiple words to look at sentiment across a text.
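
If you want to see how these lexicons code words before choosing one, you can pull each one up directly with get_sentiments. A quick sketch (depending on your tidytext version, some lexicons may require a one-time download):

library(tidytext)
library(dplyr)

get_sentiments("bing") %>% head()            # word plus positive/negative label
get_sentiments("nrc") %>% count(sentiment)   # how many words carry each emotion or valence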

To demonstrate sentiment analysis, I'll use one of my favorite songs: "Hotel California" by the Eagles.

I know, I know.

The Dude has had a rough night and he hates the f'ing Eagles.

Using similar code as last week, let's pull in the lyrics of the song.

library(geniusR)
library(tidyverse)
hotel_calif <- genius_lyrics(artist = "Eagles", song = "Hotel California") %>%
  mutate(line = row_number())

First, we'll chop up these 43 lines into individual words, using the tidytext package and unnest_tokens function.

library(tidytext)
tidy_hc <- hotel_calif %>%
  unnest_tokens(word,lyric)

This is also probably the point where I would normally remove stop words with anti_join. But those common words are very unlikely to have a sentiment attached to them, so I'll leave them in, knowing they'll be filtered out by this analysis anyway. We have 4 lexicons to choose from. Loughran is geared toward financial and nonfiction text, but we'll see how well it can classify these lyrics anyway. First, let's create a data frame of our 4 sentiment lexicons.

new_sentiments <- sentiments %>%
  mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
                             ifelse(lexicon == "AFINN" & score < 0,
                                    "negative", sentiment))) %>%
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()

Now, we'll see how well the 4 lexicons match up with the words in the lyrics. Big thanks to Debbie Liske at Data Camp for this piece of code (and several other pieces used in this post):

my_kable_styling <- function(dat, caption) {
  kable(dat, "html", escape = FALSE, caption = caption) %>%
    kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
                  full_width = FALSE)
}


library(kableExtra)
library(formattable)
library(yarrr)
tidy_hc %>%
  mutate(words_in_lyrics = n_distinct(word)) %>%
  inner_join(new_sentiments) %>%
  group_by(lexicon, words_in_lyrics, words_in_lexicon) %>%
  summarise(lex_match_words = n_distinct(word)) %>%
  ungroup() %>%
  mutate(total_match_words = sum(lex_match_words),
         match_ratio = lex_match_words/words_in_lyrics) %>%
  select(lexicon, lex_match_words, words_in_lyrics, match_ratio) %>%
  mutate(lex_match_words = color_bar("lightblue")(lex_match_words),
         lexicon = color_tile("lightgreen","lightgreen")(lexicon)) %>%
  my_kable_styling(caption = "Lyrics Found In Lexicons")
## Joining, by = "word"
Lyrics Found In Lexicons

lexicon    lex_match_words  words_in_lyrics  match_ratio
AFINN                   18              175    0.1028571
bing                    18              175    0.1028571
loughran                 1              175    0.0057143
nrc                     23              175    0.1314286

NRC offers the best match, classifying about 13% of the words in the lyrics. (It's not unusual to have such a low percentage. Not all words have a sentiment.)

hcsentiment <- tidy_hc %>%
  inner_join(get_sentiments("nrc"), by = "word")

hcsentiment
## # A tibble: 103 x 4
##    track_title       line word   sentiment
##    <chr>            <int> <chr>  <chr>    
##  1 Hotel California     1 dark   sadness  
##  2 Hotel California     1 desert anger    
##  3 Hotel California     1 desert disgust  
##  4 Hotel California     1 desert fear     
##  5 Hotel California     1 desert negative 
##  6 Hotel California     1 desert sadness  
##  7 Hotel California     1 cool   positive 
##  8 Hotel California     2 smell  anger    
##  9 Hotel California     2 smell  disgust  
## 10 Hotel California     2 smell  negative 
## # ... with 93 more rows

Let's visualize the counts of different emotions and sentiments in the NRC lexicon.

theme_lyrics <- function(aticks = element_blank(),
                         pgminor = element_blank(),
                         lt = element_blank(),
                         lp = "none")
{
  theme(plot.title = element_text(hjust = 0.5), #Center the title
        axis.ticks = aticks, #Set axis ticks to on or off
        panel.grid.minor = pgminor, #Turn the minor grid lines on or off
        legend.title = lt, #Turn the legend title on or off
        legend.position = lp) #Turn the legend on or off
}

hcsentiment %>%
  group_by(sentiment) %>%
  summarise(word_count = n()) %>%
  ungroup() %>%
  mutate(sentiment = reorder(sentiment, word_count)) %>%
  ggplot(aes(sentiment, word_count, fill = -word_count)) +
  geom_col() +
  guides(fill = FALSE) +
  theme_minimal() + theme_lyrics() +
  labs(x = NULL, y = "Word Count") +
  ggtitle("Hotel California NRC Sentiment Totals") +
  coord_flip()


Most of the words appear to be positively-valenced. How do the individual words match up?

library(ggrepel)

plot_words <- hcsentiment %>%
  group_by(sentiment) %>%
  count(word, sort = TRUE) %>%
  arrange(desc(n)) %>%
  ungroup()

plot_words %>%
  ggplot(aes(word, 1, label = word, fill = sentiment)) +
  geom_point(color = "white") +
  geom_label_repel(force = 1, nudge_y = 0.5,
                   direction = "y",
                   box.padding = 0.04,
                   segment.color = "white",
                   size = 3) +
  facet_grid(~sentiment) +
  theme_lyrics() +
  theme(axis.text.y = element_blank(), axis.line.x = element_blank(),
        axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect("lightgray", fill = NA),
        strip.text.x = element_text(size = 9)) +
  xlab(NULL) + ylab(NULL) +
  ggtitle("Hotel California Words by NRC Sentiment") +
  coord_flip()


It looks like some words are being misclassified. For instance, "smell" as in "warm smell of colitas" is being classified as anger, disgust, and negative. But that doesn't explain the overall positive bent being applied to the song. If you listen to the song, you know it's not really a happy song. It starts off somewhat negative - or at least ambiguous - as the narrator is driving on a dark desert highway. He's tired and having trouble seeing, and notices the Hotel California, a shimmering oasis on the horizon. He stops in and is greeted by a "lovely face" in a "lovely place." At the hotel, everyone seems happy: they dance and drink, they have fancy cars, they have pretty "friends."

But the song is in a minor key. Though not always a sign that a song is sad, it is, at the very least, a hint of something ominous, lurking below the surface. Soon, things turn bad for the narrator. The lovely-faced woman tells him they are "just prisoners here of our own device." He tries to run away, but the night man tells him, "You can check out anytime you like, but you can never leave."

The song seems to be a metaphor for something, perhaps fame and excess, which was also the subject of another song on the same album, "Life in the Fast Lane." To someone seeking fame, life is dreary, dark, and deserted. Fame is like an oasis - beautiful and shimmering, an escape. But it isn't all it appears to be. You may be surrounded by beautiful people, but you can only call them "friends." You trust no one. And once you join that lifestyle, you might be able to check out, perhaps through farewell tour(s), but you can never leave that life - people know who you are (or were) and there's no disappearing. And it could be about something even darker that's hard to escape, like substance abuse. Whatever meaning you ascribe to the song, the overall message seems to be that things are not as wonderful as they appear on the surface.

So if we follow our own understanding of the song's trajectory, we'd say it starts off somewhat negatively, becomes positive in the middle, then dips back into the negative at the end, when the narrator tries to escape and finds he cannot.

We can chart this, using the line number, which coincides with the location of the word in the song. We'll stick with NRC since it offered the best match, but for simplicity, we'll only pay attention to the positive and negative sentiment codes.

hcsentiment_index <- tidy_hc %>%
  inner_join(get_sentiments("nrc")%>%
               filter(sentiment %in% c("positive",
                                       "negative"))) %>%
  count(index = line, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"

This gives us a data frame that aggregates sentiment by line. If a line contains more positive than negative words, its overall sentiment is positive, and vice versa. Because not every word in the lyrics has a sentiment, not every line has an associated aggregate sentiment. But it gives us a sort of trajectory over the course of the song. We can visualize this trajectory like this:

hcsentiment_index %>%
  ggplot(aes(index, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE)


As the chart shows, the song starts somewhat positive, with a dip soon after into the negative. The middle of the song is positive, as the narrator describes the decadence of the Hotel California. But it turns dark at the end, and stays that way as the guitar solo soars in.

Sources

This awesome post by Debbie Liske, mentioned earlier, for her code and custom functions to make my charts pretty.

Text Mining with R: A Tidy Approach by Julia Silge and David Robinson