Sunday, August 12, 2018

Statistics Sunday: Getting Started with the Russian Tweet Dataset

IRA Tweet Data You may have heard that two researchers at Clemson University analyzed almost 3 millions tweets from the Internet Research Agency (IRA) - a "Russian troll factory". In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

files <- c("IRAhandle_tweets_1.csv",
           "IRAhandle_tweets_2.csv",
           "IRAhandle_tweets_3.csv",
           "IRAhandle_tweets_4.csv",
           "IRAhandle_tweets_5.csv",
           "IRAhandle_tweets_6.csv",
           "IRAhandle_tweets_7.csv",
           "IRAhandle_tweets_8.csv",
           "IRAhandle_tweets_9.csv")
my_files <- paste0("~/Downloads/russian-troll-tweets-master/",files)

each_file <- function(file) {
  tweet <- read_csv(file) }

library(tidyverse)
tweet_data <- NULL
for (file in my_files) {
  temp <- each_file(file)
  temp$id <- sub(".csv", "", file)
  tweet_data <- rbind(tweet_data, temp)
}

Note that this is a large file, with 2,973,371 observations of 16 variables. Let's do some cleaning of this dataset first. The researchers, Darren Linvill and Patrick Warren, identified 5 majors types of trolls:
  • Right Troll: These Trump-supporting trolls voiced right-leaning, populist messages, but “rarely broadcast traditionally important Republican themes, such as taxes, abortion, and regulation, but often sent divisive messages about mainstream and moderate Republicans…They routinely denigrated the Democratic Party, e.g. @LeroyLovesUSA, January 20, 2017, “#ThanksObama We're FINALLY evicting Obama. Now Donald Trump will bring back jobs for the lazy ass Obamacare recipients,” the authors wrote.
  • Left Troll: These trolls mainly supported Bernie Sanders, derided mainstream Democrats, and focused heavily on racial identity, in addition to sexual and religious identity. The tweets were “clearly trying to divide the Democratic Party and lower voter turnout,” the authors told FiveThirtyEight.
  • News Feed: A bit more mysterious, news feed trolls mostly posed as local news aggregators who linked to legitimate news sources. Some, however, “tweeted about global issues, often with a pro-Russia perspective.”
  • Hashtag Gamer: Gamer trolls used hashtag games—a popular call/response form of tweeting—to drum up interaction from other users. Some tweets were benign, but many “were overtly political, e.g. @LoraGreeen, July 11, 2015, “#WasteAMillionIn3Words Donate to #Hillary.”
  • Fearmonger: These trolls, who were least prevalent in the dataset, spread completely fake news stories, for instance “that salmonella-contaminated turkeys were produced by Koch Foods, a U.S. poultry producer, near the 2015 Thanksgiving holiday.”
But a quick table of the results of the variable, account_category, shows 8 in the dataset.

table(tweet_data$account_category)
## 
##   Commercial   Fearmonger HashtagGamer    LeftTroll     NewsFeed 
##       122582        11140       241827       427811       599294 
##   NonEnglish   RightTroll      Unknown 
##       837725       719087        13905

The additional three are Commercial, Non-English, and Unknown. At the very least, we should drop the Non-English tweets, since those use Russian characters and any analysis I do will assume data are in English. I'm also going to keep only a few key variables. Then I'm going to clean up this dataset to remove links, because I don't need those for my analysis - I certainly wouldn't want to follow them to their destination. If I want to free up some memory, I can then remove the large dataset.

reduced <- tweet_data %>%
  select(author,content,publish_date,account_category) %>%
  filter(account_category != "NonEnglish")

library(qdapRegex)
## 
## Attaching package: 'qdapRegex'
reduced$content <- rm_url(reduced$content)

rm(tweet_data)

Now we have a dataset of 2,135,646 observations of 4 variables. I'm planning on doing some analysis on my own of this dataset - and will of course share what I find - but for now, I thought I'd repeat a technique I've covered on this blog and demonstrate a new one.

library(tidytext)

tweetwords <- reduced %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)
## Joining, by = "word"
wordcounts <- tweetwords %>%
  count(account_category, word, sort = TRUE) %>%
  ungroup()

head(wordcounts)
## # A tibble: 6 x 3
##   account_category word          n
##   <chr>            <chr>     <int>
## 1 NewsFeed         news     124586
## 2 RightTroll       trump     95794
## 3 RightTroll       rt        86970
## 4 NewsFeed         sports    47793
## 5 Commercial       workout   42395
## 6 NewsFeed         politics  38204

First, I'll conduct a TF-IDF analysis of the dataset. This code is a repeat from a previous post.

tweet_tfidf <- wordcounts %>%
  bind_tf_idf(word, account_category, n) %>%
  arrange(desc(tf_idf))

tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf


But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package, topicmodels. The only issue is that LDA requires a document term matrix. But we can easily convert our wordcounts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that LDA is a random technique, so we set a random number seed, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I'm going to try having it extract 6 topics. We can see how well they line up with the account types.

tweets_dtm <- wordcounts %>%
  cast_dtm(account_category, word, n)

library(topicmodels)
tweets_lda <- LDA(tweets_dtm, k = 6, control = list(seed = 42))
tweet_topics <- tidy(tweets_lda, matrix = "beta")

Now we can pull out the top terms from this analysis, and plot them to see how they lined up.

top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()


Based on these plots, I'd say the topics line up very well with the account categories, showing, in order: news feed, left troll, fear monger, right troll, hash gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.

No comments:

Post a Comment