Sunday, August 12, 2018

Statistics Sunday: Getting Started with the Russian Tweet Dataset

IRA Tweet Data You may have heard that two researchers at Clemson University analyzed almost 3 millions tweets from the Internet Research Agency (IRA) - a "Russian troll factory". In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

files <- c("IRAhandle_tweets_1.csv",
my_files <- paste0("~/Downloads/russian-troll-tweets-master/",files)

each_file <- function(file) {
  tweet <- read_csv(file) }

tweet_data <- NULL
for (file in my_files) {
  temp <- each_file(file)
  temp$id <- sub(".csv", "", file)
  tweet_data <- rbind(tweet_data, temp)

Note that this is a large file, with 2,973,371 observations of 16 variables. Let's do some cleaning of this dataset first. The researchers, Darren Linvill and Patrick Warren, identified 5 majors types of trolls:
  • Right Troll: These Trump-supporting trolls voiced right-leaning, populist messages, but “rarely broadcast traditionally important Republican themes, such as taxes, abortion, and regulation, but often sent divisive messages about mainstream and moderate Republicans…They routinely denigrated the Democratic Party, e.g. @LeroyLovesUSA, January 20, 2017, “#ThanksObama We're FINALLY evicting Obama. Now Donald Trump will bring back jobs for the lazy ass Obamacare recipients,” the authors wrote.
  • Left Troll: These trolls mainly supported Bernie Sanders, derided mainstream Democrats, and focused heavily on racial identity, in addition to sexual and religious identity. The tweets were “clearly trying to divide the Democratic Party and lower voter turnout,” the authors told FiveThirtyEight.
  • News Feed: A bit more mysterious, news feed trolls mostly posed as local news aggregators who linked to legitimate news sources. Some, however, “tweeted about global issues, often with a pro-Russia perspective.”
  • Hashtag Gamer: Gamer trolls used hashtag games—a popular call/response form of tweeting—to drum up interaction from other users. Some tweets were benign, but many “were overtly political, e.g. @LoraGreeen, July 11, 2015, “#WasteAMillionIn3Words Donate to #Hillary.”
  • Fearmonger: These trolls, who were least prevalent in the dataset, spread completely fake news stories, for instance “that salmonella-contaminated turkeys were produced by Koch Foods, a U.S. poultry producer, near the 2015 Thanksgiving holiday.”
But a quick table of the results of the variable, account_category, shows 8 in the dataset.

##   Commercial   Fearmonger HashtagGamer    LeftTroll     NewsFeed 
##       122582        11140       241827       427811       599294 
##   NonEnglish   RightTroll      Unknown 
##       837725       719087        13905

The additional three are Commercial, Non-English, and Unknown. At the very least, we should drop the Non-English tweets, since those use Russian characters and any analysis I do will assume data are in English. I'm also going to keep only a few key variables. Then I'm going to clean up this dataset to remove links, because I don't need those for my analysis - I certainly wouldn't want to follow them to their destination. If I want to free up some memory, I can then remove the large dataset.

reduced <- tweet_data %>%
  select(author,content,publish_date,account_category) %>%
  filter(account_category != "NonEnglish")

## Attaching package: 'qdapRegex'
reduced$content <- rm_url(reduced$content)


Now we have a dataset of 2,135,646 observations of 4 variables. I'm planning on doing some analysis on my own of this dataset - and will of course share what I find - but for now, I thought I'd repeat a technique I've covered on this blog and demonstrate a new one.


tweetwords <- reduced %>%
  unnest_tokens(word, content) %>%
## Joining, by = "word"
wordcounts <- tweetwords %>%
  count(account_category, word, sort = TRUE) %>%

## # A tibble: 6 x 3
##   account_category word          n
##   <chr>            <chr>     <int>
## 1 NewsFeed         news     124586
## 2 RightTroll       trump     95794
## 3 RightTroll       rt        86970
## 4 NewsFeed         sports    47793
## 5 Commercial       workout   42395
## 6 NewsFeed         politics  38204

First, I'll conduct a TF-IDF analysis of the dataset. This code is a repeat from a previous post.

tweet_tfidf <- wordcounts %>%
  bind_tf_idf(word, account_category, n) %>%

tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
## Selecting by tf_idf

But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package, topicmodels. The only issue is that LDA requires a document term matrix. But we can easily convert our wordcounts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that LDA is a random technique, so we set a random number seed, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I'm going to try having it extract 6 topics. We can see how well they line up with the account types.

tweets_dtm <- wordcounts %>%
  cast_dtm(account_category, word, n)

tweets_lda <- LDA(tweets_dtm, k = 6, control = list(seed = 42))
tweet_topics <- tidy(tweets_lda, matrix = "beta")

Now we can pull out the top terms from this analysis, and plot them to see how they lined up.

top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +

Based on these plots, I'd say the topics line up very well with the account categories, showing, in order: news feed, left troll, fear monger, right troll, hash gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.


  1. looks like a typo: read_csv should be read.csv

    1. Hi Stephen -

      Actually, read_csv is a tidyverse function (specifically from the readr package), that creates a tibble. You can learn more about tibbles here:

  2. Using underscore is very not R style. You can see all base and recommended packages do not use that in names. There is a reason: R power users use ESS and the underscore is tied to the <- so it becomes almost impossible to use such names within ESS. the creators of the library must be using R occasionally and not as primary tool.

    1. What is "R style"? Where do you find the list of "recommended packages"? Are all R users who do not use ESS unworthy of the title "R power users"? Have you checked the most downloaded packages in R over the past few years?

    2. This comment has been removed by the author.


  3. The recommended packages are those that are recommended by the R developer team who are mostly statisticians. There is no official style guide, but it is plain obvious that the underscore is not used by the R developer team. neither base nor recommended packages use the underscore

    1. Looks like you have your own comment trolls to conduct analysis on! Might be worth disabling comments in general.

  4. Hi I was wondering if you have undertaken a network analysis of who retweets whom in the network. Analysing the network of trolls do you see any central nodes or behaviours?