Sunday, August 12, 2018

Statistics Sunday: Getting Started with the Russian Tweet Dataset

IRA Tweet Data You may have heard that two researchers at Clemson University analyzed almost 3 millions tweets from the Internet Research Agency (IRA) - a "Russian troll factory". In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

files <- c("IRAhandle_tweets_1.csv",
my_files <- paste0("~/Downloads/russian-troll-tweets-master/",files)

each_file <- function(file) {
  tweet <- read_csv(file) }

tweet_data <- NULL
for (file in my_files) {
  temp <- each_file(file)
  temp$id <- sub(".csv", "", file)
  tweet_data <- rbind(tweet_data, temp)

Note that this is a large file, with 2,973,371 observations of 16 variables. Let's do some cleaning of this dataset first. The researchers, Darren Linvill and Patrick Warren, identified 5 majors types of trolls:
  • Right Troll: These Trump-supporting trolls voiced right-leaning, populist messages, but “rarely broadcast traditionally important Republican themes, such as taxes, abortion, and regulation, but often sent divisive messages about mainstream and moderate Republicans…They routinely denigrated the Democratic Party, e.g. @LeroyLovesUSA, January 20, 2017, “#ThanksObama We're FINALLY evicting Obama. Now Donald Trump will bring back jobs for the lazy ass Obamacare recipients,” the authors wrote.
  • Left Troll: These trolls mainly supported Bernie Sanders, derided mainstream Democrats, and focused heavily on racial identity, in addition to sexual and religious identity. The tweets were “clearly trying to divide the Democratic Party and lower voter turnout,” the authors told FiveThirtyEight.
  • News Feed: A bit more mysterious, news feed trolls mostly posed as local news aggregators who linked to legitimate news sources. Some, however, “tweeted about global issues, often with a pro-Russia perspective.”
  • Hashtag Gamer: Gamer trolls used hashtag games—a popular call/response form of tweeting—to drum up interaction from other users. Some tweets were benign, but many “were overtly political, e.g. @LoraGreeen, July 11, 2015, “#WasteAMillionIn3Words Donate to #Hillary.”
  • Fearmonger: These trolls, who were least prevalent in the dataset, spread completely fake news stories, for instance “that salmonella-contaminated turkeys were produced by Koch Foods, a U.S. poultry producer, near the 2015 Thanksgiving holiday.”
But a quick table of the results of the variable, account_category, shows 8 in the dataset.

##   Commercial   Fearmonger HashtagGamer    LeftTroll     NewsFeed 
##       122582        11140       241827       427811       599294 
##   NonEnglish   RightTroll      Unknown 
##       837725       719087        13905

The additional three are Commercial, Non-English, and Unknown. At the very least, we should drop the Non-English tweets, since those use Russian characters and any analysis I do will assume data are in English. I'm also going to keep only a few key variables. Then I'm going to clean up this dataset to remove links, because I don't need those for my analysis - I certainly wouldn't want to follow them to their destination. If I want to free up some memory, I can then remove the large dataset.

reduced <- tweet_data %>%
  select(author,content,publish_date,account_category) %>%
  filter(account_category != "NonEnglish")

## Attaching package: 'qdapRegex'
reduced$content <- rm_url(reduced$content)


Now we have a dataset of 2,135,646 observations of 4 variables. I'm planning on doing some analysis on my own of this dataset - and will of course share what I find - but for now, I thought I'd repeat a technique I've covered on this blog and demonstrate a new one.


tweetwords <- reduced %>%
  unnest_tokens(word, content) %>%
## Joining, by = "word"
wordcounts <- tweetwords %>%
  count(account_category, word, sort = TRUE) %>%

## # A tibble: 6 x 3
##   account_category word          n
##   <chr>            <chr>     <int>
## 1 NewsFeed         news     124586
## 2 RightTroll       trump     95794
## 3 RightTroll       rt        86970
## 4 NewsFeed         sports    47793
## 5 Commercial       workout   42395
## 6 NewsFeed         politics  38204

First, I'll conduct a TF-IDF analysis of the dataset. This code is a repeat from a previous post.

tweet_tfidf <- wordcounts %>%
  bind_tf_idf(word, account_category, n) %>%

tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
## Selecting by tf_idf

But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package, topicmodels. The only issue is that LDA requires a document term matrix. But we can easily convert our wordcounts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that LDA is a random technique, so we set a random number seed, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I'm going to try having it extract 6 topics. We can see how well they line up with the account types.

tweets_dtm <- wordcounts %>%
  cast_dtm(account_category, word, n)

tweets_lda <- LDA(tweets_dtm, k = 6, control = list(seed = 42))
tweet_topics <- tidy(tweets_lda, matrix = "beta")

Now we can pull out the top terms from this analysis, and plot them to see how they lined up.

top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +

Based on these plots, I'd say the topics line up very well with the account categories, showing, in order: news feed, left troll, fear monger, right troll, hash gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.

Wednesday, August 8, 2018

Cowboy Bebop 20 Years Later

I've been asked before what made me fall in love with corgis. Did I idolize British royalty? Did I have one or know someone who had one as a child? Did I just think their stumpy legs and enormous ears were adorable? (Well, yeah.)

But the real reason? I started watching Cowboy Bebop in college and fell in love with Ein, the adorable corgi on the show. Yes, I love corgis because I loved this anime.

For anyone who has seen this anime, all I have to do is say I love Cowboy Bebop and they furiously nod in agreement, perhaps singing some of their favorite songs from the show or doing their best Spike Spiegel impression. For anyone who hasn't seen it, it's difficult to describe the show because it defies categorization. But this article does a wonderful job of summing it up:
Cowboy Bebop is its own thing. You could call it a space western. That’s, however, like saying BeyoncĂ© is a good dancer. It’s true. But it’s not the whole picture. Not even close. Cowboy Bebop is a mashup of noir films, spaghetti westerns, urban thrillers, Kurosawa samurai films, classic westerns and sci-fi space adventures.

Again, though, all that genre bending is just part of the equation. The show resonates so deeply because it’s a mirror in which you can see yourself, and how we all wrestle with life. This is what makes Cowboy Bebop great art. It’s a beautifully complex, aesthetically striking meditation on how we deal with love, loss, luck and that inescapable question: Why should I give a fuck?
It's been 20 years since Cowboy Bebop premiered, and those 26 episodes and 1 movie remain relevant and influential today. Time to rewatch the show, I think.

Tuesday, August 7, 2018

Statistics Sunday: Highlighting a Subset of Data in ggplot2

Highlighting Specific Cases in ggplot2 Here's my belated Statistics Sunday post, using a cool technique I just learned about: gghighlight. This R package works with ggplot2 to highlight a subset of data. To demonstrate, I'll use a dataset I analyzed for a previous post about my 2017 reading habits. [Side note: My reading goal for this year is 60 books, and I'm already at 43! I may have to increase my goal at some point.]

books<-read_csv("2017_books.csv", col_names = TRUE)
## Warning: Duplicated column names deduplicated: 'Author' => 'Author_1' [13]
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Title = col_character(),
##   Author = col_character(),
##   G_Rating = col_double(),
##   Started = col_character(),
##   Finished = col_character()
## )
## See spec(...) for full column specifications.

One analysis I conducted with this dataset was to look at the correlation between book length (number of pages) and read time (number of days it took to read the book). We can also generate a scatterplot to visualize this relationship.

cor.test(books$Pages, books$Read_Time)
## 	Pearson's product-moment correlation
## data:  books$Pages and books$Read_Time
## t = 3.1396, df = 51, p-value = 0.002812
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1482981 0.6067498
## sample estimates:
##       cor 
## 0.4024597
scatter <- ggplot(books, aes(Pages, Read_Time)) +
  geom_point(size = 3) +
  theme_classic() +
  labs(title = "Relationship Between Reading Time and Page Length") +
  ylab("Read Time (in days)") +
  xlab("Number of Pages") +

There's a significant positive correlation here, meaning the longer books take more days to read. It's a moderate correlation, and there are certainly other variables that may explain why a book took longer to read. For instance, nonfiction books may take longer. Books read in October or November (while I was gearing up for and participating in NaNoWriMo, respectively) may also take longer, since I had less spare time to read. I can conduct regressions and other analyses to examine which variables impact read time, but one of the most important parts of sharing results is creating good data visualizations. How can I show the impact these other variables have on read time in an understandable and visually appealing way?

gghighlight will let me draw attention to different parts of the plot. For example, I can ask gghighlight to draw attention to books that took longer than a certain amount of time to read, and I can even ask it to label those books.

scatter + gghighlight(Read_Time > 14) +
  geom_label(aes(label = Title),
             hjust = 1,
             vjust = 1,
             fill = "blue",
             color = "white",
             alpha = 0.5)

Here, the gghighlight function identifies the subset (books that took more than 2 weeks to read) and labels those books with the Title variable. Three of the four books with long read time values are non-fiction, and one was read for a course I took, so reading followed a set schedule. But the fourth is a fiction book, which took over 20 days to read. Let's see how month impacts reading time, by highlighting books read in November. To do that, I'll need to alter my dataset somewhat. The dataset contains a starting date and finish date, which were read in as characters. I need to convert those to dates and pull out the month variable to create my indicator.

## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##     date
books$Started <- mdy(books$Started)
books$Start_Month <- month(books$Started)
books$Month <- ifelse(books$Start_Month > 10 & books$Start_Month < 12, books$Month <- 1,
                      books$Month <- 0)
scatter + gghighlight(books$Month == 1) +
  geom_label(aes(label = Title), hjust = 1, vjust = 1, fill = "blue", color = "white", alpha = 0.5)

The book with the longest read time was, in fact, read during November, when I was spending most of my time writing.

Monday, August 6, 2018

Fun Weekend

I had a full, fun weekend, hence no Statistics Sunday post.

Saturday, I went to the Bristol Renaissance Faire - the theme was magic and monsters, so I dressed in shades of blue (water) and perched my knitted Loch Ness Monster on my shoulder. Only a handful of people got my full costume but I still got lots of compliments on and reactions to Nessie.

We held out almost 6 hours in the 95+ degree heat - LOTS OF WATER. While it may have been silly to go to Faire on such a hot day, I was happy at least that it wasn't raining. And between sunblock and my parasol, I made it home with no sunburns!

After, we dropped by Mars Cheese Castle to cool off and stock up on snacks, meat, and cheese, and I picked up a couple bottles of beer to take home.

Sunday I had some friends over in the afternoon. Their son is really into trains, and a train line runs right behind my house. The back gate opens up right onto the tracks, so we have a great view. After chilling on the grass for a shockingly long time waiting for trains (they usually run almost constantly on Sundays but not yesterday for some reason), we got lucky and had a Metra followed immediately by a freight train.

Last night, Endless Summer wrapped up in La Grange just a few blocks from me, so I watched the fireworks display from the comfort of my place.

Posting may continue to be sporadic for a bit longer, but stay tuned.

Friday, August 3, 2018

Stats Note: Making Sense of Open-Ended Responses with Text Analysis

Using Text Mining on Open Ended Items Good survey design is both art and science. You have to think about how people will read and process your questions, and what sorts of responses might result from different question forms and wording. One of the big rules I follow in survey design is that you don't assess any of your most important topics with an open-ended item. Most people skip them, because they're more work than selecting options from a list, and people who do complete may give you terse, unhelpful, or gibberish answers.

When I was working on a large survey among people with spinal cord injuries/disorders, the survey designers decided to assess the exact details of the respondent's spinal injury or disorder with an open-ended question so that people could describe it "in their own words." As you might have guessed, the data were mostly meaningless and as such unusable. But many hypotheses and research questions dealt with looking at differences between people who sustained an injury and those with a disorder affecting their spinal cord. We couldn't even begin to test or examine any of those in the course of our study. It was unbelievably frustrating, because we could have gotten the information we needed with a single question and some categorical responses. We could have then asked people to supplement their answer with the open-ended question. Most would skip it, but we'd have the data we needed to answer our questions.

In my job, I inherited the results of a large survey involving a variety of dental professional populations. Once again, certain items that could have been addressed with a few close-ended questions were instead open-ended questions and not many of the responses are useful. The item that inspired this blog post assessed the types of products dental assistants are involved in purchasing, which can include anything from office supplies to personal protective equipment to large equipment (X-ray machine, etc.). Everyone had a different way of articulating what they were involved with purchasing, some simply saying "all dental supplies" or "everything," while others gave more specific details. In total, about 400 people responded to this item, which is a lot of data to dig through. But thanks to my new experience with text mining in R, I was able to try to make some sense of responses. Mind you, a lot of the responses can't be understood, but it's at least something.

I decided I could probably begin to categorize responses by slicing the responses up into the individual words. I can generate counts overall as well as by respondent, and use this to begin examining and categorizing the data. Finally, if I need some additional context, I can ask R to give me all responses that contain a certain word.

Because I don't own these data, I can't share them on my blog. But I can demonstrate with different text data to show what I did. To do this, I'll use one of my all-time favorite books, The Wonderful Wizard of Oz by L. Frank Baum, which is available through Project Gutenberg. The gutenbergr package will let me download the full-text.

gutenberg_works(title == "The Wonderful Wizard of Oz")
## # A tibble: 1 x 8
##   gutenberg_id title   author   gutenberg_autho~ language gutenberg_books~
##          <int> <chr>   <chr>               <int> <chr>    <chr>           
## 1           55 The Wo~ Baum, L~               42 en       Children's Lite~
## # ... with 2 more variables: rights <chr>, has_text <lgl>

Now we have the gutenberg_id, which will be used to download the fulltext into a data frame.

WOz <- gutenberg_download(gutenberg_id = 55)
## Determining mirror for Project Gutenberg from
## Using mirror
WOz$line <- row.names(WOz)

To start exploring the data, I'll need to use the tidytext package to unnest tokens (words) and remove stop words that don't tell us much.

tidy_oz <- WOz %>%
  unnest_tokens(word, text) %>%
## Joining, by = "word"

Now I can begin to explore my "responses" by generating overall counts and even counts by respondent. (For the present example, I'll use my line number variable instead of the respondent id number I used in the actual dataset.)

word_counts <- tidy_oz %>%
  count(word, sort = TRUE)
resp_counts <- tidy_oz %>%
  count(line, word, sort = TRUE) %>%

When I look at my overall counts, I see that the most frequently used words are the main characters of the book. So from this, I could generate a category "characters." Words like "Kansas", "Emerald" and "City" are also common, and I could create a category called "places." Finally, "heart" and "brains" are common - a category could be created to encompass what the characters are seeking. Obviously, this might not be true for every instance. It could be that someone "didn't have the heart" to tell someone something. I can try to separate out those instances by looking at the original text.

heart <- WOz[grep("heart",WOz$text),]
## # A tibble: 6 x 3
##   gutenberg_id text                                                  line 
##          <int> <chr>                                                 <chr>
## 1           55 happiness to childish hearts than all other human cr~ 48   
## 2           55 the heartaches and nightmares are left out.           62   
## 3           55 and press her hand upon her heart whenever Dorothy's~ 110  
## 4           55 strange people.  Her tears seemed to grieve the kind~ 375  
## 5           55 Dorothy ate a hearty supper and was waited upon by t~ 517  
## 6           55 She ate a hearty breakfast, and watched a wee Munchk~ 545

Unfortunately, this got me any line containing a word starting with "heart" like "hearty" and "heartache." Let's rewrite that command to request an exact match.

heart <- WOz[grep("\\bheart\\b",WOz$text),]
## # A tibble: 6 x 3
##   gutenberg_id text                                                  line 
##          <int> <chr>                                                 <chr>
## 1           55 and press her hand upon her heart whenever Dorothy's~ 110  
## 2           55 "\"Do you suppose Oz could give me a heart?\""        935  
## 3           55 brains, and a heart also; so, having tried them both~ 975  
## 4           55 "rather have a heart.\""                              976  
## 5           55 grew to love her with all my heart.  She, on her par~ 992  
## 6           55 now no heart, so that I lost all my love for the Mun~ 1024

Now I can use this additional context to determine which instances of "heart" are about the Tin Man's quest.

It seems like these tools would be most useful for open-ended responses that are still very categorical, but it could help with determining themes for more narrative data.

Thursday, August 2, 2018

How Land is Used in the U.S.

Via FiveThirtyEight, this great article in Bloomberg breaks down visually how we use land in the United States:

As you scroll through, you'll see the map change to highlight different uses, and some of them lump the different categories together in a redesigned US map, so you can see just how much space they take up.

Monday, July 30, 2018

Feeling Honored

Awesome news first thing on a Monday! My paper, "Effect of the environment on participation in spinal cord injuries/disorders: The mediating impact of resilience, grief, and self-efficacy," published in Rehabilitation Psychology last year was awarded the Harold Yuker Award for Research Excellence:

This paper was the result of a huge survey, overseen by my post-doc mentor, Sherri LaVela, and worked on by many of my amazing VA colleagues. The paper itself uses latent variable path analysis to examine how resilience, grief, and self-efficacy among individuals with spinal cord injuries/disorders mediates to the effect of environmental barriers on ability to participate. The message of the paper was that, while improving environmental barriers is key to increasing participation, by intervening to increase resilience and self-efficacy, and decrease feelings of grief/loss over the injury/disorder, we can impact participation, even if we can't directly intervene to improve all of the environmental barriers.

So cool to get recognition that one of my favorite and most personally meaningful papers was also viewed as meaningful and important to experts in the field.