Thursday, August 30, 2018

Today's Reading List

I'm wrapping up a few department requests as we head into the end of our fiscal year, so these tabs will have to wait until later:

Back to work.

Monday, August 27, 2018

Statistics Sunday: Visualizing Regression

Statistics Sunday: Visualizing Regression I had some much needed downtime this weekend, after an exhausting week, along with some self-care - Saturday I had a one-hour deep tissue massage, which left me a little bruised but much more relaxed, and Sunday I spent a few hours in the salon chair having my color touched up, which left me much blonder. Which is why I'm a little late with my Statistics Sunday post, but today, I'm introducing another recently discovered r package: rpart. Short for "recursive partitioning," this package creates decision trees for classification, regression, and survival analyses. Today, I'm going to demonstrate using the rpart package for visualizing regression.

To demonstrate this technique, I'm using my 2017 reading dataset. A reader requested I make this dataset available, which I have done - you can download it here. This post describes the data in more detail, but the short description is that this dataset contains the 53 books I read last year, with information on book genre, page length, how long it took me to read it, and two ratings: my own rating and the average rating on Goodreads. Look for another dataset soon, containing my 2018 reading data; I made a goal of 60 books and I've already read 50, meaning lots more data than last year. I'm thinking of going for broke and bumping my reading goal up to 100, because apparently 60 books is not enough of a challenge for me now that I spend so much downtime reading.

First, I'll load my dataset, then conduct the basic linear model I demonstrated in the post linked above.

setwd("~/R")
library(tidyverse)
## Warning: Duplicated column names deduplicated: 'Author' => 'Author_1' [13]
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Title = col_character(),
##   Author = col_character(),
##   G_Rating = col_double(),
##   Started = col_character(),
##   Finished = col_character()
## )
## See spec(...) for full column specifications.
colnames(books)[13] <- "Author_Gender"
myrating<-lm(My_Rating ~ Pages + Read_Time + Author_Gender + Fiction + Fantasy + Math_Stats + YA_Fic, data=books)
summary(myrating)
## 
## Call:
## lm(formula = My_Rating ~ Pages + Read_Time + Author_Gender + 
##     Fiction + Fantasy + Math_Stats + YA_Fic, data = books)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73120 -0.34382 -0.00461  0.24665  1.49932 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.5861211  0.2683464  13.364   <2e-16 ***
## Pages          0.0019578  0.0007435   2.633   0.0116 *  
## Read_Time     -0.0244168  0.0204186  -1.196   0.2380    
## Author_Gender -0.1285178  0.1666207  -0.771   0.4445    
## Fiction        0.1052319  0.2202581   0.478   0.6351    
## Fantasy        0.5234710  0.2097386   2.496   0.0163 *  
## Math_Stats    -0.2558926  0.2122238  -1.206   0.2342    
## YA_Fic        -0.7330553  0.2684623  -2.731   0.0090 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4952 on 45 degrees of freedom
## Multiple R-squared:  0.4624, Adjusted R-squared:  0.3788 
## F-statistic:  5.53 on 7 and 45 DF,  p-value: 0.0001233

These analyses show that I give higher ratings for books that are longer and Fantasy genre, and lower ratings to books that are Young Adult Fiction. Now let's see what happens if I run this same linear model through rpart. Note that this is a slightly different technique, looking for cuts that differentiate outcomes, so it will find slightly different results.

library(rpart)
tree1 <- rpart(My_Rating ~ Pages + Read_Time + Author_Gender + Fiction + Fantasy + Math_Stats + YA_Fic, method = "anova", data=books)
printcp(tree1)
## 
## Regression tree:
## rpart(formula = My_Rating ~ Pages + Read_Time + Author_Gender + 
##     Fiction + Fantasy + Math_Stats + YA_Fic, data = books, method = "anova")
## 
## Variables actually used in tree construction:
## [1] Fantasy    Math_Stats Pages     
## 
## Root node error: 20.528/53 = 0.38733
## 
## n= 53 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.305836      0   1.00000 1.03609 0.17531
## 2 0.092743      1   0.69416 0.76907 0.12258
## 3 0.022539      2   0.60142 0.71698 0.11053
## 4 0.010000      3   0.57888 0.74908 0.11644

These results differ somewhat. Pages is still a significant variable, as is Fantasy. But now Math_Stats (indicating books that are about mathematics or statistics, one of my top genres from last year) also is. These are the variables used by the analysis to construct my regression tree. If we look at the full summary -

summary(tree1)
## Call:
## rpart(formula = My_Rating ~ Pages + Read_Time + Author_Gender + 
##     Fiction + Fantasy + Math_Stats + YA_Fic, data = books, method = "anova")
##   n= 53 
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.30583640      0 1.0000000 1.0360856 0.1753070
## 2 0.09274251      1 0.6941636 0.7690729 0.1225752
## 3 0.02253938      2 0.6014211 0.7169813 0.1105294
## 4 0.01000000      3 0.5788817 0.7490758 0.1164386
## 
## Variable importance
##      Pages    Fantasy    Fiction     YA_Fic Math_Stats  Read_Time 
##         62         18          8          6          4          3 
## 
## Node number 1: 53 observations,    complexity param=0.3058364
##   mean=4.09434, MSE=0.3873265 
##   left son=2 (9 obs) right son=3 (44 obs)
##   Primary splits:
##       Pages         < 185 to the left,  improve=0.30583640, (0 missing)
##       Fiction       < 0.5 to the left,  improve=0.24974560, (0 missing)
##       Fantasy       < 0.5 to the left,  improve=0.20761810, (0 missing)
##       Math_Stats    < 0.5 to the right, improve=0.20371790, (0 missing)
##       Author_Gender < 0.5 to the right, improve=0.02705187, (0 missing)
## 
## Node number 2: 9 observations
##   mean=3.333333, MSE=0.2222222 
## 
## Node number 3: 44 observations,    complexity param=0.09274251
##   mean=4.25, MSE=0.2784091 
##   left son=6 (26 obs) right son=7 (18 obs)
##   Primary splits:
##       Fantasy    < 0.5 to the left,  improve=0.15541600, (0 missing)
##       Fiction    < 0.5 to the left,  improve=0.12827990, (0 missing)
##       Math_Stats < 0.5 to the right, improve=0.10487750, (0 missing)
##       Pages      < 391 to the left,  improve=0.05344995, (0 missing)
##       Read_Time  < 7.5 to the right, improve=0.04512078, (0 missing)
##   Surrogate splits:
##       Fiction   < 0.5 to the left,  agree=0.773, adj=0.444, (0 split)
##       YA_Fic    < 0.5 to the left,  agree=0.727, adj=0.333, (0 split)
##       Pages     < 370 to the left,  agree=0.682, adj=0.222, (0 split)
##       Read_Time < 3.5 to the right, agree=0.659, adj=0.167, (0 split)
## 
## Node number 6: 26 observations,    complexity param=0.02253938
##   mean=4.076923, MSE=0.2248521 
##   left son=12 (7 obs) right son=13 (19 obs)
##   Primary splits:
##       Math_Stats    < 0.5 to the right, improve=0.079145230, (0 missing)
##       Pages         < 364 to the left,  improve=0.042105260, (0 missing)
##       Fiction       < 0.5 to the left,  improve=0.042105260, (0 missing)
##       Read_Time     < 5.5 to the left,  improve=0.016447370, (0 missing)
##       Author_Gender < 0.5 to the right, improve=0.001480263, (0 missing)
## 
## Node number 7: 18 observations
##   mean=4.5, MSE=0.25 
## 
## Node number 12: 7 observations
##   mean=3.857143, MSE=0.4081633 
## 
## Node number 13: 19 observations
##   mean=4.157895, MSE=0.132964

we see that Fiction, YA_Fic, and Read_Time were also significant variables. The problem is that there is multicollinearity between Fiction, Fantasy, Math_Stats, and YA_Fic. All Fantasy and YA_Fic books are Fiction, while all Math_Stats books are not Fiction. And all YA_Fic books I read were Fantasy. This is probably why the tree didn't use Fiction or YA_Fic. I'm not completely clear on why Read_Time didn't end up in the regression tree, but it may be because my read time was pretty constant among the different splits and didn't add any new information to the tree. If I were presenting these results somewhere other than my blog, I'd probably want to do some follow-up analyses to confirm this fact.

Now the fun part: let's plot our regression tree:

plot(tree1, uniform = TRUE, main = "Regression Tree for My Goodreads Ratings")
text(tree1, use.n = TRUE, all = TRUE, cex = 0.8)
This tree shows that, before taking into account anything, my average book rating was 4.09. If a book is shorter than 185 pages (9 books in my dataset), it's average rating was 3.33. For longer books (44), the average rating was 4.25. But there's more to it than that. For non-Fantasy books (26), the average was 4.08, while the Fantasy (18 books) average was 4.5. If the book was Math_Stats (7), I gave it an average of 3.86, and if it was not Math_Stats (19), the average was 4.16. (Unfortunately, R cuts off the bottom part of the plot.)

While this plot is great for a quick visualization, I can make a nicer looking plot (which doesn't cut off the bottom text) as a PostScript file.

post(tree1, file = "mytree.ps",
        title = "Regression Tree for Rating")

I converted that to a PDF, which you can view here.

Hope you enjoyed this post! Have any readers used this technique before? Any thoughts or applications you'd like to share? (Self-promotion highly encouraged!)

Saturday, August 25, 2018

Happy Birthday, Leonard Bernstein

Today's Google Doodle celebrates the 100th birthday of Leonard Bernstein:


This year, multiple performing arts organizations have joined in the celebration by performing some of Bernstein's works. But, given that today is the big day, I feel like I should spend my time sitting and listening to some of my favorites.

If you're not sure where to begin, I highly recommend this list from Vox - 5 of Bernstein's musical masterpieces.

Wednesday, August 22, 2018

Loving the View

A friend of mine started a new job not long ago. Today, I got to check out his new office and the view from the 41st floor of his building. Here's the view looking South/Southeast:





The view to the north is pretty nice too:


And hey! I can see my office from here!


We couldn't have asked for a more beautiful day.

Tuesday, August 21, 2018

The Browns Suck - and They Have Math to Prove It

In September of 1999, the Cleveland Browns rebooted their franchise, giving them a default Elo rating of 1300 - and according to FiveThirtyEight, they're right back where they started:
If only the past 20 years of misfortune could be erased that easily. After the team lost its final game of the 2017 season to give the 2008 Detroit Lions some company in the 0-16 club, Cleveland fans held an ironic parade to “celebrate” the team’s anti-accomplishment. But it is perversely impressive to craft a pro football team so dreadful. No other team in the entire history of the NFL has ever suffered through a stretch of 31 losses in 32 tries like the Browns just did.

As noted above, in terms of Elo, Cleveland is the first modern expansion team to tumble back to square one after its first two decades in the league.

A central paradox rests at the core of the Browns’ struggles. Since returning to existence in 1999, Cleveland has enjoyed the most valuable collection of draft picks in the NFL, according to the Approximate Value we’d expect players selected in those slots to generate early in their careers. Yet the Browns have also been — by far — the worst drafting team in the league, in terms of the AV its picks have actually produced relative to those expectations.

Their unifying crime might be a penchant for all-or-nothing, quick-fix gambles. For instance, the Browns are infamous for their quixotic pursuit of the NFL draft’s biggest prize — the Franchise Quarterback™ — and they’ve burned through 29 different primary passers (including seven taken with first-round picks) since 1999 trying to find one.

Sunday, August 19, 2018

Statistics Sunday: Using Text Analysis to Become a Better Writer

Using Text Analysis to Become a Better Writer We all have words we love to use, and that we perhaps use too much. As an example: I have a tendency to use the same transitional statements, to the point that, before I submit a manuscript, I do a find all to see how many times I've used some of my favorites, e.g., additionally, though, and so on.

I'm sure we all have our own words we use way too often.


Text analysis can also be used to discover patterns in writing, and for a writer, may be helpful in discovering when we depend too much on certain words and phrases. For today's demonstration, I read in my (still in-progress) novel - a murder mystery called Killing Mr. Johnson - and did the same type of text analysis I've been demonstrating in recent posts.

To make things easier, I copied the document into a text file, and used the read_lines and tibble functions to prepare data for my analysis.

setwd("~/Dropbox/Writing/Killing Mr. Johnson")

library(tidyverse)
KMJ_text <- read_lines('KMJ_full.txt')

KMJ <- tibble(KMJ_text) %>%
  mutate(linenumber = row_number())

I kept my line numbers, which I could use in some future analysis. For now, I'm going to tokenize my data, drop stop words, and examine my most frequently used words.

library(tidytext)
KMJ_words <- KMJ %>%
  unnest_tokens(word, KMJ_text) %>%
  anti_join(stop_words)
## Joining, by = "word"
KMJ_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 75) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()


Fortunately, my top 5 words are the names of the 5 main characters, with the star character at number 1: Emily is named almost 600 times in the book. It's a murder mystery, so I'm not too surprised that words like "body" and "death" are also common. But I know that, in my fiction writing, I often depend on a word type that draws a lot of disdain from authors I admire: adverbs. Not all adverbs, mind you, but specifically (pun intended) the "-ly adverbs."

ly_words <- KMJ_words %>%
  filter(str_detect(word, ".ly")) %>%
  count(word, sort = TRUE)

head(ly_words)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 emily      599
## 2 finally     80
## 3 quickly     60
## 4 emily’s     53
## 5 suddenly    39
## 6 quietly     38

Since my main character is named Emily, she was accidentally picked up by my string detect function. A few other top words also pop up in the list that aren't actually -ly adverbs. I'll filter those out then take a look at what I have left.

filter_out <- c("emily", "emily's", "emily’s","family", "reply", "holy")

ly_words <- ly_words %>%
  filter(!word %in% filter_out)

ly_words %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()


I use "finally", "quickly", and "suddenly" far too often. "Quietly" is also up there. I think the reason so many writers hate on adverbs is because it can encourage lazy writing. You might write that someone said something quietly or softly, but is there a better word? Did they whisper? Mutter? Murmur? Hiss? Did someone "move quickly" or did they do something else - run, sprint, dash?

At the same time, sometimes adverbs are necessary. I mean, can I think of a complete sentence that only includes an adverb? Definitely. Still, it might become tedious if I keep depending on the same words multiple times, and when a fiction book (or really any kind of writing) is tedious, we often give up. These results give me some things to think about as I edit.

Still have some big plans on the horizon, including some new statistics videos, a redesigned blog, and more surprises later! Thanks for reading!

Friday, August 17, 2018

Topics and Categories in the Russian Troll Tweets

Topics and Categories in the Russian Troll Tweets I decided to return to the analysis I conducted for the IRA tweets dataset. (You can read up on that analysis and R code here.) Specifically, I returned to the LDA results, which looked like they lined up pretty well with the account categories identified by Darren Linvill and Patrick Warren. But with slightly altered code, we can confirm that or see if there's more to the topics data than meets the eye. (Spoiler alert: There is more than meets the eye.)

I reran much of the original code - creating the file, removing non-English tweets and URLs, generating the DTM and conducting the 6-topic LDA. For brevity, I'm not including it in this post, but once again, you can see it here.

I will note that the topics were numbered a bit differently than they were in my previous analysis. Here's the new plot. The results look very similar to before. (LDA is a variational Bayesian method and there is an element of randomness to it, so the results aren't a one-to-one match, but they're very close.)

top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()

Before, when I generated a plot of the LDA results, I asked it to give me the top 15 terms by topic. I'll use the same code, but instead have it give the top topic for each term.

word_topic <- tweet_topics %>%
  group_by(term) %>%
  top_n(1, beta) %>%
  ungroup()

I can then match this dataset up to the original tweetwords dataset, to show which topic each word is most strongly associated with. Because the word variable is known by two different variable names in my datasets, I need to tell R how to match.

tweetwords <- tweetwords %>%
  left_join(word_topic, by = c("word" = "term"))

Now we can generate a crosstable, which displays the matchup between LDA topic (1-6) and account category (Commercial, Fearmonger, Hashtag Gamer, Left Troll, News Feed, Right Troll, and Unknown).

cat_by_topic <- table(tweetwords$account_category, tweetwords$topic)
cat_by_topic
##               
##                      1       2       3       4       5       6
##   Commercial     38082   34181   49625  952309   57744   19380
##   Fearmonger      9187    3779   37326    1515    8321    4864
##   HashtagGamer  117517  103628  183204   31739  669976   81803
##   LeftTroll     497796 1106698  647045   94485  395972  348725
##   NewsFeed     2715106  331987  525710   91164  352709  428937
##   RightTroll    910965  498983 1147854  113829  534146 2420880
##   Unknown         7622    5198   12808    1497   11282    4605

This table is a bit hard to read, because it's frequencies, and the total number of words for each topic and account category differ. But we can solve that problem by asking instead for proportions. I'll have it generate proportions by column, so we can see the top account category associated with each topic.

options(scipen = 999)
prop.table(cat_by_topic, 2) #column percentages - which topic is each category most associated with
##               
##                          1           2           3           4           5
##   Commercial   0.008863958 0.016398059 0.019060352 0.740210550 0.028443218
##   Fearmonger   0.002138364 0.001812945 0.014336458 0.001177579 0.004098712
##   HashtagGamer 0.027353230 0.049714697 0.070366404 0.024670084 0.330013053
##   LeftTroll    0.115866885 0.530929442 0.248522031 0.073441282 0.195045686
##   NewsFeed     0.631967460 0.159268087 0.201918749 0.070859936 0.173735438
##   RightTroll   0.212036008 0.239383071 0.440876611 0.088476982 0.263106667
##   Unknown      0.001774095 0.002493699 0.004919395 0.001163588 0.005557225
##               
##                          6
##   Commercial   0.005856411
##   Fearmonger   0.001469844
##   HashtagGamer 0.024719917
##   LeftTroll    0.105380646
##   NewsFeed     0.129619781
##   RightTroll   0.731561824
##   Unknown      0.001391578

Category 1 is News Feed, Category 2 Left Troll, Category 4 Commercial, and Category 5 Hashtag Gamer. But look at Categories 3 and 6. For both, the highest percentage is Right Troll. Fearmonger is not most strongly associated with any specific topic. What happens if we instead ask for a proportion table by row, which tells us which category each topic most associated with?

prop.table(cat_by_topic, 1) #row percentages - which category is each topic most associated with
##               
##                         1          2          3          4          5
##   Commercial   0.03307679 0.02968851 0.04310266 0.82714465 0.05015456
##   Fearmonger   0.14135586 0.05814562 0.57431684 0.02331056 0.12803114
##   HashtagGamer 0.09893111 0.08723872 0.15422939 0.02671932 0.56401601
##   LeftTroll    0.16106145 0.35807114 0.20935083 0.03057054 0.12811638
##   NewsFeed     0.61073827 0.07467744 0.11825366 0.02050651 0.07933866
##   RightTroll   0.16190164 0.08868197 0.20400284 0.02023031 0.09493132
##   Unknown      0.17720636 0.12085000 0.29777736 0.03480424 0.26229889
##               
##                         6
##   Commercial   0.01683284
##   Fearmonger   0.07483998
##   HashtagGamer 0.06886545
##   LeftTroll    0.11282966
##   NewsFeed     0.09648546
##   RightTroll   0.43025192
##   Unknown      0.10706315

Based on these results, Fearmonger now seems closest to Category 3 and Right Troll with Category 6. But Right Troll also shows up on Categories 3 (20%) and 1 (16%). Left Trolls show up in these categories at nearly exact proportions. It appears, then, that political trolls show strong similarity in topics with Fearmongers (stirring things up) and News Feed ("informing") trolls. Unknown isn't the top contributer to any topic, but it aligns with Categories 3 (showing elements of Fearmongering) and 5 (showing elements of Hashtag Gaming). Let's focus in on 5 categories.

categories <- c("Fearmonger", "HashtagGamer", "LeftTroll", "NewsFeed", "RightTroll")

politics_fear_hash <- tweetwords %>%
  filter(account_category %in% categories)

PFH_counts <- politics_fear_hash %>%
  count(account_category, topic, word, sort = TRUE) %>%
  ungroup()

For now, let's define our topics like this: 1 = News Feed, 2 = Left Troll, 3 = Fearmonger, 4 = Commercial, 5 = Hashtag Gamer, and 6 = Right Troll. We'll ask R to go through our PFH dataset and tell us when account category topic matches and when it mismatches. Then we can look at these terms.

PFH_counts$match <- ifelse(PFH_counts$account_category == "NewsFeed" & PFH_counts$topic == 1,PFH_counts$match <- "Match",
                           ifelse(PFH_counts$account_category == "LeftTroll" & PFH_counts$topic == 2,PFH_counts$match <- "Match",
                                  ifelse(PFH_counts$account_category == "Fearmonger" & PFH_counts$topic == 3,PFH_counts$match <- "Match",
                                         ifelse(PFH_counts$account_category == "HashtagGamer" & PFH_counts$topic == 5,PFH_counts$match <- "Match",
                                                ifelse(PFH_counts$account_category == "RightTroll" & PFH_counts$topic == 6,PFH_counts$match <- "Match",
                                                       PFH_counts$match <- "NonMatch")))))

top_PFH <- PFH_counts %>%
  group_by(account_category, match) %>%
  top_n(15, n) %>%
  ungroup() %>%
  arrange(account_category, -n)

top_PFH %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = factor(match))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~account_category, scales = "free") +
  coord_flip()

Red indicates a match and blue indicates a mismatch. So when Fearmongers talk about food poisoning or Koch Farms, it's a match, but when they talk about Hillary Clinton or the police, it's a mismatch. Terms like "MAGA" and "CNN" are matches for Right Trolls but "news" and "love" are mismatches. Left Trolls show a match when tweeting about "Black Lives Matter" or "police" but a mismatch when tweeting about "Trump" or "America." An interesting observation is that Trump is a mismatch for every topic it's displayed under on the plot. (Now, realdonaldtrump, Trump's Twitter handle, is a match for Right Trolls.) So where does that term, and associated terms like "Donald", belong?

tweetwords %>%
  filter(word %in% c("donald", "trump"))
## # A tibble: 157,844 x 7
##    author publish_date     account_category id         word  topic    beta
##    <chr>  <chr>            <chr>            <chr>      <chr> <int>   <dbl>
##  1 10_GOP 10/1/2017 22:43  RightTroll       C:/Users/~ trump     3 0.0183 
##  2 10_GOP 10/1/2017 23:52  RightTroll       C:/Users/~ trump     3 0.0183 
##  3 10_GOP 10/1/2017 2:47   RightTroll       C:/Users/~ dona~     3 0.00236
##  4 10_GOP 10/1/2017 2:47   RightTroll       C:/Users/~ trump     3 0.0183 
##  5 10_GOP 10/1/2017 3:47   RightTroll       C:/Users/~ trump     3 0.0183 
##  6 10_GOP 10/10/2017 20:57 RightTroll       C:/Users/~ trump     3 0.0183 
##  7 10_GOP 10/10/2017 23:42 RightTroll       C:/Users/~ trump     3 0.0183 
##  8 10_GOP 10/11/2017 22:14 RightTroll       C:/Users/~ trump     3 0.0183 
##  9 10_GOP 10/11/2017 22:20 RightTroll       C:/Users/~ trump     3 0.0183 
## 10 10_GOP 10/12/2017 0:38  RightTroll       C:/Users/~ trump     3 0.0183 
## # ... with 157,834 more rows

These terms apparently were sorted into Category 3, which we've called Fearmongers. Once again, this highlights the similarity between political trolls and fearmongering trolls in this dataset.

Diagramming Sentences

Confession: I never learned to diagram sentences.

Second confession: I'm really upset about this fact.

Solution: This great article taught me how to do it!

Last confession: This GIF came up when I searched for "Happy nerd"

Thursday, August 16, 2018

Finding Strengths

As part of my job and newly reorganized departments, my boss had many us take the Clifton StrengthsFinder, a measure developed by Donald O. Clifton and Gallup. This measure, developed through semi-structured interviews and subsequent psychometric research, identifies an individual's top 5 strengths from a list of 34. Here are my results:


The book that comes along with the assessment describes the 34 themes in detail and gives very basic information on the measure's development. But for the psychometricially inclined, you can read a detailed technical report of the measure's evidence for reliability and validity here. In general, the measure shows acceptable reliability and construct validity. There are moderate to strong correlations with the Big Five Personality Traits. My themes, specifically, relate to my high Agreeableness and Openness to Experience on the Big Five. (For comparison, here are my Myers-Briggs results.)

The report also talks about how the themes relate to leadership potential. What I'm best at, according to these results, are Relationship Building and Strategic Thinking.

And, of course, I always enjoy taking tests and measures, especially if I think they'll tell me something about myself.

Sunday, August 12, 2018

Statistics Sunday: Getting Started with the Russian Tweet Dataset

IRA Tweet Data You may have heard that two researchers at Clemson University analyzed almost 3 millions tweets from the Internet Research Agency (IRA) - a "Russian troll factory". In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

files <- c("IRAhandle_tweets_1.csv",
           "IRAhandle_tweets_2.csv",
           "IRAhandle_tweets_3.csv",
           "IRAhandle_tweets_4.csv",
           "IRAhandle_tweets_5.csv",
           "IRAhandle_tweets_6.csv",
           "IRAhandle_tweets_7.csv",
           "IRAhandle_tweets_8.csv",
           "IRAhandle_tweets_9.csv")
my_files <- paste0("~/Downloads/russian-troll-tweets-master/",files)

each_file <- function(file) {
  tweet <- read_csv(file) }

library(tidyverse)
tweet_data <- NULL
for (file in my_files) {
  temp <- each_file(file)
  temp$id <- sub(".csv", "", file)
  tweet_data <- rbind(tweet_data, temp)
}

Note that this is a large file, with 2,973,371 observations of 16 variables. Let's do some cleaning of this dataset first. The researchers, Darren Linvill and Patrick Warren, identified 5 majors types of trolls:
  • Right Troll: These Trump-supporting trolls voiced right-leaning, populist messages, but “rarely broadcast traditionally important Republican themes, such as taxes, abortion, and regulation, but often sent divisive messages about mainstream and moderate Republicans…They routinely denigrated the Democratic Party, e.g. @LeroyLovesUSA, January 20, 2017, “#ThanksObama We're FINALLY evicting Obama. Now Donald Trump will bring back jobs for the lazy ass Obamacare recipients,” the authors wrote.
  • Left Troll: These trolls mainly supported Bernie Sanders, derided mainstream Democrats, and focused heavily on racial identity, in addition to sexual and religious identity. The tweets were “clearly trying to divide the Democratic Party and lower voter turnout,” the authors told FiveThirtyEight.
  • News Feed: A bit more mysterious, news feed trolls mostly posed as local news aggregators who linked to legitimate news sources. Some, however, “tweeted about global issues, often with a pro-Russia perspective.”
  • Hashtag Gamer: Gamer trolls used hashtag games—a popular call/response form of tweeting—to drum up interaction from other users. Some tweets were benign, but many “were overtly political, e.g. @LoraGreeen, July 11, 2015, “#WasteAMillionIn3Words Donate to #Hillary.”
  • Fearmonger: These trolls, who were least prevalent in the dataset, spread completely fake news stories, for instance “that salmonella-contaminated turkeys were produced by Koch Foods, a U.S. poultry producer, near the 2015 Thanksgiving holiday.”
But a quick table of the results of the variable, account_category, shows 8 in the dataset.

table(tweet_data$account_category)
## 
##   Commercial   Fearmonger HashtagGamer    LeftTroll     NewsFeed 
##       122582        11140       241827       427811       599294 
##   NonEnglish   RightTroll      Unknown 
##       837725       719087        13905

The additional three are Commercial, Non-English, and Unknown. At the very least, we should drop the Non-English tweets, since those use Russian characters and any analysis I do will assume data are in English. I'm also going to keep only a few key variables. Then I'm going to clean up this dataset to remove links, because I don't need those for my analysis - I certainly wouldn't want to follow them to their destination. If I want to free up some memory, I can then remove the large dataset.

reduced <- tweet_data %>%
  select(author,content,publish_date,account_category) %>%
  filter(account_category != "NonEnglish")

library(qdapRegex)
## 
## Attaching package: 'qdapRegex'
reduced$content <- rm_url(reduced$content)

rm(tweet_data)

Now we have a dataset of 2,135,646 observations of 4 variables. I'm planning on doing some analysis on my own of this dataset - and will of course share what I find - but for now, I thought I'd repeat a technique I've covered on this blog and demonstrate a new one.

library(tidytext)

tweetwords <- reduced %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)
## Joining, by = "word"
wordcounts <- tweetwords %>%
  count(account_category, word, sort = TRUE) %>%
  ungroup()

head(wordcounts)
## # A tibble: 6 x 3
##   account_category word          n
##   <chr>            <chr>     <int>
## 1 NewsFeed         news     124586
## 2 RightTroll       trump     95794
## 3 RightTroll       rt        86970
## 4 NewsFeed         sports    47793
## 5 Commercial       workout   42395
## 6 NewsFeed         politics  38204

First, I'll conduct a TF-IDF analysis of the dataset. This code is a repeat from a previous post.

tweet_tfidf <- wordcounts %>%
  bind_tf_idf(word, account_category, n) %>%
  arrange(desc(tf_idf))

tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf


But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package, topicmodels. The only issue is that LDA requires a document term matrix. But we can easily convert our wordcounts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that LDA is a random technique, so we set a random number seed, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I'm going to try having it extract 6 topics. We can see how well they line up with the account types.

tweets_dtm <- wordcounts %>%
  cast_dtm(account_category, word, n)

library(topicmodels)
tweets_lda <- LDA(tweets_dtm, k = 6, control = list(seed = 42))
tweet_topics <- tidy(tweets_lda, matrix = "beta")

Now we can pull out the top terms from this analysis, and plot them to see how they lined up.

top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()


Based on these plots, I'd say the topics line up very well with the account categories, showing, in order: news feed, left troll, fear monger, right troll, hash gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.

Wednesday, August 8, 2018

Cowboy Bebop 20 Years Later

I've been asked before what made me fall in love with corgis. Did I idolize British royalty? Did I have one or know someone who had one as a child? Did I just think their stumpy legs and enormous ears were adorable? (Well, yeah.)

But the real reason? I started watching Cowboy Bebop in college and fell in love with Ein, the adorable corgi on the show. Yes, I love corgis because I loved this anime.


For anyone who has seen this anime, all I have to do is say I love Cowboy Bebop and they furiously nod in agreement, perhaps singing some of their favorite songs from the show or doing their best Spike Spiegel impression. For anyone who hasn't seen it, it's difficult to describe the show because it defies categorization. But this article does a wonderful job of summing it up:
Cowboy Bebop is its own thing. You could call it a space western. That’s, however, like saying BeyoncĂ© is a good dancer. It’s true. But it’s not the whole picture. Not even close. Cowboy Bebop is a mashup of noir films, spaghetti westerns, urban thrillers, Kurosawa samurai films, classic westerns and sci-fi space adventures.

Again, though, all that genre bending is just part of the equation. The show resonates so deeply because it’s a mirror in which you can see yourself, and how we all wrestle with life. This is what makes Cowboy Bebop great art. It’s a beautifully complex, aesthetically striking meditation on how we deal with love, loss, luck and that inescapable question: Why should I give a fuck?
It's been 20 years since Cowboy Bebop premiered, and those 26 episodes and 1 movie remain relevant and influential today. Time to rewatch the show, I think.

Tuesday, August 7, 2018

Statistics Sunday: Highlighting a Subset of Data in ggplot2

Highlighting Specific Cases in ggplot2 Here's my belated Statistics Sunday post, using a cool technique I just learned about: gghighlight. This R package works with ggplot2 to highlight a subset of data. To demonstrate, I'll use a dataset I analyzed for a previous post about my 2017 reading habits. [Side note: My reading goal for this year is 60 books, and I'm already at 43! I may have to increase my goal at some point.]

setwd("~/R")
library(tidyverse)
books<-read_csv("2017_books.csv", col_names = TRUE)
## Warning: Duplicated column names deduplicated: 'Author' => 'Author_1' [13]
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Title = col_character(),
##   Author = col_character(),
##   G_Rating = col_double(),
##   Started = col_character(),
##   Finished = col_character()
## )
## See spec(...) for full column specifications.

One analysis I conducted with this dataset was to look at the correlation between book length (number of pages) and read time (number of days it took to read the book). We can also generate a scatterplot to visualize this relationship.

cor.test(books$Pages, books$Read_Time)
## 
## 	Pearson's product-moment correlation
## 
## data:  books$Pages and books$Read_Time
## t = 3.1396, df = 51, p-value = 0.002812
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1482981 0.6067498
## sample estimates:
##       cor 
## 0.4024597
scatter <- ggplot(books, aes(Pages, Read_Time)) +
  geom_point(size = 3) +
  theme_classic() +
  labs(title = "Relationship Between Reading Time and Page Length") +
  ylab("Read Time (in days)") +
  xlab("Number of Pages") +
  theme(legend.position="none",plot.title=element_text(hjust=0.5))

There's a significant positive correlation here, meaning the longer books take more days to read. It's a moderate correlation, and there are certainly other variables that may explain why a book took longer to read. For instance, nonfiction books may take longer. Books read in October or November (while I was gearing up for and participating in NaNoWriMo, respectively) may also take longer, since I had less spare time to read. I can conduct regressions and other analyses to examine which variables impact read time, but one of the most important parts of sharing results is creating good data visualizations. How can I show the impact these other variables have on read time in an understandable and visually appealing way?

gghighlight will let me draw attention to different parts of the plot. For example, I can ask gghighlight to draw attention to books that took longer than a certain amount of time to read, and I can even ask it to label those books.

library(gghighlight)
scatter + gghighlight(Read_Time > 14) +
  geom_label(aes(label = Title),
             hjust = 1,
             vjust = 1,
             fill = "blue",
             color = "white",
             alpha = 0.5)


Here, the gghighlight function identifies the subset (books that took more than 2 weeks to read) and labels those books with the Title variable. Three of the four books with long read time values are non-fiction, and one was read for a course I took, so reading followed a set schedule. But the fourth is a fiction book, which took over 20 days to read. Let's see how month impacts reading time, by highlighting books read in November. To do that, I'll need to alter my dataset somewhat. The dataset contains a starting date and finish date, which were read in as characters. I need to convert those to dates and pull out the month variable to create my indicator.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
books$Started <- mdy(books$Started)
books$Start_Month <- month(books$Started)
books$Month <- ifelse(books$Start_Month > 10 & books$Start_Month < 12, books$Month <- 1,
                      books$Month <- 0)
scatter + gghighlight(books$Month == 1) +
  geom_label(aes(label = Title), hjust = 1, vjust = 1, fill = "blue", color = "white", alpha = 0.5)


The book with the longest read time was, in fact, read during November, when I was spending most of my time writing.

Monday, August 6, 2018

Fun Weekend

I had a full, fun weekend, hence no Statistics Sunday post.

Saturday, I went to the Bristol Renaissance Faire - the theme was magic and monsters, so I dressed in shades of blue (water) and perched my knitted Loch Ness Monster on my shoulder. Only a handful of people got my full costume but I still got lots of compliments on and reactions to Nessie.


We held out almost 6 hours in the 95+ degree heat - LOTS OF WATER. While it may have been silly to go to Faire on such a hot day, I was happy at least that it wasn't raining. And between sunblock and my parasol, I made it home with no sunburns!

After, we dropped by Mars Cheese Castle to cool off and stock up on snacks, meat, and cheese, and I picked up a couple bottles of beer to take home.

Sunday I had some friends over in the afternoon. Their son is really into trains, and a train line runs right behind my house. The back gate opens up right onto the tracks, so we have a great view. After chilling on the grass for a shockingly long time waiting for trains (they usually run almost constantly on Sundays but not yesterday for some reason), we got lucky and had a Metra followed immediately by a freight train.

Last night, Endless Summer wrapped up in La Grange just a few blocks from me, so I watched the fireworks display from the comfort of my place.

Posting may continue to be sporadic for a bit longer, but stay tuned.

Friday, August 3, 2018

Stats Note: Making Sense of Open-Ended Responses with Text Analysis

Using Text Mining on Open Ended Items Good survey design is both art and science. You have to think about how people will read and process your questions, and what sorts of responses might result from different question forms and wording. One of the big rules I follow in survey design is that you don't assess any of your most important topics with an open-ended item. Most people skip them, because they're more work than selecting options from a list, and people who do complete may give you terse, unhelpful, or gibberish answers.

When I was working on a large survey among people with spinal cord injuries/disorders, the survey designers decided to assess the exact details of the respondent's spinal injury or disorder with an open-ended question so that people could describe it "in their own words." As you might have guessed, the data were mostly meaningless and as such unusable. But many hypotheses and research questions dealt with looking at differences between people who sustained an injury and those with a disorder affecting their spinal cord. We couldn't even begin to test or examine any of those in the course of our study. It was unbelievably frustrating, because we could have gotten the information we needed with a single question and some categorical responses. We could have then asked people to supplement their answer with the open-ended question. Most would skip it, but we'd have the data we needed to answer our questions.

In my job, I inherited the results of a large survey involving a variety of dental professional populations. Once again, certain items that could have been addressed with a few close-ended questions were instead open-ended questions and not many of the responses are useful. The item that inspired this blog post assessed the types of products dental assistants are involved in purchasing, which can include anything from office supplies to personal protective equipment to large equipment (X-ray machine, etc.). Everyone had a different way of articulating what they were involved with purchasing, some simply saying "all dental supplies" or "everything," while others gave more specific details. In total, about 400 people responded to this item, which is a lot of data to dig through. But thanks to my new experience with text mining in R, I was able to try to make some sense of responses. Mind you, a lot of the responses can't be understood, but it's at least something.

I decided I could probably begin to categorize responses by slicing the responses up into the individual words. I can generate counts overall as well as by respondent, and use this to begin examining and categorizing the data. Finally, if I need some additional context, I can ask R to give me all responses that contain a certain word.

Because I don't own these data, I can't share them on my blog. But I can demonstrate with different text data to show what I did. To do this, I'll use one of my all-time favorite books, The Wonderful Wizard of Oz by L. Frank Baum, which is available through Project Gutenberg. The gutenbergr package will let me download the full-text.

library(gutenbergr)
gutenberg_works(title == "The Wonderful Wizard of Oz")
## # A tibble: 1 x 8
##   gutenberg_id title   author   gutenberg_autho~ language gutenberg_books~
##          <int> <chr>   <chr>               <int> <chr>    <chr>           
## 1           55 The Wo~ Baum, L~               42 en       Children's Lite~
## # ... with 2 more variables: rights <chr>, has_text <lgl>

Now we have the gutenberg_id, which will be used to download the fulltext into a data frame.

WOz <- gutenberg_download(gutenberg_id = 55)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
WOz$line <- row.names(WOz)

To start exploring the data, I'll need to use the tidytext package to unnest tokens (words) and remove stop words that don't tell us much.

library(tidyverse)
library(tidytext)
tidy_oz <- WOz %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"

Now I can begin to explore my "responses" by generating overall counts and even counts by respondent. (For the present example, I'll use my line number variable instead of the respondent id number I used in the actual dataset.)

word_counts <- tidy_oz %>%
  count(word, sort = TRUE)
resp_counts <- tidy_oz %>%
  count(line, word, sort = TRUE) %>%
  ungroup()

When I look at my overall counts, I see that the most frequently used words are the main characters of the book. So from this, I could generate a category "characters." Words like "Kansas", "Emerald" and "City" are also common, and I could create a category called "places." Finally, "heart" and "brains" are common - a category could be created to encompass what the characters are seeking. Obviously, this might not be true for every instance. It could be that someone "didn't have the heart" to tell someone something. I can try to separate out those instances by looking at the original text.

heart <- WOz[grep("heart",WOz$text),]
head(heart)
## # A tibble: 6 x 3
##   gutenberg_id text                                                  line 
##          <int> <chr>                                                 <chr>
## 1           55 happiness to childish hearts than all other human cr~ 48   
## 2           55 the heartaches and nightmares are left out.           62   
## 3           55 and press her hand upon her heart whenever Dorothy's~ 110  
## 4           55 strange people.  Her tears seemed to grieve the kind~ 375  
## 5           55 Dorothy ate a hearty supper and was waited upon by t~ 517  
## 6           55 She ate a hearty breakfast, and watched a wee Munchk~ 545

Unfortunately, this got me any line containing a word starting with "heart" like "hearty" and "heartache." Let's rewrite that command to request an exact match.

heart <- WOz[grep("\\bheart\\b",WOz$text),]
head(heart)
## # A tibble: 6 x 3
##   gutenberg_id text                                                  line 
##          <int> <chr>                                                 <chr>
## 1           55 and press her hand upon her heart whenever Dorothy's~ 110  
## 2           55 "\"Do you suppose Oz could give me a heart?\""        935  
## 3           55 brains, and a heart also; so, having tried them both~ 975  
## 4           55 "rather have a heart.\""                              976  
## 5           55 grew to love her with all my heart.  She, on her par~ 992  
## 6           55 now no heart, so that I lost all my love for the Mun~ 1024

Now I can use this additional context to determine which instances of "heart" are about the Tin Man's quest.

It seems like these tools would be most useful for open-ended responses that are still very categorical, but it could help with determining themes for more narrative data.