Saturday, August 25, 2018

Happy Birthday, Leonard Bernstein

Today's Google Doodle celebrates the 100th birthday of Leonard Bernstein:


This year, multiple performing arts organizations have joined in the celebration by performing some of Bernstein's works. But, given that today is the big day, I feel like I should spend my time sitting and listening to some of my favorites.

If you're not sure where to begin, I highly recommend this list from Vox - 5 of Bernstein's musical masterpieces.

Wednesday, August 22, 2018

Loving the View

A friend of mine started a new job not long ago. Today, I got to check out his new office and the view from the 41st floor of his building. Here's the view looking south/southeast:





The view to the north is pretty nice too:


And hey! I can see my office from here!


We couldn't have asked for a more beautiful day.

Tuesday, August 21, 2018

The Browns Suck - and They Have Math to Prove It

In September of 1999, the Cleveland Browns rebooted their franchise, giving them a default Elo rating of 1300 - and according to FiveThirtyEight, they're right back where they started:
If only the past 20 years of misfortune could be erased that easily. After the team lost its final game of the 2017 season to give the 2008 Detroit Lions some company in the 0-16 club, Cleveland fans held an ironic parade to “celebrate” the team’s anti-accomplishment. But it is perversely impressive to craft a pro football team so dreadful. No other team in the entire history of the NFL has ever suffered through a stretch of 31 losses in 32 tries like the Browns just did.

As noted above, in terms of Elo, Cleveland is the first modern expansion team to tumble back to square one after its first two decades in the league.

A central paradox rests at the core of the Browns’ struggles. Since returning to existence in 1999, Cleveland has enjoyed the most valuable collection of draft picks in the NFL, according to the Approximate Value we’d expect players selected in those slots to generate early in their careers. Yet the Browns have also been — by far — the worst drafting team in the league, in terms of the AV its picks have actually produced relative to those expectations.

Their unifying crime might be a penchant for all-or-nothing, quick-fix gambles. For instance, the Browns are infamous for their quixotic pursuit of the NFL draft’s biggest prize — the Franchise Quarterback™ — and they’ve burned through 29 different primary passers (including seven taken with first-round picks) since 1999 trying to find one.
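For anyone unfamiliar with how Elo works: each game moves the winner's rating up and the loser's down by the same amount, scaled by how surprising the result was. A minimal base-R sketch of the basic update (the K-factor of 20 mirrors what FiveThirtyEight uses for the NFL; their home-field and margin-of-victory adjustments are omitted):

```r
# Basic Elo update: compute the winner's expected score from the rating
# gap, then shift both teams by K times the surprise (actual - expected).
elo_update <- function(r_winner, r_loser, k = 20) {
  expected <- 1 / (1 + 10 ^ ((r_loser - r_winner) / 400))
  shift <- k * (1 - expected)
  c(winner = r_winner + shift, loser = r_loser - shift)
}

elo_update(1300, 1300)  # winner 1310, loser 1290
```

A win over an equal opponent moves each side 10 points, which is how a team can drift all the way back to the 1300 default after enough losing.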

Sunday, August 19, 2018

Statistics Sunday: Using Text Analysis to Become a Better Writer

We all have words we love to use, and that we perhaps use too much. As an example: I have a tendency to use the same transitional statements, to the point that, before I submit a manuscript, I do a "find all" to see how many times I've used some of my favorites - additionally, though, and so on.

I'm sure we all have our own words we use way too often.
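That manual "find all" pass is easy to automate. A minimal base-R sketch (the manuscript string here is just a toy stand-in):

```r
# Count how often a few favorite words appear in a manuscript.
manuscript <- "Additionally, the results held. Though small, the effect was real. Additionally, we saw..."
favorites <- c("additionally", "though")

# Lowercase everything, then split on anything that isn't a letter or apostrophe.
words <- tolower(unlist(strsplit(manuscript, "[^[:alpha:]']+")))
sapply(favorites, function(w) sum(words == w))  # additionally: 2, though: 1
```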


Text analysis can also be used to discover patterns in writing and, for a writer, can help flag when we depend too much on certain words and phrases. For today's demonstration, I read in my (still in-progress) novel - a murder mystery called Killing Mr. Johnson - and ran the same type of text analysis I've been demonstrating in recent posts.

To make things easier, I copied the document into a text file, then used the read_lines and tibble functions to prepare the data for my analysis.

setwd("~/Dropbox/Writing/Killing Mr. Johnson")

library(tidyverse)
KMJ_text <- read_lines('KMJ_full.txt')

KMJ <- tibble(KMJ_text) %>%
  mutate(linenumber = row_number())

I kept my line numbers, which I could use in some future analysis. For now, I'm going to tokenize my data, drop stop words, and examine my most frequently used words.

library(tidytext)
KMJ_words <- KMJ %>%
  unnest_tokens(word, KMJ_text) %>%
  anti_join(stop_words)
## Joining, by = "word"
KMJ_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 75) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()


Fortunately, my top 5 words are the names of the 5 main characters, with the star character at number 1: Emily is named almost 600 times in the book. It's a murder mystery, so I'm not too surprised that words like "body" and "death" are also common. But I know that, in my fiction writing, I often depend on a word type that draws a lot of disdain from authors I admire: adverbs. Not all adverbs, mind you, but specifically (pun intended) the "-ly adverbs."

ly_words <- KMJ_words %>%
  filter(str_detect(word, ".ly")) %>%
  count(word, sort = TRUE)

head(ly_words)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 emily      599
## 2 finally     80
## 3 quickly     60
## 4 emily’s     53
## 5 suddenly    39
## 6 quietly     38

Since my main character is named Emily, her name was accidentally picked up by my string detection. A few other top words in the list aren't actually -ly adverbs either. I'll filter those out, then take a look at what's left.
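A quick base-R check shows why this filter has to be hand-curated: even anchoring the pattern to the end of the word keeps everything that merely ends in -ly, names and nouns included (toy word list below):

```r
words <- c("emily", "finally", "quickly", "family", "reply")

# The original pattern matches "ly" anywhere after one character...
grepl(".ly", words)   # all TRUE, including emily and family

# ...and anchoring to the end of the word doesn't help, because these
# non-adverbs end in -ly too. Hence the manual filter list.
grepl("ly$", words)   # still all TRUE
```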

filter_out <- c("emily", "emily's", "emily’s","family", "reply", "holy")

ly_words <- ly_words %>%
  filter(!word %in% filter_out)

ly_words %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()


I use "finally", "quickly", and "suddenly" far too often. "Quietly" is also up there. I think the reason so many writers hate on adverbs is that they can encourage lazy writing. You might write that someone said something quietly or softly, but is there a better word? Did they whisper? Mutter? Murmur? Hiss? Did someone "move quickly" or did they do something else - run, sprint, dash?

At the same time, sometimes adverbs are necessary. I mean, can I think of a complete sentence that only includes an adverb? Definitely. Still, it might become tedious if I keep depending on the same words multiple times, and when a fiction book (or really any kind of writing) is tedious, we often give up. These results give me some things to think about as I edit.

Still have some big plans on the horizon, including some new statistics videos, a redesigned blog, and more surprises later! Thanks for reading!

Friday, August 17, 2018

Topics and Categories in the Russian Troll Tweets

I decided to return to the analysis I conducted for the IRA tweets dataset. (You can read up on that analysis and R code here.) Specifically, I returned to the LDA results, which looked like they lined up pretty well with the account categories identified by Darren Linvill and Patrick Warren. But with slightly altered code, we can confirm that, or see if there's more to the topics data than meets the eye. (Spoiler alert: there is.)

I reran much of the original code - creating the file, removing non-English tweets and URLs, generating the DTM and conducting the 6-topic LDA. For brevity, I'm not including it in this post, but once again, you can see it here.

I will note that the topics were numbered a bit differently than they were in my previous analysis. Here's the new plot. The results look very similar to before. (LDA is a variational Bayesian method and there is an element of randomness to it, so the results aren't a one-to-one match, but they're very close.)
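If the goal is to make reruns line up exactly, the randomness can be pinned down by seeding the model. A sketch, assuming the LDA was fit with the topicmodels package (the object names here are hypothetical):

```r
# Not run: fixing the seed in the control list makes the fitted topics -
# and their numbering - reproducible across sessions.
# tweet_lda <- topicmodels::LDA(tweet_dtm, k = 6, control = list(seed = 1234))

# The same principle in base R: a fixed seed reproduces a "random" draw.
set.seed(1234)
first <- sample(1:6)
set.seed(1234)
second <- sample(1:6)
identical(first, second)  # TRUE
```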

top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()

Before, when I generated a plot of the LDA results, I asked it to give me the top 15 terms by topic. I'll use the same code, but instead have it give the top topic for each term.

word_topic <- tweet_topics %>%
  group_by(term) %>%
  top_n(1, beta) %>%
  ungroup()

I can then match this dataset up to the original tweetwords dataset, to show which topic each word is most strongly associated with. Because the word variable goes by a different name in each dataset, I need to tell R how to match them.

tweetwords <- tweetwords %>%
  left_join(word_topic, by = c("word" = "term"))

Now we can generate a crosstab, which displays the matchup between LDA topic (1-6) and account category (Commercial, Fearmonger, Hashtag Gamer, Left Troll, News Feed, Right Troll, and Unknown).

cat_by_topic <- table(tweetwords$account_category, tweetwords$topic)
cat_by_topic
##               
##                      1       2       3       4       5       6
##   Commercial     38082   34181   49625  952309   57744   19380
##   Fearmonger      9187    3779   37326    1515    8321    4864
##   HashtagGamer  117517  103628  183204   31739  669976   81803
##   LeftTroll     497796 1106698  647045   94485  395972  348725
##   NewsFeed     2715106  331987  525710   91164  352709  428937
##   RightTroll    910965  498983 1147854  113829  534146 2420880
##   Unknown         7622    5198   12808    1497   11282    4605

This table is a bit hard to read, because it shows raw frequencies, and the total number of words differs across topics and account categories. We can solve that problem by asking for proportions instead. I'll have it generate proportions by column, so we can see the top account category associated with each topic.

options(scipen = 999)
prop.table(cat_by_topic, 2) #column percentages - which category is each topic most associated with
##               
##                          1           2           3           4           5
##   Commercial   0.008863958 0.016398059 0.019060352 0.740210550 0.028443218
##   Fearmonger   0.002138364 0.001812945 0.014336458 0.001177579 0.004098712
##   HashtagGamer 0.027353230 0.049714697 0.070366404 0.024670084 0.330013053
##   LeftTroll    0.115866885 0.530929442 0.248522031 0.073441282 0.195045686
##   NewsFeed     0.631967460 0.159268087 0.201918749 0.070859936 0.173735438
##   RightTroll   0.212036008 0.239383071 0.440876611 0.088476982 0.263106667
##   Unknown      0.001774095 0.002493699 0.004919395 0.001163588 0.005557225
##               
##                          6
##   Commercial   0.005856411
##   Fearmonger   0.001469844
##   HashtagGamer 0.024719917
##   LeftTroll    0.105380646
##   NewsFeed     0.129619781
##   RightTroll   0.731561824
##   Unknown      0.001391578
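Rather than eyeballing each column, the dominant account category per topic can be pulled out with which.max. A base-R sketch on a toy version of the table (two categories, two topics):

```r
# Rows are account categories, columns are topics, cells are word counts.
m <- matrix(c(10, 40, 30, 20), nrow = 2,
            dimnames = list(c("LeftTroll", "RightTroll"), c("1", "2")))

# For each topic (column), find the category with the largest share.
apply(prop.table(m, 2), 2, function(p) names(which.max(p)))
# topic 1: RightTroll, topic 2: LeftTroll
```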

Topic 1 is News Feed, Topic 2 Left Troll, Topic 4 Commercial, and Topic 5 Hashtag Gamer. But look at Topics 3 and 6. For both, the highest percentage is Right Troll. Fearmonger is not most strongly associated with any specific topic. What happens if we instead ask for a proportion table by row, which tells us which topic each category is most associated with?

prop.table(cat_by_topic, 1) #row percentages - which topic is each category most associated with
##               
##                         1          2          3          4          5
##   Commercial   0.03307679 0.02968851 0.04310266 0.82714465 0.05015456
##   Fearmonger   0.14135586 0.05814562 0.57431684 0.02331056 0.12803114
##   HashtagGamer 0.09893111 0.08723872 0.15422939 0.02671932 0.56401601
##   LeftTroll    0.16106145 0.35807114 0.20935083 0.03057054 0.12811638
##   NewsFeed     0.61073827 0.07467744 0.11825366 0.02050651 0.07933866
##   RightTroll   0.16190164 0.08868197 0.20400284 0.02023031 0.09493132
##   Unknown      0.17720636 0.12085000 0.29777736 0.03480424 0.26229889
##               
##                         6
##   Commercial   0.01683284
##   Fearmonger   0.07483998
##   HashtagGamer 0.06886545
##   LeftTroll    0.11282966
##   NewsFeed     0.09648546
##   RightTroll   0.43025192
##   Unknown      0.10706315

Based on these results, Fearmonger now seems closest to Topic 3 and Right Troll to Topic 6. But Right Troll also shows up on Topics 3 (20%) and 1 (16%), and Left Trolls show up in those topics at nearly identical proportions. It appears, then, that political trolls show strong similarity in topics with Fearmongers (stirring things up) and News Feed ("informing") trolls. Unknown isn't the top contributor to any topic, but it aligns with Topics 3 (showing elements of Fearmongering) and 5 (showing elements of Hashtag Gaming). Let's focus on five categories.

categories <- c("Fearmonger", "HashtagGamer", "LeftTroll", "NewsFeed", "RightTroll")

politics_fear_hash <- tweetwords %>%
  filter(account_category %in% categories)

PFH_counts <- politics_fear_hash %>%
  count(account_category, topic, word, sort = TRUE) %>%
  ungroup()

For now, let's define our topics like this: 1 = News Feed, 2 = Left Troll, 3 = Fearmonger, 4 = Commercial, 5 = Hashtag Gamer, and 6 = Right Troll. We'll ask R to go through our PFH dataset and flag when account category and topic match and when they mismatch. Then we can look at these terms.

PFH_counts$match <- ifelse(PFH_counts$account_category == "NewsFeed" & PFH_counts$topic == 1, "Match",
                    ifelse(PFH_counts$account_category == "LeftTroll" & PFH_counts$topic == 2, "Match",
                    ifelse(PFH_counts$account_category == "Fearmonger" & PFH_counts$topic == 3, "Match",
                    ifelse(PFH_counts$account_category == "HashtagGamer" & PFH_counts$topic == 5, "Match",
                    ifelse(PFH_counts$account_category == "RightTroll" & PFH_counts$topic == 6, "Match",
                           "NonMatch")))))

top_PFH <- PFH_counts %>%
  group_by(account_category, match) %>%
  top_n(15, n) %>%
  ungroup() %>%
  arrange(account_category, -n)

top_PFH %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = factor(match))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~account_category, scales = "free") +
  coord_flip()

Red indicates a match and blue indicates a mismatch. So when Fearmongers talk about food poisoning or Koch Farms, it's a match, but when they talk about Hillary Clinton or the police, it's a mismatch. Terms like "MAGA" and "CNN" are matches for Right Trolls but "news" and "love" are mismatches. Left Trolls show a match when tweeting about "Black Lives Matter" or "police" but a mismatch when tweeting about "Trump" or "America." An interesting observation is that Trump is a mismatch for every topic it's displayed under on the plot. (Now, realdonaldtrump, Trump's Twitter handle, is a match for Right Trolls.) So where does that term, and associated terms like "Donald", belong?

tweetwords %>%
  filter(word %in% c("donald", "trump"))
## # A tibble: 157,844 x 7
##    author publish_date     account_category id         word  topic    beta
##    <chr>  <chr>            <chr>            <chr>      <chr> <int>   <dbl>
##  1 10_GOP 10/1/2017 22:43  RightTroll       C:/Users/~ trump     3 0.0183 
##  2 10_GOP 10/1/2017 23:52  RightTroll       C:/Users/~ trump     3 0.0183 
##  3 10_GOP 10/1/2017 2:47   RightTroll       C:/Users/~ dona~     3 0.00236
##  4 10_GOP 10/1/2017 2:47   RightTroll       C:/Users/~ trump     3 0.0183 
##  5 10_GOP 10/1/2017 3:47   RightTroll       C:/Users/~ trump     3 0.0183 
##  6 10_GOP 10/10/2017 20:57 RightTroll       C:/Users/~ trump     3 0.0183 
##  7 10_GOP 10/10/2017 23:42 RightTroll       C:/Users/~ trump     3 0.0183 
##  8 10_GOP 10/11/2017 22:14 RightTroll       C:/Users/~ trump     3 0.0183 
##  9 10_GOP 10/11/2017 22:20 RightTroll       C:/Users/~ trump     3 0.0183 
## 10 10_GOP 10/12/2017 0:38  RightTroll       C:/Users/~ trump     3 0.0183 
## # ... with 157,834 more rows

These terms apparently were sorted into Topic 3, which we've called Fearmonger. Once again, this highlights the similarity between political trolls and fearmongering trolls in this dataset.

Diagramming Sentences

Confession: I never learned to diagram sentences.

Second confession: I'm really upset about this fact.

Solution: This great article taught me how to do it!

Last confession: This GIF came up when I searched for "Happy nerd."

Thursday, August 16, 2018

Finding Strengths

As part of my job in our newly reorganized department, my boss had many of us take the Clifton StrengthsFinder, a measure developed by Donald O. Clifton and Gallup. This measure, developed through semi-structured interviews and subsequent psychometric research, identifies an individual's top 5 strengths from a list of 34. Here are my results:


The book that comes along with the assessment describes the 34 themes in detail and gives very basic information on the measure's development. But for the psychometrically inclined, you can read a detailed technical report of the measure's evidence for reliability and validity here. In general, the measure shows acceptable reliability and construct validity, with moderate to strong correlations with the Big Five personality traits. My themes, specifically, relate to my high Agreeableness and Openness to Experience on the Big Five. (For comparison, here are my Myers-Briggs results.)

The report also talks about how the themes relate to leadership potential. What I'm best at, according to these results, are Relationship Building and Strategic Thinking.

And, of course, I always enjoy taking tests and measures, especially if I think they'll tell me something about myself.