I reran much of the original code: creating the file, removing non-English tweets and URLs, generating the document-term matrix (DTM), and conducting the 6-topic LDA. For brevity, I'm not including it in this post, but once again, you can see it here.
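For reference, if you want your own re-run to line up more closely from one run to the next, you can fix the random seed when fitting the model. Here's a minimal sketch using topicmodels and tidytext; I'm assuming the document-term matrix is called tweet_dtm (your object name may differ):

library(topicmodels)
library(tidytext)

# Fit the 6-topic LDA with a fixed seed so the topic numbering is reproducible
tweet_lda <- LDA(tweet_dtm, k = 6, control = list(seed = 42))

# One row per topic-term pair, with beta = the per-topic word probability
tweet_topics <- tidy(tweet_lda, matrix = "beta")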
I will note that the topics came out numbered a bit differently than they were in my previous analysis, but the results look very similar to before. (LDA is fit with a variational Bayesian algorithm, which has an element of randomness to it, so the results aren't a one-to-one match, but they're very close.) Here's the new plot.
top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()
Next, I pull out the single topic each word is most strongly associated with (the topic with the highest beta for that term):

word_topic <- tweet_topics %>%
  group_by(term) %>%
  top_n(1, beta) %>%
  ungroup()
I can then match this dataset up to the original tweetwords dataset to show which topic each word is most strongly associated with. Because the word variable goes by two different names in my datasets (word in tweetwords, term in word_topic), I need to tell R how to match them.
tweetwords <- tweetwords %>% left_join(word_topic, by = c("word" = "term"))
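One quick sanity check after a left join like this is to see whether any words failed to find a topic; anything that wasn't in word_topic comes back with an NA. A minimal sketch:

# Words with no topic assignment after the join (NA topic and beta)
tweetwords %>%
  filter(is.na(topic)) %>%
  count(word, sort = TRUE)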
Now we can generate a crosstab, which displays the matchup between LDA topic (1-6) and account category (Commercial, Fearmonger, Hashtag Gamer, Left Troll, News Feed, Right Troll, and Unknown).
cat_by_topic <- table(tweetwords$account_category, tweetwords$topic)
cat_by_topic
##               
##                       1       2       3      4      5       6
##   Commercial      38082   34181   49625 952309  57744   19380
##   Fearmonger       9187    3779   37326   1515   8321    4864
##   HashtagGamer   117517  103628  183204  31739 669976   81803
##   LeftTroll      497796 1106698  647045  94485 395972  348725
##   NewsFeed      2715106  331987  525710  91164 352709  428937
##   RightTroll     910965  498983 1147854 113829 534146 2420880
##   Unknown          7622    5198   12808   1497  11282    4605
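One thing that jumps out is how lopsided the totals are. Here's a quick look, using nothing but base R on the table we just built:

colSums(cat_by_topic)  # total words assigned to each topic
rowSums(cat_by_topic)  # total words in each account category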
This table is a bit hard to read, because it shows raw frequencies, and the total number of words differs across topics and account categories. We can solve that problem by asking instead for proportions. I'll generate proportions by column, so we can see the top account category associated with each topic.
options(scipen = 999)
prop.table(cat_by_topic, 2) # column proportions - which account category is each topic most associated with
##               
##                          1           2           3           4           5
##   Commercial   0.008863958 0.016398059 0.019060352 0.740210550 0.028443218
##   Fearmonger   0.002138364 0.001812945 0.014336458 0.001177579 0.004098712
##   HashtagGamer 0.027353230 0.049714697 0.070366404 0.024670084 0.330013053
##   LeftTroll    0.115866885 0.530929442 0.248522031 0.073441282 0.195045686
##   NewsFeed     0.631967460 0.159268087 0.201918749 0.070859936 0.173735438
##   RightTroll   0.212036008 0.239383071 0.440876611 0.088476982 0.263106667
##   Unknown      0.001774095 0.002493699 0.004919395 0.001163588 0.005557225
##               
##                          6
##   Commercial   0.005856411
##   Fearmonger   0.001469844
##   HashtagGamer 0.024719917
##   LeftTroll    0.105380646
##   NewsFeed     0.129619781
##   RightTroll   0.731561824
##   Unknown      0.001391578
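Rather than scanning the columns by eye, we can pull the top account category for each topic straight from the proportion table. This is just a quick sketch using base R's apply() and which.max():

col_props <- prop.table(cat_by_topic, 2)

# For each topic (column), report the account category with the largest share
apply(col_props, 2, function(p) rownames(col_props)[which.max(p)])

(Swapping the margin to 1 gives the same kind of summary by account category, which we'll get to in a moment.)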
Topic 1 is News Feed, Topic 2 is Left Troll, Topic 4 is Commercial, and Topic 5 is Hashtag Gamer. But look at Topics 3 and 6: for both, the highest proportion belongs to Right Troll, and Fearmonger is not the top category for any topic. What happens if we instead ask for a proportion table by row, which tells us which topic each account category is most associated with?
prop.table(cat_by_topic, 1) # row proportions - which topic is each account category most associated with
##               
##                        1          2          3          4          5
##   Commercial   0.03307679 0.02968851 0.04310266 0.82714465 0.05015456
##   Fearmonger   0.14135586 0.05814562 0.57431684 0.02331056 0.12803114
##   HashtagGamer 0.09893111 0.08723872 0.15422939 0.02671932 0.56401601
##   LeftTroll    0.16106145 0.35807114 0.20935083 0.03057054 0.12811638
##   NewsFeed     0.61073827 0.07467744 0.11825366 0.02050651 0.07933866
##   RightTroll   0.16190164 0.08868197 0.20400284 0.02023031 0.09493132
##   Unknown      0.17720636 0.12085000 0.29777736 0.03480424 0.26229889
##               
##                        6
##   Commercial   0.01683284
##   Fearmonger   0.07483998
##   HashtagGamer 0.06886545
##   LeftTroll    0.11282966
##   NewsFeed     0.09648546
##   RightTroll   0.43025192
##   Unknown      0.10706315
Based on these results, Fearmonger now seems closest to Topic 3 and Right Troll to Topic 6. But Right Troll also shows up on Topics 3 (20%) and 1 (16%), and Left Troll shows up in those same topics at nearly identical proportions (21% and 16%). It appears, then, that the political trolls show strong topical similarity to the Fearmongers (stirring things up) and the News Feed ("informing") trolls. Unknown isn't the top contributor to any topic, but it aligns most with Topics 3 (showing elements of fearmongering) and 5 (showing elements of hashtag gaming). Let's focus in on 5 account categories.
categories <- c("Fearmonger", "HashtagGamer", "LeftTroll", "NewsFeed", "RightTroll")

politics_fear_hash <- tweetwords %>%
  filter(account_category %in% categories)

PFH_counts <- politics_fear_hash %>%
  count(account_category, topic, word, sort = TRUE) %>%
  ungroup()
For now, let's define our topics like this: 1 = News Feed, 2 = Left Troll, 3 = Fearmonger, 4 = Commercial, 5 = Hashtag Gamer, and 6 = Right Troll. We'll ask R to go through our PFH dataset and flag when account category and topic match and when they mismatch. Then we can look at these terms.
PFH_counts$match <- ifelse(PFH_counts$account_category == "NewsFeed" & PFH_counts$topic == 1, "Match",
                    ifelse(PFH_counts$account_category == "LeftTroll" & PFH_counts$topic == 2, "Match",
                    ifelse(PFH_counts$account_category == "Fearmonger" & PFH_counts$topic == 3, "Match",
                    ifelse(PFH_counts$account_category == "HashtagGamer" & PFH_counts$topic == 5, "Match",
                    ifelse(PFH_counts$account_category == "RightTroll" & PFH_counts$topic == 6, "Match",
                           "NonMatch")))))

top_PFH <- PFH_counts %>%
  group_by(account_category, match) %>%
  top_n(15, n) %>%
  ungroup() %>%
  arrange(account_category, -n)

top_PFH %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = factor(match))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~account_category, scales = "free") +
  coord_flip()
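As an aside, if you're already in the tidyverse, dplyr's case_when() expresses the same labeling a bit more readably. This is just an alternative sketch of the step above, not a change in logic:

PFH_counts <- PFH_counts %>%
  mutate(match = case_when(
    account_category == "NewsFeed"     & topic == 1 ~ "Match",
    account_category == "LeftTroll"    & topic == 2 ~ "Match",
    account_category == "Fearmonger"   & topic == 3 ~ "Match",
    account_category == "HashtagGamer" & topic == 5 ~ "Match",
    account_category == "RightTroll"   & topic == 6 ~ "Match",
    TRUE                               ~ "NonMatch"
  ))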
Let's check where the words "donald" and "trump" ended up:

tweetwords %>% filter(word %in% c("donald", "trump"))
## # A tibble: 157,844 x 7
##    author publish_date     account_category id         word  topic    beta
##    <chr>  <chr>            <chr>            <chr>      <chr> <int>   <dbl>
##  1 10_GOP 10/1/2017 22:43  RightTroll       C:/Users/~ trump     3 0.0183
##  2 10_GOP 10/1/2017 23:52  RightTroll       C:/Users/~ trump     3 0.0183
##  3 10_GOP 10/1/2017 2:47   RightTroll       C:/Users/~ dona~     3 0.00236
##  4 10_GOP 10/1/2017 2:47   RightTroll       C:/Users/~ trump     3 0.0183
##  5 10_GOP 10/1/2017 3:47   RightTroll       C:/Users/~ trump     3 0.0183
##  6 10_GOP 10/10/2017 20:57 RightTroll       C:/Users/~ trump     3 0.0183
##  7 10_GOP 10/10/2017 23:42 RightTroll       C:/Users/~ trump     3 0.0183
##  8 10_GOP 10/11/2017 22:14 RightTroll       C:/Users/~ trump     3 0.0183
##  9 10_GOP 10/11/2017 22:20 RightTroll       C:/Users/~ trump     3 0.0183
## 10 10_GOP 10/12/2017 0:38  RightTroll       C:/Users/~ trump     3 0.0183
## # ... with 157,834 more rows
These terms were sorted into Topic 3, the one we've labeled Fearmonger. Once again, this highlights the similarity between the political trolls and the fearmongering trolls in this dataset.
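For a more compact view of the same thing, we can count those rows by account category and topic (same filter as above):

tweetwords %>%
  filter(word %in% c("donald", "trump")) %>%
  count(account_category, topic, sort = TRUE)

Because each word was mapped to its single top topic, the topic column is the same no matter which account category used it; the counts just show who was tweeting these words the most.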