library(quanteda) #install with install.packages("quanteda") if needed
data(data_corpus_inaugural) speeches <- data_corpus_inaugural$documents row.names(speeches) <- NULL
As you can see, this dataset has each Inaugural Address in a column called "texts," with year and President's name as additional variables. To analyze the words in the speeches, and generate a wordcloud, we'll want to unnest the words in the texts column.
speeches_tidy <- speeches %>% unnest_tokens(word, texts) %>% anti_join(stop_words)
For our first wordcloud, let's see what are the most common words across all speeches.
While the language used by Presidents certainly varies by time period and the national situation, these speeches refer often to the people and the government; in fact, most of the larger words directly reference the United States and Americans. The speeches address the role of "president" and likely the "duty" that role entails. The word "peace" is only slightly larger than "war," and one could probably map out which speeches were given during wartime and which weren't.
library(wordcloud) #install.packages("wordcloud") if needed
speeches_tidy %>% count(word, sort = TRUE) %>% with(wordcloud(word, n, max.words = 50))
We could very easily create a wordcloud for one President specifically. For instance, let's create one for Obama, since he provides us with two speeches worth of words. But to take things up a notch, let's add sentiment information to our wordcloud. To do that, we'll use the comparison.cloud function; we'll also need the reshape2 library.
The acast statement reshapes the data, putting our sentiments of positive and negative as separate columns. Setting fill = 0 is important, since a negative word will be missing a value for the positive column and vice versa; without fill = 0, it would drop any row with NA in one of the columns (which would be every word in the set). As a sidenote, we could use the comparison cloud to compare words across two documents, such as comparing two Presidents. The columns would be counts for each President, as opposed to count by sentiment.
library(reshape2) #install.packages("reshape2") if needed
obama_words <- speeches_tidy %>% filter(President == "Obama") %>% count(word, sort = TRUE) obama_words %>% inner_join(get_sentiments("nrc") %>% filter(sentiment %in% c("positive", "negative"))) %>% filter(n > 1) %>% acast(word ~ sentiment, value.var = "n", fill = 0) %>% comparison.cloud(colors = c("red","blue"))
Interestingly, the NRC classifies "government" and "words" as negative. But even if we ignore those two words, which are Obama's most frequent, the negatively-valenced words are much larger than most of his positively-valenced words. So while he uses many more positively-valenced words than negatively-valenced words - seen by the sheer number of blue words - he uses the negatively-valenced words more often. If you were so inclined, you could probably run a sentiment analysis on his speeches and see if they tend to be more positive or negative, and/or if they follow arcs of negativity and positivity. And feel free to generate your own wordcloud: all you'd need to do is change the filter(President == "") to whatever President you're interested in examining (or whatever text data you'd like to use, if President's speeches aren't your thing).
Post a Comment