Thursday, June 14, 2018

Working with Your Facebook Data in R

How to Read in and Clean Your Facebook Data - I recently learned that you can download all of your Facebook data, so I decided to check it out and bring it into R. To access your data, go to Facebook and click on the white down arrow in the upper-right corner. From there, select Settings, then, from the column on the left, "Your Facebook Information." When you get the Facebook Information screen, select "View" next to "Download Your Information." On this screen, you'll be able to select the kind of data you want, a date range, and a format. I only wanted my posts, so under "Your Information," I deselected everything but the first item on the list, "Posts." (Note that this will still download all photos and videos you posted, so it will be a large file.) To make it easy to bring into R, I selected JSON under Format (the other option is HTML).


After you click "Create File," it will take a while to compile - you'll get an email when it's ready. You'll need to reenter your password when you go to download the file.

The result is a Zip file, which contains folders for Posts, Photos, and Videos. Posts includes your own posts (on your and others' timelines) as well as posts from others on your timeline. And, of course, the file needed a bit of cleaning. Here's what I did.

Since the post data is a JSON file, I need the jsonlite package to read it.

setwd("C:/Users/slocatelli/Downloads/facebook-saralocatelli35/posts")
library(jsonlite)

FBposts <- fromJSON("your_posts.json")

This creates a large list object, with my data in a data frame. So as I did with the Taylor Swift albums, I can pull out that data frame.

myposts <- FBposts$status_updates

The resulting data frame has 5 columns: timestamp, which is in UNIX format; attachments, any photos, videos, URLs, or Facebook events attached to the post; title, which always starts with the author of the post (you or your friend who posted on your timeline) followed by the type of post; data, the text of the post; and tags, the people you tagged in the post.

First, I converted the timestamp to datetime, using the anytime package.

library(anytime)

myposts$timestamp <- anytime(myposts$timestamp)

Next, I wanted to pull out post author, so that I could easily filter the data frame to only use my own posts.

library(tidyverse)
myposts$author <- word(string = myposts$title, start = 1, end = 2, sep = fixed(" "))

Finally, I was interested in extracting URLs I shared (mostly from YouTube or my own blog) and the text of my posts, which I did with some regular expression functions and some help from Stack Overflow (here and here).

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

myposts$links <- str_extract(myposts$attachments, url_pattern)

library(qdapRegex)
myposts$posttext <- myposts$data %>%
  rm_between('"','"',extract = TRUE)

There's more cleaning I could do, but this gets me a data frame I could use for some text analysis. Let's look at my most frequent words.

myposts$posttext <- as.character(myposts$posttext)
library(tidytext)
mypost_text <- myposts %>%
  unnest_tokens(word, posttext) %>%
  anti_join(stop_words)
## Joining, by = "word"
counts <- mypost_text %>%
  filter(author == "Sara Locatelli") %>%
  drop_na(word) %>%
  count(word, sort = TRUE)

counts
## # A tibble: 9,753 x 2
##    word         n
##    <chr>    <int>
##  1 happy     4702
##  2 birthday  4643
##  3 today's    666
##  4 song       648
##  5 head       636
##  6 day        337
##  7 post       321
##  8 009f       287
##  9 ð          287
## 10 008e       266
## # ... with 9,743 more rows

These data include all my posts, including writing "Happy birthday" on others' timelines. I also frequently post the song in my head when I wake up in the morning (over 600 times, it seems). If I wanted to remove those, and only count the times I said "happy" or "song" outside of those posts, I'd need to apply that filter in an earlier step. There are also some strange characters I want to clean from the data before doing anything else with them. I can easily remove pure numbers and stray characters with str_detect, but strings that mix numbers and letters, such as "008e", won't be caught that way, so I'll just filter them out separately.

drop_nums <- c("008a","008e","009a","009c","009f")

counts <- counts %>%
  filter(str_detect(word, "[a-z]+"), #keep only words that contain letters
         !word %in% drop_nums) #drop the leftover encoding strings

Now I could, for instance, create a word cloud.

library(wordcloud)
counts %>%
  with(wordcloud(word, n, max.words = 50))

In addition to posting for birthdays and head songs, I talk a lot about statistics, data, analysis, and my blog. I also post about beer, concerts, friends, books, and Chicago. Let's see what happens if I mix some sentiment analysis into my word cloud.

library(reshape2)
## 
## Attaching package: 'reshape2'
counts %>%
  inner_join(get_sentiments("bing")) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red","blue"), max.words = 100)
## Joining, by = "word"

Once again, a few words are likely being misclassified - regression and plot are both negatively-valenced, but I imagine I'm using them in the statistical sense instead of the negative sense. I also apparently use "died" or "die," but I suspect mostly in the context of "I died laughing at this." And "happy" is huge, because it includes birthday wishes as well as instances where I talk about happiness. Some additional cleaning and exploration of the data is certainly needed. But that's enough to get started with this huge example of "me-search."
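If I wanted to take that cleaning a step further, a rough sketch might filter out the routine birthday and head-song posts before tokenizing and treat the stray encoding strings as custom stop words. (This isn't something I did above, and the exact post-text patterns below are just guesses.)

#A possible next step (my own sketch): drop the routine posts and
#add custom stop words before tokenizing
custom_stops <- tibble(word = c("008a","008e","009a","009c","009f","ð"))

mypost_text_clean <- myposts %>%
  filter(author == "Sara Locatelli",
         !str_detect(posttext, regex("happy birthday", ignore_case = TRUE)),
         !str_detect(posttext, regex("song in my head", ignore_case = TRUE))) %>%
  unnest_tokens(word, posttext) %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(custom_stops, by = "word")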

Tuesday, June 12, 2018

Beautiful Visualizations in R

I recently discovered the R Graph Gallery, where users can share the beautiful visualizations they've created using R and its various libraries (especially ggplot2). One of my favorite parts about this gallery is a section called Art From Data, in which users create works of art, sometimes with real data, and sometimes with a random number generator and a little imagination.

Last night, I completed a DataCamp project to learn how to draw flowers in R and ggplot2. Based on that, I created this little yellow flower:


Not only was the flower fun to create, it made me think about the data and how it would appear spatially. As I try to create new and more complex images, I have to keep building on and challenging those skills. It's a good exercise to get you thinking about data.
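If you'd like to try something similar, here's a minimal sketch of the general idea - a golden-angle spiral of points drawn with ggplot2. This isn't the exact DataCamp code, and the specific parameter values are just ones I picked.

#A minimal sketch of a golden-angle flower (parameter values are arbitrary)
library(tidyverse)

points <- 500
angle <- pi * (3 - sqrt(5)) #the golden angle

flower <- tibble(t = 1:points,
                 x = sqrt(t) * cos(t * angle),
                 y = sqrt(t) * sin(t * angle))

ggplot(flower, aes(x, y)) +
  geom_point(size = 3, alpha = 0.5, color = "gold") +
  coord_equal() +
  theme_void()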

Sunday, June 10, 2018

Statistics Sunday: Creating Wordclouds

Cloudy with a Chance of Words - Lots of fun projects in the works, so today's post will be short - a demonstration of how to create wordclouds, both with and without sentiment analysis results. While I could use song lyrics again, I decided to use a different dataset that comes with the quanteda package: all 58 Inaugural Addresses, from Washington's first speech in 1789 to Trump's in 2017.

library(quanteda) #install with install.packages("quanteda") if needed
data(data_corpus_inaugural)
speeches <- data_corpus_inaugural$documents
row.names(speeches) <- NULL

As you can see, this dataset has each Inaugural Address in a column called "texts," with year and President's name as additional variables. To analyze the words in the speeches, and generate a wordcloud, we'll want to unnest the words in the texts column.

library(tidytext)
library(tidyverse)
speeches_tidy <- speeches %>%
  unnest_tokens(word, texts) %>%
  anti_join(stop_words)
## Joining, by = "word"

For our first wordcloud, let's see which words are most common across all speeches.

library(wordcloud) #install.packages("wordcloud") if needed
speeches_tidy %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 50))

While the language used by Presidents certainly varies by time period and the national situation, these speeches refer often to the people and the government; in fact, most of the larger words directly reference the United States and Americans. The speeches address the role of "president" and likely the "duty" that role entails. The word "peace" is only slightly larger than "war," and one could probably map out which speeches were given during wartime and which weren't.

We could very easily create a wordcloud for one President specifically. For instance, let's create one for Obama, since he provides us with two speeches' worth of words. But to take things up a notch, let's add sentiment information to our wordcloud. To do that, we'll use the comparison.cloud function; we'll also need the reshape2 library.

library(reshape2) #install.packages("reshape2") if needed
obama_words <- speeches_tidy %>%
  filter(President == "Obama") %>%
  count(word, sort = TRUE)

obama_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment %in% c("positive",
                                       "negative"))) %>%
  filter(n > 1) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red","blue"))
## Joining, by = "word"

The acast statement reshapes the data, putting our sentiments of positive and negative as separate columns. Setting fill = 0 is important, since a negative word will be missing a value for the positive column and vice versa; without fill = 0, it would drop any row with NA in one of the columns (which would be every word in the set). As a sidenote, we could use the comparison cloud to compare words across two documents, such as comparing two Presidents. The columns would be counts for each President, as opposed to count by sentiment.
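For instance, a quick sketch comparing Obama and Trump (chosen arbitrarily, and not something I've polished) might look like this:

#Sketch: compare word counts for two Presidents instead of two sentiments
speeches_tidy %>%
  filter(President %in% c("Obama", "Trump")) %>%
  count(President, word, sort = TRUE) %>%
  acast(word ~ President, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("blue", "red"), max.words = 100)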

Interestingly, the NRC classifies "government" and "words" as negative. But even if we ignore those two words, which are Obama's most frequent, the negatively-valenced words are much larger than most of his positively-valenced words. So while he uses many more distinct positively-valenced words than negatively-valenced ones - seen in the sheer number of blue words - he uses the negatively-valenced words more often. If you were so inclined, you could run a sentiment analysis on his speeches to see whether they tend to be more positive or negative, and/or whether they follow arcs of negativity and positivity. And feel free to generate your own wordcloud: all you'd need to do is change the filter(President == "") to whatever President you're interested in examining (or whatever text data you'd like to use, if Presidents' speeches aren't your thing).
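Here's a rough sketch of what that might look like for Obama's two addresses, using the Bing dictionary and 100-word chunks (both arbitrary choices on my part):

#Sketch: net Bing sentiment across each Obama address in 100-word chunks
obama_sentiment <- speeches_tidy %>%
  filter(President == "Obama") %>%
  group_by(Year) %>%
  mutate(index = row_number() %/% 100) %>%
  inner_join(get_sentiments("bing")) %>%
  count(Year, index, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net = positive - negative)

ggplot(obama_sentiment, aes(index, net)) +
  geom_col() +
  facet_wrap(~ Year)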

Wednesday, June 6, 2018

The Importance of Training Data

In the movie Arrival, 12 alien ships visit Earth, landing at various locations around the world. Dr. Louise Banks, a linguist, is brought in to help make contact with the aliens who have landed in Montana.

With no experience with their language, Louise instead teaches the aliens, which they call heptapods, English while they teach her their own language. Other countries around the world follow suit. But China seems to view the heptapods with suspicion that turns into outright hostility later on.

Dr. Banks learns that China was using Mahjong to teach/communicate with the heptapods. She points out that using a game in this way changes the nature of the communication - everything becomes a competition with winners and losers. As they said in the movie, paraphrasing something said by Abraham Maslow, "If all I ever gave you was a hammer, everything is a nail."

Training material matters and can drastically affect the outcome. Just look at Norman, the psychopathic AI developed at MIT. As described in an article from BBC:
The psychopathic algorithm was created by a team at the Massachusetts Institute of Technology, as part of an experiment to see what training AI on data from "the dark corners of the net" would do to its world view.

The software was shown images of people dying in gruesome circumstances, culled from a group on the website Reddit.

Then the AI, which can interpret pictures and describe what it sees in text form, was shown inkblot drawings and asked what it saw in them.

Norman's view was unremittingly bleak - it saw dead bodies, blood and destruction in every image.

Alongside Norman, another AI was trained on more normal images of cats, birds and people.

It saw far more cheerful images in the same abstract blots.

The fact that Norman's responses were so much darker illustrates a harsh reality in the new world of machine learning, said Prof Iyad Rahwan, part of the three-person team from MIT's Media Lab which developed Norman.

"Data matters more than the algorithm."
This finding is especially important when you consider how AI has and can be used - for instance, in assessing risk of reoffending among parolees or determining which credit card applications to approve. If the training data is flawed, the outcome will be too. In fact, a book I read last year, Weapons of Math Destruction, discusses how decisions made by AI can reflect the biases of their creators. I highly recommend reading it!

Tuesday, June 5, 2018

Statistics "Sunday": More Sentiment Analysis Resources

I've just returned from a business trip - lots of long days, working sometimes from 8 am to 9 pm or 10 pm. I didn't get a chance to write this week's Statistics Sunday post, in part because I wasn't entirely certain what to write about. But as I started digging into sentiment analysis tools for a fun project I'm working on - and will hopefully post about soon - I found a few things I wanted to share.

The tidytext package in R is great for tokenizing text and running sentiment analysis with 4 dictionaries: Afinn, Bing, Loughran, and NRC. During some web searches for additional tricks to use with these tools, I found another R package: syuzhet, which includes Afinn, Bing, and NRC, as well as the Syuzhet dictionary developed in the Nebraska Literary Lab, and a method to access Stanford's powerful natural language processing and sentiment analysis software, which can predict sentiment through deep learning methods.
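For a quick look at the syuzhet interface, a minimal sketch (with made-up sample sentences) would be:

#A minimal sketch of syuzhet's get_sentiment (sample sentences are made up)
library(syuzhet) #install with install.packages("syuzhet") if needed

sentences <- c("I love this package.",
               "This was a terrible, stressful week.")

get_sentiment(sentences, method = "syuzhet")
get_sentiment(sentences, method = "afinn")
get_sentiment(sentences, method = "nrc")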

I plan to keep using the tidytext package for much of this analysis, but will probably draw upon the syuzhet package for some of the sentiment analysis, especially to use the Stanford methods. And there are still some big changes on the horizon for Deeply Trivial, including more videos and a new look!

Tuesday, May 29, 2018

Stress and Its Effect on the Body

After a stressful winter and spring, I'm finally taking a break from work. So of course, what better time to get sick? After a 4-day migraine (started on my first day of vacation - Friday) with a tension headache and neck spasm so bad I couldn't look left, I ended up in urgent care yesterday afternoon. One injection of muscle relaxer, plus prescriptions for more muscle relaxers and migraine meds, and I'm finally feeling better.

Why does this happen? Why is it that after weeks or months of stress, we get sick when we finally get to "come down"?

I've blogged a bit about stress before. Stress causes your body to release certain hormones: adrenaline and norepinephrine, which produce an immediate physiological response, and cortisol, which takes a bit longer to make itself felt. Cortisol is also involved in many of the negative consequences of chronic stress. Over time, it can increase blood sugar, suppress the immune system, and contribute to acne breakouts.

You're probably aware that the symptoms of sickness are generally caused by your body reacting to and fighting an infection or virus. So the reason you suddenly get sick when the stressor goes away is that your immune system ramps back up, recognizes a foreign invader that doesn't belong, and starts fighting it. You had probably already caught the virus or infection, but didn't have the symptoms, like fever (your body's attempt to "cook" out the bug) or a runny nose (increased mucus production to push it out), that would have let you know you were sick.

And in my case in particular, a study published in Neurology found that migraine sufferers were at increased risk of an attack after the stress "let-down." According to the researchers, this effect is even stronger when there is a huge build-up of stress and a sudden, large let-down; it's better to have mini let-downs throughout the stressful experience.

And here I thought I was engaging in a good amount of self-care throughout my stressful February-May.

Sunday, May 27, 2018

Statistics Sunday: Two R Packages to Check Out

I'm currently out of town and not spending as much time on my computer as I have over the last couple months. (It's what happens when you're the only one in your department at work and also most of your hobbies involve a computer.) But I wanted to write up something for Statistics Sunday and I recently discovered two R packages I need to check out in the near future.

The first is called echor, which allows you to search and download data directly from the US Environmental Protection Agency (EPA) Enforcement and Compliance History Online (ECHO) database, using the ECHO API; I've sketched what a query might look like after the list below. According to the vignette, linked above, "ECHO provides data for:
  • Stationary sources permitted under the Clean Air Act, including data from the National Emissions Inventory, Greenhouse Gas Reporting Program, Toxics Release Inventory, and Clean Air Markets Division Acid Rain Program and Clean Air Interstate Rule. 
  • Public drinking water systems permitted under the Safe Drinking Water Act, including data from the Safe Drinking Water Information System. 
  • Hazardous Waste Handlers permitted under the Resource Conservation and Recovery Act, with data drawn from the RCRAInfo data system. 
  • Facilities permitted under the Clean Water Act and the National Pollutant Discharge Elimination Systems (NPDES) program, including data from EPA’s ICIS-NPDES system and possibly waterbody information from EPA’s ATTAINS data system."
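I haven't tried echor yet, so consider this an untested sketch - the function name and arguments below are my best guess from the documentation, and worth checking against the vignette before relying on them:

#Untested sketch: check the function name and arguments against the echor vignette
library(echor)

facilities <- echoWaterGetFacilityInfo(xmin = "-96.39", ymin = "30.58",
                                       xmax = "-96.28", ymax = "30.64",
                                       output = "df")
head(facilities)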
The second package is papaja, or Preparing APA Journal Articles, which uses RStudio and R Markdown to create APA-formatted papers. Back when I wrote APA style papers regularly, I had Word styles set up to automatically format headers and subheaders, but properly formatting tables and charts was another story. This package promises to do all of that. It's still in development, but you can find out more about it here and here.
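I haven't worked through papaja yet either, but from the documentation, the pieces look something like this (a sketch I haven't tested: apa_print() formats test results for inline reporting, and apa_table() builds APA-style tables in R Markdown):

#Untested sketch based on the papaja documentation
#devtools::install_github("crsh/papaja")
library(papaja)

tt <- t.test(extra ~ group, data = sleep)
apa_print(tt)$full_result #APA-formatted result string to paste into the text

apa_table(head(sleep), caption = "First few rows of the sleep data")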

I have some fun analysis projects in the works! Stay tuned.