Thursday, June 14, 2018

Working with Your Facebook Data in R

How to Read in and Clean Your Facebook Data - I recently learned that you can download all of your Facebook data, so I decided to check it out and bring it into R. To access your data, go to Facebook, and click on the white down arrow in the upper-right corner. From there, select Settings, then, from the column on the left, "Your Facebook Information." When you get the Facebook Information screen, select "View" next to "Download Your Information." On this screen, you'll be able to select the kind of data you want, a date range, and format. I only wanted my posts, so under "Your Information," I deselected everything but the first item on the list, "Posts." (Note that this will still download all photos and videos you posted, so it will be a large file.) To make it easy to bring into R, I selected JSON under Format (the other option is HTML).

After you click "Create File," it will take a while to compile - you'll get an email when it's ready. You'll need to reenter your password when you go to download the file.

The result is a Zip file, which contains folders for Posts, Photos, and Videos. Posts includes your own posts (on your and others' timelines) as well as posts from others on your timeline. And, of course, the file needed a bit of cleaning. Here's what I did.

Since the post data is a JSON file, I need the jsonlite package to read it.


FBposts <- fromJSON("your_posts.json")

This creates a large list object, with my data in a data frame. So as I did with the Taylor Swift albums, I can pull out that data frame.

myposts <- FBposts$status_updates

The resulting data frame has 5 columns: timestamp, which is in UNIX format; attachments, any photos, videos, URLs, or Facebook events attached to the post; title, which always starts with the author of the post (you or your friend who posted on your timeline) followed by the type of post; data, the text of the post; and tags, the people you tagged in the post.

First, I converted the timestamp to datetime, using the anytime package.


myposts$timestamp <- anytime(myposts$timestamp)

Next, I wanted to pull out post author, so that I could easily filter the data frame to only use my own posts.

myposts$author <- word(string = myposts$title, start = 1, end = 2, sep = fixed(" "))

Finally, I was interested in extracting URLs I shared (mostly from YouTube or my own blog) and the text of my posts, which I did with some regular expression functions and some help from Stack Overflow (here and here).

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

myposts$links <- str_extract(myposts$attachments, url_pattern)

myposts$posttext <- myposts$data %>%
  rm_between('"','"',extract = TRUE)

There's more cleaning I could do, but this gets me a data frame I could use for some text analysis. Let's look at my most frequent words.

myposts$posttext <- as.character(myposts$posttext)
mypost_text <- myposts %>%
  unnest_tokens(word, posttext) %>%
## Joining, by = "word"
counts <- mypost_text %>%
  filter(author == "Sara Locatelli") %>%
  drop_na(word) %>%
  count(word, sort = TRUE)

## # A tibble: 9,753 x 2
##    word         n
##    <chr>    <int>
##  1 happy     4702
##  2 birthday  4643
##  3 today's    666
##  4 song       648
##  5 head       636
##  6 day        337
##  7 post       321
##  8 009f       287
##  9 ð          287
## 10 008e       266
## # ... with 9,743 more rows

These data include all my posts, including writing "Happy birthday" on other's timelines. I also frequently post the song in my head when I wake up in the morning (over 600 times, it seems). If I wanted to remove those, and only include times I said happy or song outside of those posts, I'd need to apply the filter in a previous step. There are also some strange characters that I want to clean from the data before I do anything else with them. I can easily remove these characters and numbers with string detect, but cells that contain numbers and letters, such as "008e" won't be cut out with that function. So I'll just filter them out separately.

drop_nums <- c("008a","008e","009a","009c","009f")

counts <- counts %>%
  filter(str_detect(word, "[a-z]+"),
         !word %in% str_detect(word, "[0-9]"),
         !word %in% drop_nums)

Now I could, for instance, create a word cloud.

counts %>%
  with(wordcloud(word, n, max.words = 50))

In addition to posting for birthdays and head songs, I talk a lot about statistics, data, analysis, and my blog. I also post about beer, concerts, friends, books, and Chicago. Let's see what happens if I mix in some sentiment analysis to my word cloud.

## Attaching package: 'reshape2'
counts %>%
  inner_join(get_sentiments("bing")) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>% = c("red","blue"), max.words = 100)
## Joining, by = "word"

Once again, a few words are likely being misclassified - regression and plot are both negatively-valenced, but I imagine I'm using them in the statistical sense instead of the negative sense. I also apparently use "died" or "die" but I suspect in the context of, "I died laughing at this." And "happy" is huge, because it includes birthday wishes as well as instances where I talk about happiness. Some additional cleaning and exploration of the data is certainly needed. But that's enough to get started with this huge example of "me-search."


  1. Great post! I'm currently trying to do the same with private messages and would love to see another post on how to extract those.

  2. Great.... But for french people, the encoding of the facebook json posts file seems problematic for fromJSON....

    1. Oh no! I wonder if treating it as text might help deal with the different encoding and characters. The readtext package has a way of reading JSON: There's another JSON library for R, rjson that might help:

  3. Hi, I think this is really cool. Thanks for the comprehensive code which I have used it on my own FB data. However, the last part on sentiment analysis of positive and negative words, the words don’t seem to be mine.