Deeply Trivial: June 2018

Saturday, June 30, 2018

What I Did on My Week's Vacation

It's been quite a week. I traveled to Kansas City Monday to spend time with my family. Tuesday was my parents' 47th wedding anniversary. Unfortunately, we ended up spending part of Tuesday at the ER when my dad's breathing problems were exacerbated by the hot weather and their air conditioner dying over the weekend. In fact, I was at the ER when a tornado warning was issued in Kansas City, MO, and I spent that time hunkered down in a back hallway waiting for it to pass. It turned out to be a small tornado, mostly knocking over some trees and dropping golf ball sized hail on part of the city. We returned home to restored air conditioning and a beautiful (but hot) rest of the day.

Wednesday, we stayed in and got take-out from Joe's Kansas City BBQ.

Thursday, I finally took the Boulevard Brewery tour, starting with a glass of Tank 7, and got to see the brewing operation.

Our tour guide showing where the brewery began.
This section was the entire operation in 1989, but now is used for experimental beers.

The centrifuge that filters the beer. If this thing ever broke down, we're told it would send a stream of beer at such force, we'd be sliced in half. We decided to refer to that phenomenon as a "beersaber."

The tour ended with a beer and food tasting, featuring The Calling Double IPA and an open-faced ham sandwich with Clementine-Thyme marmalade, The Sixth Glass Quadrupel with brie and fig, The Bourbon Barrel-Aged Quadrupel (BBQ, made from The Sixth Glass aged in bourbon barrels and my favorite from the tasting) and goat-cheese cheesecake with blackberry, and The Dark Truth Imperial Stout and a chocolate chip cookie. All food came from KC restaurants and bakeries.

I picked up some yummy beer, including a four-pack of the BBQ, and a jar of the marmalade to bring home to Chicago.

I did a lot of writing - more on that (and hopefully some good news) later - and finished a couple books. Now, back to Chicago!

Sunday, June 24, 2018

Statistics Sunday: Converting Between Effect Sizes for Meta-Analysis

Converting Between Effect Sizes I'm currently working on my promised video on mixed effects meta-analysis, and was planning on covering this particular topic in that video - converting between effect sizes. But I decided to do this as a separate post that I can reference in the video, which I hope to post next week.

As a brief refresher, meta-analysis is aimed at estimating the true effect (or effects) in an area of study by combining findings from multiple studies on that topic. Effect sizes, the most frequently used being Cohen's d, Pearson's r, and log odds ratio, are estimated from information in study reports and presentations. There's a lot of variation in how clearly reports and documents describe the findings and the information given to estimate the study's overall effect. But when you conduct the meta-analysis, whether using fixed, random, or mixed effects analysis, you need to use only one type of effect size. That means that, sometimes, studies will give you a different type of effect size than you plan to use. Fortunately, there are ways to convert between effect sizes and use different types of statistical information to generate your estimates.

First up, converting between those key effect sizes. In the meta-analysis I performed in grad school, I examined the effect of pretrial publicity on guilt. There are two ways guilt was frequently operationalized in the studies: as a guilty/not guilty verdict or as a continuous guilt rating. For those outcomes, we would likely use, respectively, log odds ratio and Cohen's d. The escalc function in the metafor package can compute log odds ratio for guilty/not guilty counts, and Cohen's d for mean and standard deviation of the guilt ratings. But studies may use different types of information when presenting their results, so you may not be able to simply compute those effect sizes.

For instance, a study using verdict may present a chi-square and one of its effect sizes, Cramer's V, which is very similar to a correlation coefficient. How can I convert that into log odds ratio?

To convert from one effect size to the other, you need to follow a prescribed path, which can be seen in the diagram below. What this diagram tells you is which effect sizes you can convert between directly: you can directly convert between log odds ratio and Cohen's d, and between Cohen's d and Pearson's r. If you wanted to convert between Pearson's r and log odds ratio, you'll first need to convert to Cohen's d. You'll need to do the same thing for variance - compute it for the native effect size metric, then convert that to the new effect size metric.

Let's start by setting up functions that will convert between our effect sizes for us, beginning with Cohen's d and log odds ratio. Then we'll demonstrate with some real data.

#Convert log odds ratio to d
ltod <- function(lor) {
  d = lor * (sqrt(3)/pi)
  return(d)
}
vltovd <- function(vl) {
  vd = vl * (3/pi^2)
  return(vd)
}

#Convert d to log odds ratio
dtol <- function(d) {
  lor = d*(pi/sqrt(3))
  return(lor)
}
vdtovl <- function(vd) {
  vl = vd*(pi^2/3)
  return(vl)
}

You'll notice a mathematical symmetry in these equations - the numerators and denominators switch between the equations. Now let's set up equations to r and d. These equations are slightly more complex and will require a few additional arguments. For instance, converting the variance of r to variance of d requires both the variance of r and r itself. Converting from d to r requires group sample sizes, referred to as n1 and n2.

#Convert r to d
rtod <- function(r) {
  d = (2*r)/(sqrt(1-r^2))
  return(d)
}
vrtovd <- function(vr,r) {
  vd = (4*vr)/(1-r^2)^3
  return(vd)
}

#Convert d to r
dtor <- function(n1,n2,d) {
  a = (n1+n2)^2/(n1*n2)
  r = d/(sqrt(d^2+a))
  return(r)
}
vdtovr <- function(n1,n2,vd,d) {
  a = (n1+n2)^2/(n1*n2)
  vr = a^2*vd/(d^2+a)^3
  return(vr)
}

Remember that the metafor package can compute effect sizes and variances for you, so you might want to run the escalc on the native effect sizes so that you have the estimates and variances you need to run these functions. But if you ever find yourself having to compute those variances by hand, here are the equations, which we'll use in the next step.

vard <- function(n1,n2,d) {
  vd = ((n1+n2)/(n1*n2)) + (d^2/(2*(n1+n2)))
  return(vd)
}

varr <- function(r,n) {
  vr = (1-r^2)^2/(n-1)
  return(vr)
}

varlor <- function(a,b,c,d) {
  vl = (1/a)+(1/b)+(1/c)+(1/d)
  return(vl)
}

One of the studies I included in my meta-analysis gave Cramer's V. It had a sample size of 42, with 21 people in each group. I'd like to convert that effect size to log odds ratio. Here's how I could do it.

cramerv <- 0.67
studyd <- rtod(cramerv)
studyvr <- varr(0.67,42)
studyvd <- vrtovd(studyvr,cramerv)
dtol(studyd)

## [1] 3.274001

vdtovl(studyvd)

## [1] 0.5824038

I can now include this study in my meta-analysis of log odds ratios.

What if my study gives different information? For instance, it might have given me a chi-square or a t-value. This online effect size calculator, created by David Wilson, coauthor of Practical Meta-Analysis, can compute effect sizes for you from many different types of information. In fact, spoiler alert: I used an earlier version of this calculator extensively for my meta-analysis. Note that this calculator returns odds ratios, so you'll need to convert those values into a log odds ratio.

Friday, June 22, 2018

Thanks for Reading!

As I've been blogging more about statistics, R, and research in general, I've been trying to increase my online presence, sharing my blog posts in groups of like-minded people. Those efforts seem to have paid off, based on my view counts over the past year:

And based on read counts, here are my top 10 blog posts, most of which are stats-related:

Beautiful Asymmetry - none of us is symmetrical, and that's okay
Statistical Sins: Stepwise Regression - just step away from stepwise regression
Statistics Sunday: What Are Degrees of Freedom? (Part 1) - and read Part 2 here
Working with Your Facebook Data in R
Statistics Sunday: Free Data Science and Statistics Resources
Statistics Sunday: What is Bootstrapping?
Statistical Sins: Know Your Variables (A Confession) - we all make mistakes, but we should learn from them
Statistical Sins: Not Making it Fun (A Thinly Veiled Excuse to Post a Bunch of XKCD Cartoons) - the subtitle says it all
Statistics Sunday: Taylor Swift vs. Lorde - Analyzing Song Lyrics - analyzing song lyrics is my jam
How Has Taylor Swift's Word Choice Changed Over Time? - ditto

It's so nice to see people are enjoying the posts, even sharing them and reaching out with additional thoughts and questions. Thanks, readers!

Paul McCartney on Carpool Karaoke

If you haven't watched yet, you need to check out the Carpool Karaoke that had James Corden in tears:

Tuesday, June 19, 2018

Purchases and Happiness

I've heard it said before that it's better to spend your money on experiences than materials. While I appreciate the sentiment - memories last longer than stuff - something about that statement has always bothered me and I couldn't put my finger on what. But a recent study in Psychological Science, "Experiential or Material Purchases? Social Class Determines Purchase Happiness," helped shed some light on when that sentiment might not be true.

The study is a meta-analysis of past research, as well as a report of 3 additional studies performed by the authors, to determine whether social class determines happiness with experiential versus material purchases. Their hypothesis was that experiences are more valuable to people with higher socioeconomic status - that is, because their material needs have been met, they can focus on higher needs - while materials would be more valuable to people with lower socioeconomic status - people who may struggling with basic needs like food and clothing. They confirmed their hypothesis, not only when examining participants' actual SES, but also when it was experimentally determined, by asking participants to imagine their monthly income had been increased or decreased. As they sum up in the article:

We argue that social class influences purchase happiness because resource abundance focuses people on internal states and goals, such as self-development, self-expression, and the pursuit of uniqueness (Kraus et al., 2012; Stephens et al., 2012; Stephens et al., 2007), whereas resource deprivation orients people toward resource management and spending money wisely (Fernbach et al., 2015; Van Boven & Gilovich, 2003). These fundamentally different value orientations translate into different purchase motives held by people from higher and lower classes (Lee, Priester, Hall, & Wood, 2018).

Monday, June 18, 2018

Statistics Sunday: Accessing the YouTube API with tuber

I haven't had a lot of time to play with this but yesterday, I discovered the tuber R package, which allows you to interact with the YouTube API.

To use the tuber package, not only do you need to install it in R - you'll need a Google account and will have to authorize 4 APIs through Developer Console: all 3 YouTube APIs (though the Data API will be doing the heavy lifting) and the Freebase API. Before you authorize the first API, Google will have you create a project to tie the APIs to. Then, you'll find the APIs in the API library to add to this project. Click on each API and on the next screen, select Enable. You'll need to create credentials for each of the YouTube APIs. When asked to identify the type of app that will accessing the YouTube API, select "Other".

The tuber package requires two pieces of information from you, which you'll get when you set up credentials for the OAuth 2.0 Client: client ID and client secret. Once you set those up, you can download them at any time in JSON format by going to the Credentials dashboard and clicking the download button on the far right.

In R, load the tuber package, then call the yt_oauth function, using the client ID (which should end with something like "apps.googleusercontent.com") and client secret (a string of letters and numbers). R will launch a browser window to authorize tuber to access the APIs. Once that's done, you'll be able to use the tuber package to, for instance, access data about a channel or get information about a video. My plan is to use my Facebook dataset to pull out the head songs I've shared and get the video information to generate a dataset of my songs. Look for more on that later. In the meantime, this great post on insightr can give you some details about using the tuber package.

[Apologies for the short and late post - I've been sick and haven't had as much time to write recently. Hopefully I'll get back to normal over the next week.]

Thursday, June 14, 2018

Working with Your Facebook Data in R

How to Read in and Clean Your Facebook Data - I recently learned that you can download all of your Facebook data, so I decided to check it out and bring it into R. To access your data, go to Facebook, and click on the white down arrow in the upper-right corner. From there, select Settings, then, from the column on the left, "Your Facebook Information." When you get the Facebook Information screen, select "View" next to "Download Your Information." On this screen, you'll be able to select the kind of data you want, a date range, and format. I only wanted my posts, so under "Your Information," I deselected everything but the first item on the list, "Posts." (Note that this will still download all photos and videos you posted, so it will be a large file.) To make it easy to bring into R, I selected JSON under Format (the other option is HTML).

After you click "Create File," it will take a while to compile - you'll get an email when it's ready. You'll need to reenter your password when you go to download the file.

The result is a Zip file, which contains folders for Posts, Photos, and Videos. Posts includes your own posts (on your and others' timelines) as well as posts from others on your timeline. And, of course, the file needed a bit of cleaning. Here's what I did.

Since the post data is a JSON file, I need the jsonlite package to read it.

setwd("C:/Users/slocatelli/Downloads/facebook-saralocatelli35/posts")
library(jsonlite)

FBposts <- fromJSON("your_posts.json")

This creates a large list object, with my data in a data frame. So as I did with the Taylor Swift albums, I can pull out that data frame.

myposts <- FBposts$status_updates

The resulting data frame has 5 columns: timestamp, which is in UNIX format; attachments, any photos, videos, URLs, or Facebook events attached to the post; title, which always starts with the author of the post (you or your friend who posted on your timeline) followed by the type of post; data, the text of the post; and tags, the people you tagged in the post.

First, I converted the timestamp to datetime, using the anytime package.

library(anytime)

myposts$timestamp <- anytime(myposts$timestamp)

Next, I wanted to pull out post author, so that I could easily filter the data frame to only use my own posts.

library(tidyverse)

myposts$author <- word(string = myposts$title, start = 1, end = 2, sep = fixed(" "))

Finally, I was interested in extracting URLs I shared (mostly from YouTube or my own blog) and the text of my posts, which I did with some regular expression functions and some help from Stack Overflow (here and here).

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

myposts$links <- str_extract(myposts$attachments, url_pattern)

library(qdapRegex)

myposts$posttext <- myposts$data %>%
  rm_between('"','"',extract = TRUE)

There's more cleaning I could do, but this gets me a data frame I could use for some text analysis. Let's look at my most frequent words.

myposts$posttext <- as.character(myposts$posttext)
library(tidytext)
mypost_text <- myposts %>%
  unnest_tokens(word, posttext) %>%
  anti_join(stop_words)

## Joining, by = "word"

counts <- mypost_text %>%
  filter(author == "Sara Locatelli") %>%
  drop_na(word) %>%
  count(word, sort = TRUE)

counts

## # A tibble: 9,753 x 2
##    word         n
##    <chr>    <int>
##  1 happy     4702
##  2 birthday  4643
##  3 today's    666
##  4 song       648
##  5 head       636
##  6 day        337
##  7 post       321
##  8 009f       287
##  9 ð          287
## 10 008e       266
## # ... with 9,743 more rows

These data include all my posts, including writing "Happy birthday" on other's timelines. I also frequently post the song in my head when I wake up in the morning (over 600 times, it seems). If I wanted to remove those, and only include times I said happy or song outside of those posts, I'd need to apply the filter in a previous step. There are also some strange characters that I want to clean from the data before I do anything else with them. I can easily remove these characters and numbers with string detect, but cells that contain numbers and letters, such as "008e" won't be cut out with that function. So I'll just filter them out separately.

drop_nums <- c("008a","008e","009a","009c","009f")

counts <- counts %>%
  filter(str_detect(word, "[a-z]+"),
         !word %in% str_detect(word, "[0-9]"),
         !word %in% drop_nums)

Now I could, for instance, create a word cloud.

library(wordcloud)

counts %>%
  with(wordcloud(word, n, max.words = 50))

In addition to posting for birthdays and head songs, I talk a lot about statistics, data, analysis, and my blog. I also post about beer, concerts, friends, books, and Chicago. Let's see what happens if I mix in some sentiment analysis to my word cloud.

library(reshape2)

## 
## Attaching package: 'reshape2'

counts %>%
  inner_join(get_sentiments("bing")) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red","blue"), max.words = 100)

## Joining, by = "word"

Once again, a few words are likely being misclassified - regression and plot are both negatively-valenced, but I imagine I'm using them in the statistical sense instead of the negative sense. I also apparently use "died" or "die" but I suspect in the context of, "I died laughing at this." And "happy" is huge, because it includes birthday wishes as well as instances where I talk about happiness. Some additional cleaning and exploration of the data is certainly needed. But that's enough to get started with this huge example of "me-search."

Tuesday, June 12, 2018

Beautiful Visualizations in R

I recently discovered the R Graph Gallery, where users can share the beautiful visualizations they've created using R and its various libraries (especially ggplot2). One of my favorite parts about this gallery is a section called Art From Data, in which users create works of art, sometimes with real data, and sometimes with a random number generator and a little imagination.

Last night, I completed a DataCamp project to learn how to draw flowers in R and ggplot2. Based on that, I created this little yellow flower:

Not only was the flower fun to create, it made me think about the data and how it would appear spatially. As I try to create new and more complex images, I have to keep building on and challenging those skills. It's a good exercise to get you thinking about data.

Sunday, June 10, 2018

Statistics Sunday: Creating Wordclouds

Cloudy with a Chance of Words Lots of fun projects in the works, so today's post will be short - a demonstration on how to create wordclouds, both with and without sentiment analysis results. While I could use song lyrics again, I decided to use a different dataset that comes with the quanteda packages: all 58 Inaugural Addresses, from Washington's first speech in 1789 to Trump's in 2017.

library(quanteda) #install with install.packages("quanteda") if needed

data(data_corpus_inaugural)
speeches <- data_corpus_inaugural$documents
row.names(speeches) <- NULL

As you can see, this dataset has each Inaugural Address in a column called "texts," with year and President's name as additional variables. To analyze the words in the speeches, and generate a wordcloud, we'll want to unnest the words in the texts column.

library(tidytext)
library(tidyverse)

speeches_tidy <- speeches %>%
  unnest_tokens(word, texts) %>%
  anti_join(stop_words)

## Joining, by = "word"

For our first wordcloud, let's see what are the most common words across all speeches.

library(wordcloud) #install.packages("wordcloud") if needed

speeches_tidy %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 50))

While the language used by Presidents certainly varies by time period and the national situation, these speeches refer often to the people and the government; in fact, most of the larger words directly reference the United States and Americans. The speeches address the role of "president" and likely the "duty" that role entails. The word "peace" is only slightly larger than "war," and one could probably map out which speeches were given during wartime and which weren't.

We could very easily create a wordcloud for one President specifically. For instance, let's create one for Obama, since he provides us with two speeches worth of words. But to take things up a notch, let's add sentiment information to our wordcloud. To do that, we'll use the comparison.cloud function; we'll also need the reshape2 library.

library(reshape2) #install.packages("reshape2") if needed

obama_words <- speeches_tidy %>%
  filter(President == "Obama") %>%
  count(word, sort = TRUE)

obama_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment %in% c("positive",
                                       "negative"))) %>%
  filter(n > 1) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red","blue"))

## Joining, by = "word"

The acast statement reshapes the data, putting our sentiments of positive and negative as separate columns. Setting fill = 0 is important, since a negative word will be missing a value for the positive column and vice versa; without fill = 0, it would drop any row with NA in one of the columns (which would be every word in the set). As a sidenote, we could use the comparison cloud to compare words across two documents, such as comparing two Presidents. The columns would be counts for each President, as opposed to count by sentiment.

Interestingly, the NRC classifies "government" and "words" as negative. But even if we ignore those two words, which are Obama's most frequent, the negatively-valenced words are much larger than most of his positively-valenced words. So while he uses many more positively-valenced words than negatively-valenced words - seen by the sheer number of blue words - he uses the negatively-valenced words more often. If you were so inclined, you could probably run a sentiment analysis on his speeches and see if they tend to be more positive or negative, and/or if they follow arcs of negativity and positivity. And feel free to generate your own wordcloud: all you'd need to do is change the filter(President == "") to whatever President you're interested in examining (or whatever text data you'd like to use, if President's speeches aren't your thing).

Wednesday, June 6, 2018

The Importance of Training Data

In the movie Arrival, 12 alien ships visit Earth, landing at various locations around the world. Dr. Louise Banks, a linguist, is brought in to help make contact with the aliens who have landed in Montana.

With no experience with their language, Louise instead teaches the aliens, which they call heptapods, English while they teach her their own language. Other countries around the world follow suit. But China seems to view the heptapods with suspicion that turns into outright hostility later on.

Dr. Banks learns that China was using Mahjong to teach/communicate with the heptapods. She points out that using a game in this way changes the nature of the communication - everything becomes a competition with winners and loser. As they said in the movie, paraphrasing something said by Abraham Maslow, "If all I ever gave you was a hammer, everything is a nail."

Training material matters and can drastically affect the outcome. Just look at Norman, the psychopathic AI developed at MIT. As described in an article from BBC:

The psychopathic algorithm was created by a team at the Massachusetts Institute of Technology, as part of an experiment to see what training AI on data from "the dark corners of the net" would do to its world view.

The software was shown images of people dying in gruesome circumstances, culled from a group on the website Reddit.

Then the AI, which can interpret pictures and describe what it sees in text form, was shown inkblot drawings and asked what it saw in them.

Norman's view was unremittingly bleak - it saw dead bodies, blood and destruction in every image.

Alongside Norman, another AI was trained on more normal images of cats, birds and people.

It saw far more cheerful images in the same abstract blots.

The fact that Norman's responses were so much darker illustrates a harsh reality in the new world of machine learning, said Prof Iyad Rahwan, part of the three-person team from MIT's Media Lab which developed Norman.

"Data matters more than the algorithm."

This finding is especially important when you consider how AI has and can be used - for instance, in assessing risk of reoffending among parolees or determining which credit card applications to approve. If the training data is flawed, the outcome will be too. In fact, a book I read last year, Weapons of Math Destruction, discusses how decisions made by AI can reflect the biases of their creators. I highly recommend reading it!

Tuesday, June 5, 2018

Statistics "Sunday": More Sentiment Analysis Resources

I've just returned from a business trip - lots of long days, working sometimes from 8 am to 9 pm or 10 pm. I didn't get a chance to write this week's Statistics Sunday post, in part because I wasn't entirely certain what to write about. But as I started digging into sentiment analysis tools for a fun project I'm working on - and will hopefully post about soon - I found a few things I wanted to share.

The tidytext package in R is great for tokenizing text and running sentiment analysis with 4 dictionaries: Afinn, Bing, Loughran, and NRC. During some web searches for additional tricks to use with these tools, I found another R package: syuzhet, which includes Afinn, Bing, and NRC, as well as Syuzhet, developed in the Nebraska Literacy Lab, and a method to access the powerful Stanford Natural Language Processing and sentiment analysis software, which can predict sentiment through deep learning methods.

I plan to keep using the tidytext package for much of this analysis, but will probably draw upon the syuzhet package for some of the sentiment analysis, especially to use the Stanford methods. And there are still some big changes on the horizon for Deeply Trivial, including more videos and a new look!