Sunday, October 21, 2018

Statistics Sunday: What Fast Food Can Tell Us About a Community and the World

Two statistical indices crossed my inbox in the last week, both of which use fast food restaurants to measure a concept indirectly.

First up, in the wake of recent hurricanes, is the Waffle House Index. As The Economist explains:
Waffle House, a breakfast chain from the American South, is better known for reliability than quality. All its restaurants stay open every hour of every day. After extreme weather, like floods, tornados and hurricanes, Waffle Houses are quick to reopen, even if they can only serve a limited menu. That makes them a remarkably reliable if informal barometer for weather damage.

The index was invented by Craig Fugate, a former director of the Federal Emergency Management Agency (FEMA) in 2004 after a spate of hurricanes battered America’s east coast. “If a Waffle House is closed because there’s a disaster, it’s bad. We call it red. If they’re open but have a limited menu, that’s yellow,” he explained to NPR, America’s public radio network. Fully functioning restaurants mean that the Waffle House Index is shining green.
Next is the Big Mac Index, created by The Economist:
The Big Mac index was invented by The Economist in 1986 as a lighthearted guide to whether currencies are at their “correct” level. It is based on the theory of purchasing-power parity (PPP), the notion that in the long run exchange rates should move towards the rate that would equalise the prices of an identical basket of goods and services (in this case, a burger) in any two countries.
You might remember a discussion of the "basket of goods" in my post on the Consumer Price Index. In fact, the Big Mac Index, which started as a way "to make exchange-rate theory more digestible," has since become a global standard and is used in multiple studies. Now you can use it too, because the data and methodology have been made available on GitHub. R users will be thrilled to know that the code is written in R, but you'll need to use a bit of Python to get at the Jupyter notebook they've put together. Fortunately, they've provided detailed information on installing and setting everything up.
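
To make the purchasing-power parity idea concrete, here's a minimal sketch of the basic calculation in R. The prices and exchange rates below are made up for illustration (they are not taken from The Economist's data), and the column names are my own rather than those used in their repository.

```r
# Illustrative, made-up numbers; NOT from The Economist's dataset
burgers <- data.frame(
  country     = c("Country A", "Country B"),
  local_price = c(20, 3),     # burger price in local currency
  dollar_ex   = c(6.5, 0.8),  # actual exchange rate: local currency per US dollar
  stringsAsFactors = FALSE
)
us_price <- 5                 # hypothetical US burger price in dollars

# Implied PPP rate: the exchange rate at which the burger would cost the same
burgers$implied_ppp <- burgers$local_price / us_price

# Valuation against the dollar: negative = undervalued, positive = overvalued
burgers$valuation <- burgers$implied_ppp / burgers$dollar_ex - 1

burgers
```

A currency whose actual exchange rate is weaker than its implied PPP rate comes out as undervalued against the dollar, which is the case for both hypothetical countries here.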

Sunday, October 14, 2018

Statistics Sunday: Some Psychometric Tricks in R

It's been a long time since I've posted a Statistics Sunday post! Now that I've moved out of my apartment and into my house, I have a bit more time on my hands, but work has been quite busy. Today, I'm preparing for 2 upcoming standard-setting studies by drawing a sample of items from 2 of our exams. So I thought I'd share what I'm up to in order to pass on some of the new psychometric tricks I've learned to help me with this project.

Because I can't share data from our item banks, I'll generate a fake dataset to use in my demonstration. For the exams I'm using for my upcoming standard setting, I want to draw a large sample of items, stratified by both item difficulty (so that I have a range of items across the Rasch difficulties) and item domain (the topic from the exam outline that is assessed by that item). Let's pretend I have an exam with 3 domains, and a bank of 600 items. I can generate that data like this:

domain1 <- data.frame(domain = 1, b = sort(rnorm(200)))
domain2 <- data.frame(domain = 2, b = sort(rnorm(200)))
domain3 <- data.frame(domain = 3, b = sort(rnorm(200)))

The variable domain is the domain label, and b is the item difficulty. I decided to sort that variable within each dataset so I can easily see that it goes across a range of difficulties, both positive and negative. Here are the first and last 6 rows of domain1, as displayed by head() and tail():

##   domain         b
## 1      1 -2.599194
## 2      1 -2.130286
## 3      1 -2.041127
## 4      1 -1.990036
## 5      1 -1.811251
## 6      1 -1.745899
##     domain        b
## 195      1 1.934733
## 196      1 1.953235
## 197      1 2.108284
## 198      1 2.357364
## 199      1 2.384353
## 200      1 2.699168
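
Because rnorm() draws new random values on every run, the numbers above will differ each time. Here's a small standalone sketch of how the simulation could be made reproducible. The seed value is an arbitrary assumption on my part, so it will not reproduce the exact output shown above.

```r
set.seed(42)  # arbitrary seed (an assumption), used only for reproducibility

# One simulated domain: 200 Rasch difficulties, sorted from easiest to hardest
domain1 <- data.frame(domain = 1, b = sort(rnorm(200)))

head(domain1)     # the 6 easiest items
tail(domain1)     # the 6 hardest items
range(domain1$b)  # lowest and highest difficulty in the domain
```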

If I desire, I can easily combine these 3 datasets into 1:

item_difficulties <- rbind(domain1, domain2, domain3)

I can also easily visualize my item difficulties, by domain, as a group of histograms using ggplot2:

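library(dplyr)    # provides the %>% pipe
library(ggplot2)

# One histogram of item difficulty per domain, stacked in a single column
item_difficulties %>%
  ggplot(aes(b)) +
  geom_histogram(show.legend = FALSE) +
  labs(x = "Item Difficulty", y = "Number of Items") +
  facet_wrap(~domain, ncol = 1, scales = "free")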
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, let's say I want to draw 100 items from my item bank, and I want them to be stratified by difficulty and by domain. I'd like my sample to range across the potential item difficulties fairly equally, but I want my sample of items to be weighted by the percentages from the exam outline. That is, let's say I have an outline that says for each exam: 24% of items should come from domain 1, 48% from domain 2, and 28% from domain 3. So I want to draw 24 from domain1, 48 from domain2, and 28 from domain3. Drawing such a random sample is pretty easy, but I also want to make sure I get items that are very easy, very hard, and all the levels in between.
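
The allocation arithmetic is simple, but it's worth checking in code that the rounded counts still add up to the total. This is a small sketch; the object names are my own.

```r
n_total <- 100

# Percentages from the (pretend) exam outline
outline <- c(domain1 = 0.24, domain2 = 0.48, domain3 = 0.28)

# Number of items to draw from each domain
n_per_domain <- round(n_total * outline)

# Guard against rounding leaving us with too many or too few items
stopifnot(sum(n_per_domain) == n_total)

n_per_domain
```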

I'll be honest: I had trouble figuring out the best way to do this with a continuous variable. Instead, I decided to classify items by quartile, then drew an equal number of items from each quartile.

To categorize by quartile, I used the following code:

domain1 <- within(domain1, quartile <- as.integer(cut(b, quantile(b, probs = 0:4/4), include.lowest = TRUE)))

The code uses the quantile command, which you may remember from my post on quantile regression. The nice thing about using quantiles is that I can define them however I wish. So I didn't have to divide my items into quartiles (4 groups); I could have divided them into more or fewer groups as I saw fit. To aid in drawing samples across domains of varying percentages, I'd probably want to pick a number of groups that is a common divisor of the number of items to be drawn from each domain. In this case, I purposefully designed the outline so that 4 was a common divisor of 24, 48, and 28.
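
To show how the same idea extends beyond quartiles, here's a sketch that wraps the cut/quantile step in a small helper. The function quantile_group() is hypothetical (it isn't part of any package), and the simulated difficulties and seed are assumptions for illustration.

```r
# Hypothetical helper: assign each difficulty to one of n_groups
# equal-probability quantile groups (n_groups = 4 gives quartiles)
quantile_group <- function(b, n_groups = 4) {
  breaks <- quantile(b, probs = seq(0, 1, length.out = n_groups + 1))
  as.integer(cut(b, breaks, include.lowest = TRUE))
}

set.seed(1)       # arbitrary seed, for reproducibility only
b <- rnorm(200)   # simulated item difficulties

table(quantile_group(b))                # quartiles
table(quantile_group(b, n_groups = 5))  # quintiles

# With quartiles, each domain's item count splits evenly across the groups
c(24, 48, 28) / 4
```

With a different outline, the number of groups would need to divide each domain's item count evenly, or the per-group sample sizes would have to be adjusted by hand.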

To draw my sample, I'll use the sampling library (which you'll want to install with install.packages("sampling") if you've never done so before), and the strata function.

library(sampling)
domain1_samp <- strata(domain1, "quartile", size = rep(6, 4), method = "srswor")

The resulting data frame has 4 variables: the quartile value (since that was used for stratification), the ID_unit (row number from the original dataset), the probability of being selected (in this case equal across strata, since I requested the same number of items from each equally sized quartile), and the stratum number. So I would want to merge my item difficulties into this dataset, as well as any identifiers I have so that I can pull the correct items. (For the time being, we'll just pretend row number is the identifier, though this is likely not the case for large item banks.)

domain1$ID_unit <- as.numeric(row.names(domain1))
domain1_samp <- domain1_samp %>%
  left_join(domain1, by = "ID_unit")
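
The steps above cover a single domain. As a rough sketch of how the whole 100-item draw could look across all 3 domains, here is a version written in base R only. Note that it swaps in split() and sample() in place of the strata function from the sampling package, and that the simulated bank, the item_id column, and the seed are all assumptions made for illustration.

```r
set.seed(2018)  # arbitrary seed, assumed here for reproducibility

# Simulated bank of 600 items; item_id is a made-up identifier
bank <- data.frame(domain = rep(1:3, each = 200), b = rnorm(600))
bank$item_id <- seq_len(nrow(bank))

# Difficulty quartiles computed separately within each domain
bank$quartile <- ave(bank$b, bank$domain, FUN = function(b) {
  as.integer(cut(b, quantile(b, probs = 0:4 / 4), include.lowest = TRUE))
})

# Items to draw from each quartile of each domain: 24, 48, and 28 divided by 4
per_quartile <- c(`1` = 6, `2` = 12, `3` = 7)

# Split into the 12 domain-by-quartile cells, then sample without replacement
cells <- split(bank, list(bank$domain, bank$quartile))
item_sample <- do.call(rbind, lapply(cells, function(cell) {
  n <- per_quartile[[as.character(cell$domain[1])]]
  cell[sample(nrow(cell), n), ]
}))

# Check the allocation: rows are domains, columns are quartiles
table(item_sample$domain, item_sample$quartile)
```

Because the sampled rows keep their b and item_id columns, no separate join is needed in this version.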

For my upcoming study, my sampling technique is a bit more nuanced, but this gives a nice starting point and introduction to what I'm doing.

Thursday, October 11, 2018

Can Characters of Various TV Shows Afford Their Lifestyles?

I certainly love analysis of pop culture data. So of course I have to share this fun bit of analysis I found on Apartment Therapy today: could the cast of Friends, How I Met Your Mother, or Seinfeld (to name a few) afford the lifestyles portrayed on these shows? Not really:
[T]he folks at Joybird decided to dive in a little bit deeper to see which TV characters could actually afford their lifestyles, and they determined that of the 30 characters analyzed, 60 percent could not afford their digs. Let's just say things are a bit more affordable in Hawkins, Indiana than on the Upper East Side.
Here are a few of their results:

Friday, October 5, 2018

Fall in Chicago: A Haiku

Broken umbrellas
Shoved into lonely trash bins
The wind spares no one

Thursday, October 4, 2018

Resistance is Futile

In yet another instance of science imitating science fiction, scientists figured out how to create a human hive mind:
A team from the University of Washington (UW) and Carnegie Mellon University has developed a system, known as BrainNet, which allows three people to communicate with one another using only the power of their brain, according to a paper published on the pre-print server arXiv.

In the experiments, two participants (the senders) were fitted with electrodes on the scalp to detect and record their own brainwaves—patterns of electrical activity in the brain—using a method known as electroencephalography (EEG). The third participant (the receiver) was fitted with electrodes which enabled them to receive and read brainwaves from the two senders via a technique called transcranial magnetic stimulation (TMS).

The trio were asked to collaborate using brain-to-brain interactions to solve a task that each of them individually would not be able to complete. The task involved a simplified Tetris-style game in which the players had to decide whether or not to rotate a shape by 180 degrees in order to correctly fill a gap in a line at the bottom of the computer screen.

All of the participants watched the game, although the receiver was in charge of executing the action. The catch is that the receiver was not able to see the bottom half of their screen, so they had to rely on information sent by the two senders using only their minds in order to play.

This system is the first successful demonstration of a “multi-person, non-invasive, direct, brain-to-brain interaction for solving a task,” according to the researchers. There is no reason, they argue, that BrainNet could not be expanded to include as many people as desired, opening up a raft of possibilities for the future.
Pretty cool, but...