Monday, December 10, 2018

Amazing Demonstration of the Digits of Pi

You probably know that pi is made up of a series of digits that go on forever. Computers have calculated pi out to trillions of digits. Today, I found this great GIF that shows the count of each digit (0 through 9) in the first 1000 digits of pi - pretty cool:


Sunday, November 25, 2018

Statistics Sunday: Introduction to Regular Expressions

In my last Statistics Sunday post, I briefly mentioned the concept of regular expressions, also known as regex (though note that in some contexts, these refer to different things - see here). A regular expression is a text pattern that you ask your program to match. You can use this to look for files with a particular name or extension, or to search a corpus of text for a word or words that match a certain pattern. This concept is used in many programming languages, and R is no exception. In order to use a regular expression, you need a regular expression engine. Fortunately, base R and many R packages come with a regular expression engine, which you call up with a function. In base R, those functions include grep, grepl, regexpr, and so on.
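As a quick illustration of those base functions, here's a minimal sketch using grepl, which returns TRUE or FALSE for each element of a character vector depending on whether it contains a match (the example words are made up):

grepl("stat", c("statistics", "probability", "statistical"))
## [1]  TRUE FALSE  TRUE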

Today, I want to talk about some of the syntax you use to describe different patterns of text characters, which can include letters, digits, white space, and punctuation, starting from the simplest regular expressions. Typing out a specific text string matches only that string. In last week's example, I was working with files that all start with the word "exam." So if I had other files in that folder that didn't follow that naming convention, I could look specifically for the exam files by simply typing exam into the pattern argument of my list.files() function. These files follow exam with a date, formatted as YYYYMMDD, and a letter to distinguish the files from a given day (a to z). Pretty much all of our files end in a, but in the rare instances where we receive two files for a day, the second one ends in b. How can I write a regular expression that matches this pattern?

First, the laziest approach uses the * character, a quantifier that tells the regular expression engine to match zero or more of the preceding character (here, the letter m). Since list.files() matches the pattern anywhere in the file name, this still picks up everything that starts with "exam". For that, I could use this pattern:

list.files(pattern = "exam*")

A less lazy approach that still uses the * quantifier would be to specify the numbers I know will be the same for all files, which would be the year, 2018. In fact, if all of my exam files from every year were in a single folder, I could use that to select just one year:

list.files(pattern = "exam2018*")

Let's say I want all files from October 2018. I could do this:

list.files(pattern = "exam201810*")

That might be as fancy as I need to get for my approach.
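If I wanted to match the full naming convention more strictly, I could spell out the date and letter portions too. Here's a minimal sketch, assuming the names really are exam plus an eight-digit date plus a single lowercase letter (it uses bracketed character classes and a repetition count, which I touch on next):

list.files(pattern = "^exam[0-9]{8}[a-z]")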

We can get more specific by putting information into brackets, such as [a-z] to tell the program to match any lowercase letter in that range or [aA] to tell the program to look for either lowercase or uppercase a. For example, what if I wanted to find every instance of the word "witch" in The Wizard of Oz? Sometimes the text refers to a specific witch (the Good Witch, the Wicked Witch, and so on) and sometimes to the concept of being a witch (usually lowercase, as in "I am a good witch"). I could download The Wizard of Oz through Project Gutenberg (see this post for how to use the gutenbergr package; the ID for The Wizard of Oz is 55), then run some text analysis to look for any instance of Witch or witch:
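First, a minimal sketch of downloading the text, assuming the gutenbergr package is installed (WoO is simply my name for the resulting data frame):

library(gutenbergr)
WoO <- gutenberg_download(55)

Then the search itself: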

witch <- WoO[grep("[Ww]itch", WoO$text),]

Here's what the beginning of the witch dataset looks like:

> witch
# A tibble: 141 x 2
   gutenberg_id text     
         <int> <chr>                                                                   
 1           55 "  12.  The Search for the Wicked Witch"
 2           55 "  23.  Glinda The Good Witch Grants Dorothy's Wish"
 3           55 We are so grateful to you for having killed the Wicked Witch of the
 4           55 killed the Wicked Witch of the East?  Dorothy was an innocent, harmless
 5           55 "\"She was the Wicked Witch of the East, as I said,\" answered the little"
 6           55 " where the Wicked Witch ruled.\""
 7           55 When they saw the Witch of the East was dead the Munchkins sent a swift
 8           55 "messenger to me, and I came at once.  I am the Witch of the North.\""
 9           55 "\"Oh, gracious!\" cried Dorothy.  \"Are you a real witch?\""
10           55 "\"Yes, indeed,\" answered the little woman.  \"But I am a good witch, and"
# ... with 131 more rows
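If all I wanted was a count of matching lines, a quick check does it (141, matching the row count shown above):

nrow(witch)
## [1] 141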

I'm working on putting together a more detailed post (or posts) about regular expressions, including more complex examples and the components of a regular expression, so check back for that soon!

Sunday, November 18, 2018

Statistics Sunday: Reading and Creating a Data Frame with Multiple Text Files

First Statistics Sunday in far too long! It's going to be a short one, but it describes a great trick I learned recently while completing a time study for our exams at work.

To give a bit of background, this time study involves analyzing the time examinees spent on their exam and whether they were able to complete all items. We've done time studies in the past to select the time allowed for each exam, but we revisit them on a cycle to make certain the time allowed is still ample. All of our exams are computer-administered, and we receive daily downloads from our exam provider with data on all exams administered that day.

What that means is, to study a year's worth of exam data, I need to read in and analyze 365(ish - test centers are generally closed for holidays) text files. Fortunately, I found code that reads all files in a particular folder and binds them into a single data frame. First, I'll set the working directory to the location of those files and create a list of all files in that directory:

setwd("Q:/ExamData/2018")
filelist <- list.files()

For the next part, I'll need the data.table library, which you'll want to install if you don't already have it:

library(data.table)
# read each file with fread, then stack the results; idcol records which file each row came from
Exams2018 <- rbindlist(sapply(filelist, fread, simplify = FALSE), use.names = TRUE, idcol = "FileName")

Now I have a data frame with all exam data from 2018, and an additional column that identifies which file a particular case came from.
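As a quick sanity check, this data.table one-liner counts how many rows came from each file (a quick sketch using the FileName column created above):

Exams2018[, .N, by = FileName]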

What if your working directory has more files than you want to read? You can still use this code, with some updates. For instance, if you want only the text files from the working directory, you could add a regular expression to the list.files() call to look only for files with a ".txt" extension:

list.files(pattern = "\\.txt$")

If you're only working with a handful of files, you can also manually create the list to be used in the rbindlist function. Like this:

filelist <- c("file1.txt", "file2.txt", "file3.txt")

That's all for now! Hope everyone has a Happy Thanksgiving!


Friday, November 16, 2018

Great Post on Using Small Sample Sizes to Make Decisions

It's been a busy month with little time for blogging, but I'm planning to get back on track soon. For now, here's a great post on the benefits of using small samples to inform decisions:
When it comes to statistics, there are a lot of misconceptions floating around. Even people who have scientific backgrounds subscribe to some of these common misconceptions. One misconception that affects measurement in virtually every field is the perceived need for a large sample size before you can get useful information from a measurement.

[I]f you can learn something useful using the limited data you have, you’re one step closer to measuring anything you need to measure — and thus making better decisions. In fact, it is in those very situations where you have a lot of uncertainty, that a few samples can reduce uncertainty the most. In other words, if you know almost nothing, almost anything will tell you something.
The article describes two approaches: the rule of five (taking a random sample of 5 to draw conclusions about a population) and the urn of mystery (the idea that even a single case drawn from a population can tell you a surprising amount about the makeup of that population). The rule of five seems best when estimating a continuous value (such as, in the example from the post, the median commute time of workers in a company), while the urn of mystery seems best when trying to determine whether a population is predominantly one of two types (in the post, the example is whether an urn of marbles contains predominantly marbles of a certain color).
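Why five? The population median can only fall outside the range of a random sample of 5 if all five values land on the same side of it, which happens with probability 2 * 0.5^5 = 6.25%, so the sample's minimum and maximum bracket the median about 93.75% of the time. Here's a minimal simulation sketch in R with made-up commute times (all numbers are hypothetical):

set.seed(35)
commutes <- rnorm(100000, mean = 30, sd = 10)    # hypothetical population of commute times (minutes)
true_median <- median(commutes)
covered <- replicate(10000, {
  s <- sample(commutes, 5)                       # rule of five: a random sample of 5
  min(s) <= true_median & true_median <= max(s)  # did the sample range bracket the median?
})
mean(covered)                                    # should come out close to 0.9375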

Obviously, there are times when you need more data. But you're far better off making decisions with some data (even very little) than with none at all.

Sunday, October 21, 2018

Statistics Sunday: What Fast Food Can Tell Us About a Community and the World

Two statistical indices crossed my inbox in the last week, both of which use fast food restaurants to measure a concept indirectly.

First up, in the wake of recent hurricanes, is the Waffle House Index. As The Economist explains:
Waffle House, a breakfast chain from the American South, is better known for reliability than quality. All its restaurants stay open every hour of every day. After extreme weather, like floods, tornados and hurricanes, Waffle Houses are quick to reopen, even if they can only serve a limited menu. That makes them a remarkably reliable if informal barometer for weather damage.

The index was invented by Craig Fugate, a former director of the Federal Emergency Management Agency (FEMA) in 2004 after a spate of hurricanes battered America’s east coast. “If a Waffle House is closed because there’s a disaster, it’s bad. We call it red. If they’re open but have a limited menu, that’s yellow,” he explained to NPR, America’s public radio network. Fully functioning restaurants mean that the Waffle House Index is shining green.
Next is the Big Mac Index, created by The Economist:
The Big Mac index was invented by The Economist in 1986 as a lighthearted guide to whether currencies are at their “correct” level. It is based on the theory of purchasing-power parity (PPP), the notion that in the long run exchange rates should move towards the rate that would equalise the prices of an identical basket of goods and services (in this case, a burger) in any two countries.
You might remember a discussion of the "basket of goods" in my post on the Consumer Price Index. And in fact, the Big Mac Index, which started as a way "to make exchange-rate theory more digestible," has since become a global standard and is used in multiple studies. Now you can use it too, because the data and methodology have been made available on GitHub. R users will be thrilled to know that the code is written in R, but you'll need a bit of Python to get at the Jupyter notebook they've put together. Fortunately, they've provided detailed information on installing and setting everything up.
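The raw index itself is just a ratio. Here's a minimal sketch with hypothetical numbers (not drawn from The Economist's data): the implied PPP rate is the local Big Mac price divided by the US price, and comparing it to the actual exchange rate suggests whether a currency looks over- or undervalued:

local_price <- 3.20   # hypothetical Big Mac price in local currency
us_price <- 5.58      # hypothetical Big Mac price in US dollars
actual_rate <- 0.75   # hypothetical exchange rate, local currency per US dollar
implied_ppp <- local_price / us_price
implied_ppp / actual_rate - 1   # negative means the currency looks undervalued against the dollar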


Sunday, October 14, 2018

Statistics Sunday: Some Psychometric Tricks in R

It's been a long time since I've posted a Statistics Sunday post! Now that I'm moved out of my apartment and into my house, I have a bit more time on my hands, but work has been quite busy. Today, I'm preparing for 2 upcoming standard-setting studies by drawing a sample of items from 2 of our exams. So I thought I'd share what I'm up to in order to pass on some of these new psychometric tricks I've learned to help me with this project.

Because I can't share data from our item banks, I'll generate a fake dataset to use in my demonstration. For the exams I'm using for my upcoming standard setting, I want to draw a large sample of items, stratified by both item difficulty (so that I have a range of items across the Rasch difficulties) and item domain (the topic from the exam outline that is assessed by that item). Let's pretend I have an exam with 3 domains, and a bank of 600 items. I can generate that data like this:

domain1 <- data.frame(domain = 1, b = sort(rnorm(200)))
domain2 <- data.frame(domain = 2, b = sort(rnorm(200)))
domain3 <- data.frame(domain = 3, b = sort(rnorm(200)))

The variable domain is the domain label, and b is the item difficulty. I decided to sort that variable within each dataset so I can easily see that it goes across a range of difficulties, both positive and negative.

head(domain1)
##   domain         b
## 1      1 -2.599194
## 2      1 -2.130286
## 3      1 -2.041127
## 4      1 -1.990036
## 5      1 -1.811251
## 6      1 -1.745899
tail(domain1)
##     domain        b
## 195      1 1.934733
## 196      1 1.953235
## 197      1 2.108284
## 198      1 2.357364
## 199      1 2.384353
## 200      1 2.699168

If I desire, I can easily combine these 3 datasets into 1:

item_difficulties <- rbind(domain1, domain2, domain3)

I can also easily visualize my item difficulties, by domain, as a group of histograms using ggplot2:

library(tidyverse)
item_difficulties %>%
  ggplot(aes(b)) +
  geom_histogram(show.legend = FALSE) +
  labs(x = "Item Difficulty", y = "Number of Items") +
  facet_wrap(~domain, ncol = 1, scales = "free") +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, let's say I want to draw 100 items from my item bank, and I want them to be stratified by difficulty and by domain. I'd like my sample to range across the potential item difficulties fairly equally, but I want my sample of items to be weighted by the percentages from the exam outline. That is, let's say I have an outline that says for each exam: 24% of items should come from domain 1, 48% from domain 2, and 28% from domain 3. So I want to draw 24 from domain1, 48 from domain2, and 28 from domain3. Drawing such a random sample is pretty easy, but I also want to make sure I get items that are very easy, very hard, and all the levels in between.

I'll be honest: I had trouble figuring out the best way to do this with a continuous variable. Instead, I decided to classify items by quartile, then draw an equal number of items from each quartile.

To categorize by quartile, I used the following code:

domain1 <- within(domain1, quartile <- as.integer(cut(b, quantile(b, probs = 0:4/4), include.lowest = TRUE)))

The code uses the quantile command, which you may remember from my post on quantile regression. The nice thing about using quantiles is that I can define them however I wish. So I didn't have to divide my items into quartiles (4 groups); I could have divided them into more or fewer groups as I saw fit. To aid in drawing samples across domains with varying percentages, I'd want to pick a number of groups that divides evenly into each domain's target count. In this case, I purposefully designed the outline so that 4 divides all three targets: 24, 48, and 28 items work out to 6, 12, and 7 per quartile.

To draw my sample, I'll use the sampling library (which you'll want to install with install.packages("sampling") if you've never done so before), and the strata function.

library(sampling)
domain1_samp <- strata(domain1, "quartile", size = rep(6, 4), method = "srswor")

The resulting data frame has 4 variables - the quartile value (since that was used for stratification), the ID_unit (row number from the original dataset), probability of being selected (in this case equal, since I requested equally-sized strata), and stratum number. So I would want to merge my item difficulties into this dataset, as well as any identifiers I have so that I can pull the correct items. (For the time being, we'll just pretend row number is the identifier, though this is likely not the case for large item banks.)

domain1$ID_unit <- as.numeric(row.names(domain1))
domain1_samp <- domain1_samp %>%
  left_join(domain1, by = "ID_unit")
qplot(domain1_samp$b)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
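To fill out the rest of the sample according to the outline, the same two steps could be repeated for the other domains. Here's a sketch, assuming per-quartile draws of 12 and 7 to hit the 48- and 28-item targets (the object names are mine):

domain2 <- within(domain2, quartile <- as.integer(cut(b, quantile(b, probs = 0:4/4), include.lowest = TRUE)))
domain3 <- within(domain3, quartile <- as.integer(cut(b, quantile(b, probs = 0:4/4), include.lowest = TRUE)))
domain2_samp <- strata(domain2, "quartile", size = rep(12, 4), method = "srswor")
domain3_samp <- strata(domain3, "quartile", size = rep(7, 4), method = "srswor")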

For my upcoming study, my sampling technique is a bit more nuanced, but this gives a nice starting point and introduction to what I'm doing.

Thursday, October 11, 2018

Can Characters of Various TV Shows Afford Their Lifestyles?

I certainly love analysis of pop culture data. So of course I have to share this fun bit of analysis I found on Apartment Therapy today: could the cast of Friends, How I Met Your Mother, or Seinfeld (to name a few) afford the lifestyles portrayed on these shows? Not really:
[T]he folks at Joybird decided to dive in a little bit deeper to see which TV characters could actually afford their lifestyles, and they determined that of the 30 characters analyzed, 60 percent could not afford their digs. Let's just say things are a bit more affordable in Hawkins, Indiana than on the Upper East Side.
Here are a few of their results: