Monday, April 24, 2017

T is for T-Test

And now the long-awaited post about the intersection of two things I love: statistics and beer. In fact, as I was working on this post Sunday evening, I was enjoying a Guinness.

I'll get to why I specifically chose Guinness in a moment. But first, let's revisit our old friend, the standard normal distribution:

This curve describes the properties of a normally distributed variable in the population. We can determine the exact proportion of scores that will fall within a certain area of the curve. The thing is, this guy describes population-level data very well, but it doesn't do nearly as well with samples, even though the sample would be drawn from the population reflected in this curve. Think back to the post about population versus sample standard deviation; samples tend to have less variance than populations. The proportions in certain areas of the standard normal distribution are not just the share of people who fall in that range; they are also the probabilities that you will end up with a person falling within that range in your sample. So you have a very high probability of getting someone who falls in the middle, and a very low probability of getting someone who falls in one of the tails.
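If you'd like to see those proportions as actual numbers, here is a minimal sketch using Python's standard library. The function name `proportion_between` and the cutoffs (1 and 2 standard deviations) are my own choices for illustration, not anything special about the curve.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def proportion_between(low, high):
    """Proportion of the population with z-scores between low and high."""
    return std_normal.cdf(high) - std_normal.cdf(low)

# Illustrative cutoffs: the middle of the curve versus one tail.
middle = proportion_between(-1.0, 1.0)
upper_tail = 1.0 - std_normal.cdf(2.0)

print(f"within 1 SD of the mean: {middle:.3f}")
print(f"more than 2 SD above the mean: {upper_tail:.3f}")
```

Roughly 68% of the population sits within one standard deviation of the mean, while only about 2% sits more than two standard deviations above it, which is why a randomly drawn person is far more likely to come from the middle than from a tail.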

Your sample standard deviation is going to be an underestimate of the population standard deviation, so we apply a correction: dividing by N-1 instead of N. The degree of underestimation depends on sample size - the bigger the sample, the better the estimate. So if you drew a distribution for your sample, it would look different depending on the sample size. As sample size increases, the distribution would look more and more like the standard normal distribution. But the areas under different parts of the curve (the probabilities of certain scores) would be different depending on sample size. So you need to use a different curve to determine your p-value depending on your sample size. If you use the standard normal distribution instead, your p-values won't be accurate, and the smaller the sample, the further off they will be.
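To make the "different curve for each sample size" idea concrete, here is a rough sketch that compares the two-tailed p-value for the same test statistic under the standard normal curve and under sample-size-specific curves (the t-distribution, which is where this post is headed). A few assumptions of mine: the statistic 1.96 and the sample sizes are arbitrary, degrees of freedom are taken as N-1 (as in a one-sample test), and the area is found with a simple numerical integration (Simpson's rule) rather than a statistics package. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Height of the t-distribution curve at t, for df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))

def t_two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value from Simpson's rule on [0, |t_stat|] (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # area under the curve between 0 and |t_stat|
    return 1.0 - 2.0 * central_half

def z_two_tailed_p(z_stat):
    """Two-tailed p-value from the standard normal curve."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))

statistic = 1.96  # illustrative value: the usual .05 cutoff on the normal curve
print(f"normal curve:       p = {z_two_tailed_p(statistic):.4f}")
for n in (3, 5, 10, 30, 100):
    p = t_two_tailed_p(statistic, n - 1)
    print(f"N = {n:>3} (df = {n - 1:>2}): p = {p:.4f}")
```

The same statistic that gives p of about .05 on the normal curve gives a much larger p-value with a handful of cases, and the gap shrinks as N grows. That is the inaccuracy you would introduce by using the standard normal distribution with a small sample.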

In the early 1900s, a chemist named William Sealy Gosset was working at the Guinness Brewing Company. Guinness frequently hired scientists and statisticians, and even allowed its technical staff to take sabbaticals to do research - it's like an academic department but with beer. Gosset was dealing with very small samples in his research on the chemical properties of barley, and he needed a statistic (and distribution) that would allow him to conduct statistical analyses with a very small number of cases (sometimes as few as 3). Population-level tests and distributions would not be well-suited for such small samples, so Gosset used his sabbatical to spend some time at University College London, developed the t-test and t-distribution, and published his results to share with the world. (The paper, "The Probable Error of a Mean," appeared in Biometrika in 1908.)

Every person who has taken a statistics course has learned about the t-test, but very few know Gosset's name. Why? Because he published the paper under the pseudonym "Student" and to this day, the t-test is known as Student's t-test (and the family of curves it uses as Student's t-distribution). There are many explanations for the pseudonym, and unfortunately, I don't know which one is accurate. I had always heard the first one, but as I did some digging, I found other stories:
  • Gosset feared people wouldn't respect a statistic created by a brewer, so he hid his identity
  • Guinness didn't allow its staff to publish
  • Guinness did allow staff to publish, but only under a pseudonym
  • Gosset didn't want competitors to know Guinness was using statistics to improve brewing
I'd like to show you a worked example, but since this post is getting long, I'm going to stop here. But I'll have a second post this afternoon showing a t-test in action (if you're into that kind of thing). Stay tuned!
