Wednesday, December 14, 2016

Introducing Type Safety to Statistical Analysis: My Thoughts on the Issue

A friend shared the following article about type safety in statistical computing. In it, John Myles White introduces the programming concept of type safety and argues that it should be applied to statistical analysis software. For those who aren't familiar with the terminology, White explains:
Because every kind of data is ultimately represented as just a sequence of bits, it is always possible in theory to apply functions that only make sense for one interpretation of those bits (say an operation that computes the FFT of an audio file) to a set of bits that was intended to be interpreted in a different way (say a text file representing the Gettysburg Address). The fact that there seems to be no reasonable interpretation of the FFT of a text file that you decide to pretend is an audio file does not preclude the possibility of its computation.

The classic solution to this problem in computer science is to introduce more information about each piece of data in the system. In particular, the bits that represent the Gettysburg Address should be coupled to a type which indicates that those bits represent a string of characters and should therefore be viewed as text rather than as, say, audio or video. In general, programming languages with types can use type information to ensure that users cannot apply methods designed to process one kind of data to another incompatible kind of data that cannot be correctly processed in the same way.
What does this mean when applied to statistical analysis? White argues that types could be used to indicate what kind of data we're dealing with, how it was collected, and so on, to ensure that we don't apply a statistic whose key assumptions we're violating.
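To make the idea concrete, here's a minimal sketch of my own (not White's code - the names Scale, Measured, and independent_t_test are all hypothetical): the data carry a label describing their level of measurement, and the test function refuses to run when that label doesn't match its assumptions. A真 static type system would catch this at compile time; the runtime check below just mimics the idea in Python.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence

from scipy import stats


class Scale(Enum):
    """Level of measurement attached to a column of data."""
    ORDINAL = "ordinal"
    CONTINUOUS = "continuous"


@dataclass
class Measured:
    """A column of values coupled with its level of measurement."""
    values: Sequence[float]
    scale: Scale


def independent_t_test(group_a: Measured, group_b: Measured):
    """Two-sample t-test that refuses data whose declared scale
    violates the test's continuity assumption."""
    for g in (group_a, group_b):
        if g.scale is not Scale.CONTINUOUS:
            raise TypeError("t-test assumes continuous data; got " + g.scale.value)
    return stats.ttest_ind(group_a.values, group_b.values)


# A t-test on data declared ordinal is rejected outright:
likert_a = Measured([3, 4, 4, 5, 2], Scale.ORDINAL)
likert_b = Measured([2, 3, 3, 4, 1], Scale.ORDINAL)
# independent_t_test(likert_a, likert_b)  # raises TypeError
```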

Every statistical analysis is based on assumptions about the data, usually about the data's distribution. For instance, when using a simple independent t-test, which compares two means to determine if they are reliably different from each other, one of the assumptions is that the variable being examined is normally distributed. The standard normal distribution looks like this:
[Figure: the standard normal distribution - a symmetric bell curve centered at zero]
Now, the distribution above is for population values. If we're working with a sample from this population, our distribution will look a little different, depending on how many people are in our sample - so we use the t-distribution, which is actually a set of curves drawn for different sample sizes. (Fun side note: The t-distribution was developed by an employee of Guinness, who published it under the pseudonym "Student" because he didn't think people would respect statistical theory developed for studying beer. Frankly, that just makes me like it more.) You pick the curve based on your total sample size minus 2 (your degrees of freedom). Sample data are messy, and you're never going to have a distribution that perfectly conforms, but as long as it's close, you're okay. If it's not even close, then the results you get from your t-test won't be valid. That test is based on the assumption that your data look more like what you see above than, say, this distribution:
[Figure: a distinctly non-normal distribution]
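Here's roughly how that plays out in code - a sketch of mine using scipy, not anything from White's post: the independent t-test uses total sample size minus 2 degrees of freedom, and a Shapiro-Wilk test is one quick (imperfect) way to check that each sample is at least close to normal before trusting the result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=10, size=30)   # roughly normal sample
group_b = rng.normal(loc=55, scale=10, size=30)

# Degrees of freedom for the independent t-test: total N minus 2.
df = len(group_a) + len(group_b) - 2              # 58

# A quick check that each sample is at least close to normal:
# a large p-value means no evidence of non-normality.
print(stats.shapiro(group_a))
print(stats.shapiro(group_b))

# The t-test itself, evaluated against the t-distribution with df = 58.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```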
Assumptions are important. The thing is, many statistical tests have a lot of them, and these assumptions are routinely violated. Fortunately, researchers often run simulation studies with data generated to have certain issues (e.g., to violate specific assumptions) and find that, for many tests, the results are pretty robust even when assumptions are violated. For instance, when you conduct an analysis of variance (ANOVA, which is used to compare more than two groups), your dependent variable (what you're comparing across your groups) is supposed to be continuous. However, many times I have seen people use ANOVA to analyze dichotomous data. Some purists will scream bloody murder when you do this. Others will say, "Meh, ANOVA is pretty robust." If these assumptions were encoded as types, the computer would have no choice but to enforce them. There would be no wiggle room.
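That "meh, it's robust" answer usually rests on exactly the kind of simulation work I mentioned. Here's a toy version of my own (not drawn from any particular paper): generate dichotomous outcomes for three groups with no true difference, run a one-way ANOVA anyway, and see whether the false-positive rate stays near the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, alpha = 5000, 50, 0.05

false_positives = 0
for _ in range(n_sims):
    # Three groups of dichotomous (0/1) outcomes with the same true proportion,
    # so any "significant" result is a false positive.
    g1, g2, g3 = (rng.binomial(1, 0.5, n_per_group) for _ in range(3))
    _, p = stats.f_oneway(g1, g2, g3)
    if p < alpha:
        false_positives += 1

print(f"Empirical Type I error rate: {false_positives / n_sims:.3f}")
# If this hovers around .05, the ANOVA is holding up despite the violated assumption.
```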

Sure, statistics are misused constantly, and it's comforting to think there might be a way to protect people from Dunning-Kruger statisticians (people who know just enough to be dangerous but not enough to actually know what they're doing). But to push the point further, consider a more current debate, one I'm definitely part of.

As I've mentioned before, I'm a psychometrician by training, which is basically a measurement scientist. I prefer a particular approach to measurement called Rasch, which, to put it very simply, takes ordinary rating-scale data and transforms it so that it is continuous. You see, Raschies like me believe that a simple scale, such as this one:
[Image: a Likert-type rating scale with numbered response options, from strongly disagree to strongly agree]
is ordinal, not continuous. Ordinal means that there is an order to the choices (3 is more than 2) but that the distance between points varies. That is, the difference between neutral and agree might not be equal to the difference between agree and strongly agree. Only through transformation with the Rasch model (or whatever measurement model you prefer) can these values become continuous (equal interval). What this means is, if I were a purist Raschie, I would refuse to conduct a t-test, run a structural equation model, or even compute an average on the values above. I'd need to run my Rasch analyses first. You see, all those tests I just listed assume your data are continuous, not ordinal.
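For the curious, the dichotomous Rasch model puts person ability and item difficulty on a shared logit (log-odds) scale. The sketch below is a big simplification - real Rasch software estimates abilities and difficulties jointly rather than taking them as given - but it shows the model's probability function and why equal steps in raw proportions are not equal steps in logits.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: probability of endorsing (or answering
    correctly) an item, given person ability and item difficulty in logits."""
    return math.exp(ability - difficulty) / (1 + math.exp(ability - difficulty))


def logit(proportion: float) -> float:
    """Map a raw proportion onto the logit scale, where equal intervals
    start to mean equal amounts."""
    return math.log(proportion / (1 - proportion))


# A person half a logit above an item's difficulty endorses it about 62% of the time.
print(rasch_probability(ability=0.5, difficulty=0.0))   # ~0.622

# Equal raw-score gaps are not equal logit gaps:
print(logit(0.6) - logit(0.5))   # ~0.41
print(logit(0.9) - logit(0.8))   # ~0.81
```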

I know many of my statistically-inclined friends would think that logic is nuts, and would say, "What do you mean, 'this is ordinal data'? It's clearly continuous!" It's an issue under debate, and each side has good arguments. How, then, would this information be used in the type safety system White discusses? Whose side of the debate would the programmer choose? Honestly, we're better off with what we have, and we just need to 1) make sure people receive adequate training to know what they're doing statistically, 2) call people out when they clearly don't know what they're doing, and 3) let the scientific method do its job, by encouraging people to share as much information as possible so consumers of the research can evaluate the validity of the claims. You're going to have disagreement. Some will say, "You violated x assumption, your results are meaningless." Others will say, "Meh, that assumption is routinely violated and the test is robust." And that's okay! Disagreements and debates have inspired some really exceptional research.
