Thursday, September 28, 2017

Statistical Sins: Nicolas Cage Movies Are Making People Drown and More Spurious Correlations

As I posted yesterday, I attended an all-day data science conference online. I have about 11 pages of typed notes and a bunch of screenshots I need to weed through, but I'm hoping to post more about the conference, my thoughts and what I learned, in the coming days.

At work, I'm knee-deep in my Content Validation Study. More on that later as well.

In the meantime, for today's (late) Statistical Sins, here's a great demonstration of why correlation does not necessarily imply anything (let alone causation). I can't believe I didn't discover this site before now: Spurious Correlations. Here are some of my favorites:

As I mentioned in a previous post, a correlation - even a large correlation - can be obtained completely by chance. Statistics are based entirely on probabilities, and there's always a probability that we draw the wrong conclusion. In fact, in some situations, that probability may be very high (even higher than the Type I error rate we set in advance).
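To make the "completely by chance" point concrete, here is a minimal simulation sketch. The setup is my own assumption, not anything taken from the Spurious Correlations site: two unrelated series of 10 random values each (think 10 yearly data points), drawn over and over, counting how often the correlation comes out "large." The cutoff of 0.632 is roughly the two-tailed .05 critical value of r for 10 observations; the function names and the seed are arbitrary.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def share_of_large_chance_r(n_points=10, n_trials=2000, threshold=0.632, seed=1):
    """Fraction of trials in which two unrelated series correlate at |r| >= threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # x and y are generated independently, so any correlation is pure chance.
        x = [rng.gauss(0, 1) for _ in range(n_points)]
        y = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson_r(x, y)) >= threshold:
            hits += 1
    return hits / n_trials


print(f"Share of chance correlations with |r| >= 0.632: {share_of_large_chance_r():.3f}")
```

With only 10 data points, a correlation above .6 between two pure-noise variables should turn up in roughly 1 out of every 20 tries, which is exactly what a .05 Type I error rate means.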

This is a possibility we always have to accept; we may conduct a study and find significant results completely by chance. So we never want to take a finding in isolation too seriously. It has to be further studied and replicated. This is why we have the scientific method, which encourages transparency of methods and analysis approach, critique by other scientists, and replication.

But then there are times we just run analyses willy-nilly, looking for a significant finding. When it's done for the purpose of the Spurious Correlations website, it's hilarious. But it's often done in the name of science. As the examples above should demonstrate, we must be very careful when we go fishing for relationships in the data. The analyses we use only tell us the probability of finding a relationship at least that large by chance (or, more specifically, if the null hypothesis is actually true). They don't tell us whether the relationship is real, no matter how small the p-value. When we knowingly cherry-pick findings and run correlations at random, we invite spurious correlations into our scientific undertaking.
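Here is a rough sketch of what a fishing expedition looks like in code. Everything in it is an illustrative assumption on my part: 20 made-up variables with 10 observations each, all pure noise, with every possible pair correlated and anything past the .05 critical value (about 0.632 for 10 observations) flagged as a "finding."

```python
import itertools
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fish_for_correlations(n_vars=20, n_points=10, threshold=0.632, seed=7):
    """Correlate every pair of unrelated noise variables; return all results and the 'hits'."""
    rng = random.Random(seed)
    # Each variable is independent random noise, so no true relationships exist.
    data = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_vars)]
    results = [
        (abs(pearson_r(data[i], data[j])), i, j)
        for i, j in itertools.combinations(range(n_vars), 2)
    ]
    hits = [res for res in results if res[0] >= threshold]
    return results, hits


results, hits = fish_for_correlations()
strongest = max(results)[0]
print(f"{len(results)} pairs tested, {len(hits)} past the cutoff, strongest |r| = {strongest:.2f}")
```

Twenty variables give 190 pairs, so at a 5% false positive rate we should expect somewhere around nine or ten "significant" correlations even though nothing is related to anything. Report only the strongest one and you have a spurious correlation chart.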

This approach violates a certain kind of validity, often called statistical conclusion validity. We maximize this kind of validity when we apply the proper statistical analyses to the data and the question. Abiding by the assumptions of the statistic we apply is up to us. The statistics don't know. We're on the honor system here, as scientists. Applying a correlation or any statistic without any kind of prior justification to examine that relationship violates assumptions of the test.

So I'll admit, as interested as I am in the field of data science, I'm also a bit concerned about the heavy use of exploratory data analysis. I know there are some controls in place to reduce spurious conclusions, such as using separate training and test data, so I'm sure as I find out more about this field, I'll become more comfortable with some of these approaches. More on that as my understanding develops.
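As a toy illustration of why holding out data helps, here is a sketch under my own simplifying assumptions; it is a stand-in for the idea of a training/test split, not a description of any particular data science workflow. Each run creates 20 noise variables, fishes for the strongest correlation in the first half of the observations, then re-checks that same pair in the second half, which played no role in the search.

```python
import itertools
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def discover_then_check(n_vars=20, n_points=20, seed=0):
    """Find the strongest pair in the first half of the data; re-test it in the second half."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_vars)]
    half = n_points // 2
    # "Training" half: go fishing for the pair with the largest |r|.
    i, j = max(
        itertools.combinations(range(n_vars), 2),
        key=lambda p: abs(pearson_r(data[p[0]][:half], data[p[1]][:half])),
    )
    discovery_r = pearson_r(data[i][:half], data[j][:half])
    # "Test" half: observations that were never looked at during the search.
    holdout_r = pearson_r(data[i][half:], data[j][half:])
    return discovery_r, holdout_r


runs = [discover_then_check(seed=s) for s in range(200)]
mean_discovery = sum(abs(d) for d, _ in runs) / len(runs)
mean_holdout = sum(abs(h) for _, h in runs) / len(runs)
print(f"mean |r| where we went fishing: {mean_discovery:.2f}")
print(f"mean |r| for the same pair on held-out data: {mean_holdout:.2f}")
```

Because the data are pure noise, the impressive correlation found during the search should shrink sharply on the held-out half. A real relationship would be expected to hold up much better.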

1 comment:

  1. The causation in these correlations is quite obvious to me.
    1. The more nerdy kids who play arcade video games, the more those kids get interested in computers and seek degrees. Although I suppose there would be a lag there.
    2. The more films Cage puts out, the more people are either so distracted thinking about them or feel impervious to harm that they fall into ponds.
    3. The longer you stand still, the more likely a spider will bite you. So if you are standing still longer to spell longer words...well, clearly spelling bees are dangerous.