## Friday, August 25, 2017

### Statistical Sins Late Edition: Three Things We Love

The eclipse was amazing, but after missing 2 days of work this week, playing catch-up Wednesday, and attending an all-day meeting yesterday, I was unable to get myself together and write a Statistical Sins post for Wednesday (or even yesterday). (I did, however, get around to posting a Great Minds in Statistics post on the amazing F.N. David. I've had that post scheduled for a while now.)

I'll admit, part of the problem, compounded by the lack of time, was not knowing what to write about. But a story that is making the rounds again, and made its way into my news feed, is a study from the New England Journal of Medicine on the relationship between a country's overall chocolate consumption and its number of Nobel Prize laureates.

Apparently the correlation is a highly significant 0.791. While the authors get that this doesn't imply a causal relationship, they sort of miss the boat here:

> Of course, a correlation between X and Y does not prove causation but indicates that either X influences Y, Y influences X, or X and Y are influenced by a common underlying mechanism.

So that's three possibilities: A causes B, B causes A, or C causes both A and B (what is known as the third variable problem). But they miss a fourth possibility: A and B are two random variables that, by chance alone, have a significant relationship. There might not be a meaningful C variable.

To clarify, when I say "random variable," I mean a variable that is allowed to vary naturally - we're not actively introducing any interventions to increase the number of Nobel laureates in any country (which in light of this study would probably involve airlifting chocolate in). And when we allow variables to vary naturally, we'll sometimes find relationships between them. That could occur just by chance. In my correlation post linked above, I generated 20 random samples of 30 pairs of variables, and found 3 significant correlations (all close to r = 0.4) by chance alone.
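If you want to try this kind of exercise yourself, here is a minimal sketch in Python. It is not the code from my earlier post: the function names are made up for illustration, I'm assuming the variables are independent standard normal draws, and the critical t value is taken from a standard table for 28 degrees of freedom. Because the variables are generated independently, any correlation that clears the cutoff is a false positive by construction.

```python
import math
import random

N_PAIRS = 30    # pairs per sample, as described above
T_CRIT = 2.048  # two-tailed critical t, alpha = .05, df = 28 (table value)
# Convert the critical t into the smallest |r| that counts as significant.
R_CRIT = T_CRIT / math.sqrt(T_CRIT ** 2 + (N_PAIRS - 2))  # about 0.361


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def chance_correlations(n_samples=20, seed=None):
    """Correlate n_samples pairs of unrelated variables and return every r."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_samples):
        # Assumption: both variables are independent standard normal draws.
        xs = [rng.gauss(0, 1) for _ in range(N_PAIRS)]
        ys = [rng.gauss(0, 1) for _ in range(N_PAIRS)]
        rs.append(pearson_r(xs, ys))
    return rs


if __name__ == "__main__":
    rs = chance_correlations()
    hits = [round(r, 2) for r in rs if abs(r) >= R_CRIT]
    print(f"{len(hits)} of {len(rs)} correlations significant by chance: {hits}")
```

The count changes from run to run. Over many runs, roughly 1 in 20 correlations should clear the cutoff, and with only 30 pairs, anything that does clear it has to be at least about 0.36 in absolute value.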

Sure, this is a significant relationship - a highly significant one at that - but there isn't some level of significance where a relationship suddenly goes from being potentially due to chance alone to being absolutely systematic or real. To argue that a relationship of 0.7 can't be due to chance makes no more sense than saying a relationship of 0.1 can't be due to chance. There's a chance I could create two random variables and have them correlate at 1.0, a perfect relationship. It's a small chance, but the chance is never 0. There's no magic cutoff value where we throw out the possibility of Type I error. And the p-value generated by an analysis is not the chance that a result is spurious; it's the chance we would find a relationship at least that large by chance alone, given what we know about the potential distribution of the variables of interest - and what we know about the distribution comes from the very sample data we're speculating about. It's possible the distributions look completely different from what we expect, making the probability of Type I error higher than we realize. (In fact, see this post on Bayes' theorem about how the false positive rate is likely much higher than alpha.)
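To make that last point concrete, here is a rough sketch of the Bayes' theorem arithmetic. The function name and every number in it are hypothetical, chosen only for illustration: it asks what share of significant results are false positives, given alpha, statistical power, and an assumed proportion of tested relationships that are actually real.

```python
def share_of_false_positives(alpha, power, prior_real):
    """P(no real relationship | significant result), via Bayes' theorem.

    alpha      -- chance of a significant result when nothing is there
    power      -- chance of a significant result when something is there
    prior_real -- assumed share of tested relationships that are real
    """
    false_positives = alpha * (1 - prior_real)
    true_positives = power * prior_real
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Hypothetical inputs: alpha = .05, power = .80, and two different guesses
    # about how many tested relationships are real.
    for prior_real in (0.5, 0.1):
        share = share_of_false_positives(0.05, 0.80, prior_real)
        print(f"prior = {prior_real:.1f}: {share:.0%} of significant results are false")
```

Under these made-up inputs, if only 1 in 10 of the relationships we test is real, about 36% of our significant results are false positives, even though alpha is still .05.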

It occurs to me that there are three consumables that we love so much, we keep looking for data that will justify our love of them. Those three things are coffee, chocolate, and bacon.

And the greatest of these is bacon.

It's true though. When we're not publishing stories about how chocolate or coffee benefits your health, we're attempting to disprove those evil scientists who try to convince us bacon is harmful.

Loving these things likely motivates us to study them. And sometimes that involves looking for a relationship - any relationship - with a positive outcome. Observational studies can very easily uncover spurious relationships. Increasing the distance (e.g., looking at country-level data) between the presumed cause (e.g., consumption of chocolate) and the outcome (e.g., Nobel Prize) can drastically increase the probability that we find a false positive.

I bet you can find many significant relationships - even highly significant relationships - when looking at two variables from the altitude of country-level data. More complicated relationships get washed out when viewing the relationship so far away from individual-level data. In fact, when we remove variance - either by aggregating data across many people (as occurs in country-level data) or by recoding continuous variables into dichotomies - we may miss confounds or other variables that provide a much better explanation of the findings. We miss the signs that we're barking up the wrong tree.
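Here is a small simulation of what aggregation can do. Everything in it is invented for illustration: 50 hypothetical countries of 200 people each, two variables that share nothing except a weak country-level factor, and a lot of individual-level noise. It is a sketch under those assumptions, not an analysis of the chocolate data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_countries=50, n_people=200, noise_sd=5.0, seed=1):
    """Return (individual-level r, country-level r) for simulated data."""
    rng = random.Random(seed)
    person_x, person_y = [], []
    country_x, country_y = [], []
    for _ in range(n_countries):
        # Hypothetical country-level factor (the "C" variable) shared by x and y.
        shared = rng.gauss(0, 1)
        xs = [shared + rng.gauss(0, noise_sd) for _ in range(n_people)]
        ys = [shared + rng.gauss(0, noise_sd) for _ in range(n_people)]
        person_x.extend(xs)
        person_y.extend(ys)
        # Aggregating to country means throws away the individual variance.
        country_x.append(sum(xs) / n_people)
        country_y.append(sum(ys) / n_people)
    return pearson_r(person_x, person_y), pearson_r(country_x, country_y)


if __name__ == "__main__":
    r_person, r_country = simulate()
    print(f"individual-level r = {r_person:.2f}")
    print(f"country-level r    = {r_country:.2f}")
```

With these settings, the individual-level correlation sits near zero while the country-level correlation is strong. Averaging strips out the individual variance and leaves little besides the shared country factor, which is exactly the kind of third variable that country-level data can't distinguish from a direct relationship.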