Wednesday, July 12, 2017

Statistical Sins: Too Many Tests

Today, I read the eye-catching headline: Just one night of poor sleep can boost Alzheimer’s proteins. Needless to say, I clicked on the article. And then I clicked on the original study. Unfortunately, as interesting as the topic and findings are, there are some serious statistical issues with the study. I'll get to them shortly.

First, to summarize the actual study. Apparently, a protein called amyloid-beta can build up into plaques that lead to brain cell death. The build-up of this plaque has been shown to be an early and necessary step in the development of Alzheimer's disease. So if we can prevent this plaque buildup, we can potentially stave off Alzheimer's. Previous research has shown a relationship between poor sleep and buildup of this protein. This study found that experimentally induced poor sleep - specifically, disrupted deep sleep - increases amyloid-beta levels after just one night.

An interesting study, but there are some key issues. First, some kudos - though the study was very small (22 participants), they did conduct a power analysis. This is unusual, or at least, reporting the results of a power analysis in a journal article is unusual. But they powered the study to detect what they call a moderate correlation, specifically 0.7. That's actually a huge correlation, translating to almost 50% shared variance between the two variables. Most conventions define a moderate correlation as between 0.3 and 0.5. The result of this power analysis was that only 20 participants were needed to detect this "moderate" (read: huge) effect. So their sample size is good then? It would be, if they hadn't had to drop data from 5 participants. In fact, they planned to stop data collection after 20 but "accidentally" collected a couple more. They built in no padding for dropouts or data problems, so that lovely power analysis they conducted didn't actually get them the sample they needed.
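To see why the labeling matters, here's a quick sketch of a correlation power analysis using the Fisher z approximation. This is my own illustration, not the authors' calculation, and it assumes a two-sided test at the conventional alpha = .05 and 80% power:

```python
from math import atanh, ceil

from scipy.stats import norm


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a true correlation r
    with a two-sided test, via the Fisher z transformation."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the test
    z_beta = norm.ppf(power)           # value corresponding to desired power
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


# A "moderate" correlation of 0.7 takes very few participants to detect...
print(n_for_correlation(0.7))
# ...but a truly moderate correlation of 0.3 takes far more than 20.
print(n_for_correlation(0.3))
```

If they had powered the study for a genuinely moderate correlation, they would have needed several times the sample they planned for.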

They collected tons of data from these 22 participants, which is typical for these types of studies. Rather than collecting a small amount of data from many participants, neuroscience studies collect large amounts of data from a small number of participants. This results in 25(!) significance tests, each with an alpha of 0.05. It's not terribly surprising they found significant results. With that much Type I error rate inflation, I'd be surprised if they didn't find something.
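You can put a number on that inflation. Treating the 25 tests as independent (they aren't, since they come from the same participants, so this is only a rough upper bound), the chance of at least one false positive is:

```python
# Familywise Type I error rate for k independent tests, each at level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^k
alpha, k = 0.05, 25
fwer = 1 - (1 - alpha) ** k
print(round(fwer, 3))  # ≈ 0.723
```

Roughly a 72% chance of at least one "significant" result even if every null hypothesis is true.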

Fortunately, they recognize some of the problems with their sample size, as can be seen in their rather long weaknesses section. But they use the old "we found significant results, so it must not have been a problem" argument. The thing is, while small sample sizes can result in low power, they can also lead to erroneous significant results because of the weird things probability can do - variance stabilizes as sample size increases, but in small samples it can be quite volatile. If that high variance happens to show up in the same pattern you hypothesized, you'll get significant results. But that doesn't make them real results.
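A quick simulation makes the volatility concrete. This is my own sketch, assuming the study's usable sample of 17 participants (22 minus the 5 dropped) and a true correlation of exactly zero:

```python
import numpy as np

# Simulate sample correlations when the TRUE correlation is zero,
# at the study's usable sample size (n = 17 after dropouts).
rng = np.random.default_rng(42)
n, sims = 17, 10_000
rs = np.array([
    np.corrcoef(rng.standard_normal(n), rng.standard_normal(n))[0, 1]
    for _ in range(sims)
])
print(f"SD of sample r: {rs.std():.2f}")
print(f"Share of samples with |r| > 0.4: {(np.abs(rs) > 0.4).mean():.1%}")
```

Even with no true relationship at all, sample correlations at n = 17 swing widely, and correlations that look impressively large turn up in a nontrivial share of samples by chance alone.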

Be sure to tune in for Statistics Sunday, where I'll dig into Type I error more fully - and show what Bayes' theorem can teach us about Type I error rate!
