Wednesday, August 2, 2017

Statistical Sins: Regression to the Mean

As I was looking for something to blog about for today's Statistical Sins post, I happened upon an article about a new study examining bee behaviors and potential genetic links to autism-like disorders. The researchers exposed bees to two social situations, one that should elicit an aggressive response (an outsider bee in the hive) and one that should elicit a parental response (appearance of a queen larva). Though most bees reacted to at least one of the situations, 14% had no reaction at all.

They then went on to examine the genetic profiles of the bees that responded to one of the situations (either "guards" or "nurses") and the bees that didn't respond at all. Specifically, they examined the "mushroom bodies" - a part of an insect's brain involved in integrating sensory information as well as social behavior.

They compiled a list of all genes that were expressed differently between the responders and the non-responders (which they call DEGs, differentially expressed genes). They then used principal components analysis, which looks for common patterns among variables to identify groups that "hang together" (that is, they appear to share an underlying cause), and further examined the 50 DEGs that loaded most highly in the analysis (those that hang together most strongly). Finally, they compared these genes to a list of genes implicated in autism-spectrum disorder in humans and found significant overlap.

I should emphasize that I'm not an entomologist, geneticist, or clinical psychologist with expertise in autism, so this research is outside my area. But it seemed like a great way to introduce a concept that could have affected their results: regression to the mean. Interestingly, this statistical concept was first introduced by Francis Galton, who used genetic information, mainly to try to explain a contradiction in a theory from his cousin, whom you might have heard of: Charles Darwin.

Darwin's theory of natural selection explains why evolution occurs. Within a species, we have genetic mutations. Some mutations are good and are "selected," meaning they increase the probability of survival, so the organisms that carry them are more likely to pass those genes on to their offspring. Over time, new species emerge when a good trait is selected for again and again, to the point that organisms with this trait are very different from organisms without it. So genetic variation is good. But, Galton observed, if organisms keep changing through genetic variation, how do stable species emerge? Why don't they just keep changing?

So Galton looked at data from a large sample of parents and their children. He averaged the two parents' heights together, then compared those meta-parent heights to the heights of their children. Height, like many variables, is normally distributed, so you can probably imagine what the distribution looked like. He found that short meta-parents tended to have taller children (closer to the mean) and tall meta-parents tended to have shorter children (again, closer to the mean). So there was a trend for parents who were in the extremes to have children closer to the middle of the distribution.
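To make that pattern concrete, here is a minimal simulation sketch. It does not use Galton's data: the average height (68 inches), the SD (2.5 inches), and the parent-child correlation (0.5) are made-up values chosen for illustration, and the function names are my own.

```python
import random
import statistics

def simulate_families(n=20000, mean=68.0, sd=2.5, r=0.5, seed=1):
    """Simulate (meta-parent, child) height pairs with correlation r.

    All numbers here are hypothetical, not Galton's data.
    """
    rng = random.Random(seed)
    noise_sd = sd * (1 - r ** 2) ** 0.5  # keeps the child SD equal to the parent SD
    pairs = []
    for _ in range(n):
        parent = rng.gauss(mean, sd)
        child = mean + r * (parent - mean) + rng.gauss(0, noise_sd)
        pairs.append((parent, child))
    return pairs

def group_means(pairs, keep):
    """Average parent and child height among pairs whose parent passes keep()."""
    chosen = [(p, c) for p, c in pairs if keep(p)]
    return (statistics.mean(p for p, _ in chosen),
            statistics.mean(c for _, c in chosen))

pairs = simulate_families()
# "Tall" and "short" mean more than 1 SD away from the assumed average.
tall_parent, tall_child = group_means(pairs, lambda p: p > 68.0 + 2.5)
short_parent, short_child = group_means(pairs, lambda p: p < 68.0 - 2.5)
print(f"tall parents:  {tall_parent:.1f} in, their children: {tall_child:.1f} in")
print(f"short parents: {short_parent:.1f} in, their children: {short_child:.1f} in")
```

Under these assumptions, the children of tall meta-parents still come out taller than average, just less extreme than their parents, and the mirror image holds for the children of short meta-parents.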

Remember that the mean is the expected value. In a normally distributed variable, it is also the most frequent value (mode) and the value that divides the distribution in half (median). About 68% of people will fall within 1 SD of the mean, so the majority of people will fall in or around the middle. The probability that a child, regardless of how tall his or her parents are, will fall around the average is high. And people who fall in the extremes are unusual, potentially even a fluke - a fluke that is unlikely to be repeated. There just aren't that many of them, so even if they do pass on their extreme height (very tall or very short), that's going to result in a small group of extreme children. There are a lot more average people, who will generally produce average children, but may produce the occasional extreme case. So over time, we see a distribution of height that remains pretty stable.
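If you want to check the 68% figure yourself, the share of a normal distribution that falls within k standard deviations of the mean is erf(k / sqrt(2)). A quick sketch (the helper name share_within is mine):

```python
import math

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```

So fewer than 1 in 20 people fall more than 2 SD from the mean, which is why the extremes contribute so few cases.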

This principle, which Galton referred to as regression, has been observed in many situations. Basically, this phenomenon occurs when a case is selected because of its extreme value on some variable. But because values near the mean are much more likely, the next time you measure that variable, the value will probably be closer to the mean. Extreme values are unlikely, and thus, less likely to occur again. This concept has been used to explain the "Sports Illustrated Effect" - when athletes at the top of their game see a decline in performance after being featured on the cover of Sports Illustrated. Since sports performance is very much driven by probability, it makes sense that people selected because they fall in the extreme of the sports performance distribution will move to a less extreme score.

How does this concept relate here? The researchers showed a tendency to choose cases or variables because they fell in the extremes. They picked the bees as nonresponders for falling in the extremes, then picked genes to further examine because they showed stronger relationships, and so on. To be fair, they also included less extreme bee cases, and later on, did some analysis of the genes that did not emerge as strong contenders. Still, you have to be careful when selecting a case simply because it is extreme, especially if you plan on doing something to the case (e.g., an intervention) to make it less extreme - like selecting low performers on an achievement test to receive extra training. Chances are, their scores would have increased either way.
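The achievement test example is easy to simulate. The sketch below is hypothetical from top to bottom: it assumes each student has a stable true ability (mean 100, SD 10) and that each test score adds independent luck (SD 10). Nobody receives any training between the two tests.

```python
import random
import statistics

rng = random.Random(42)

# Each hypothetical student has a stable true ability; each test adds independent luck.
abilities = [rng.gauss(100, 10) for _ in range(10000)]
test1 = [a + rng.gauss(0, 10) for a in abilities]
test2 = [a + rng.gauss(0, 10) for a in abilities]  # no training in between

# Select the lowest-scoring 10% on the first test.
cutoff = sorted(test1)[len(test1) // 10]
low = [i for i, score in enumerate(test1) if score < cutoff]

before = statistics.mean(test1[i] for i in low)
after = statistics.mean(test2[i] for i in low)
print(f"selected students, test 1: {before:.1f}")
print(f"same students, test 2:     {after:.1f}")
```

In this setup the selected group scores noticeably higher on the second test while still landing below the overall average. If those students had been given extra training, it would be tempting to credit the training for a gain that luck alone produces, which is why a comparison group of equally low scorers who get no intervention matters.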
