## Thursday, April 6, 2017

### E is for Error

Yesterday, I talked about descriptive statistics, including measures of central tendency (such as the mean) and measures of variability (such as standard deviation). Let's say I give you the mean and standard deviation for the test I used in my caffeine study. If I point at a particular person in that sample and ask you to guess what score they got on the test, your best option would be the mean. That's what we use as the "typical" score. It's probably going to be the wrong guess, but in the absence of any additional information, it's your best bet.
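Here's a quick sketch of why the mean is the safest guess. The scores are made up for illustration (they are not data from the caffeine study), and "best" here means the guess with the smallest total squared miss across the whole sample, which is one common way to define it. The helper name `total_squared_miss` is just something I invented for this example.

```python
from statistics import mean

# Hypothetical test scores, invented for illustration only
scores = [72, 85, 90, 64, 78, 88, 95, 70]

def total_squared_miss(guess, values):
    """Add up the squared distance between one fixed guess and every score."""
    return sum((v - guess) ** 2 for v in values)

# Guess the mean for everyone in the sample
best_guess = mean(scores)
miss_at_mean = total_squared_miss(best_guess, scores)

# Try a few other guesses for comparison
miss_at_other = {g: total_squared_miss(g, scores) for g in (60, 75, 90)}

print("mean:", best_guess, "total squared miss:", miss_at_mean)
for guess, miss in miss_at_other.items():
    print("guess:", guess, "total squared miss:", miss)
```

Whichever alternative guess you try, its total squared miss comes out larger than the one you get from guessing the mean.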

Scores vary, of course. It's unlikely that two people in your sample will get the same score, let alone everyone in the sample. The point of statistics is to find variables that explain that variance. The mean is the representative score for any individual person in the sample. So unless and until we find a variable that explains why scores differ from each other, we have to treat any variation in scores (any deviation from the mean, our typical score) as a mistake. We call that error.

Any variation is error until we can explain it.
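To make that concrete, here's a minimal sketch using hypothetical scores (again, invented numbers, not real study data). With no explanatory variable in hand, each person's "error" is simply how far their score sits from the mean.

```python
from statistics import mean

# Hypothetical test scores, invented for illustration only
scores = [72, 85, 90, 64, 78, 88, 95, 70]

typical = mean(scores)

# With nothing to explain the differences, every deviation from the mean is error
errors = [s - typical for s in scores]

# Positive and negative deviations cancel out, so we square them before adding up
total_error = sum(e ** 2 for e in errors)

for score, err in zip(scores, errors):
    print("score:", score, "error:", err)
print("sum of errors:", sum(errors))
print("sum of squared errors:", total_error)
```

The raw errors balance out around the mean, which is why the squared version is the more useful running total of unexplained variation.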

Obviously, this isn't actually true. People are different from each other, with different abilities, and there are logical explanations for why one person might get a perfect score on a test while another might get 50%. And it isn't just underlying ability that might affect how a person performs on a test. A host of environmental factors might also influence their scores. But statistically, we have to think of any variation in scores that we can't explain as error. Our inferential statistics are used to explain that variation - to take some of what we call error and relabel it as systematic, as having a cause. We can't measure everything, so in any study, we're going to have leftover variance that we simply call error.

So statistics is really about taking variation in scores and moving as much of it as we can out of the error column and into the systematic (explained) variance column. If I can show that some of the variance is due to whether a person was allowed to drink caffeinated coffee before taking a test, I have found some evidence to support my hypothesis and have been able to move some of the variation into the explained column.
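As a rough preview of the two-column idea, here's a sketch with hypothetical scores split by a made-up caffeine condition. It assumes one common way of doing the bookkeeping: variation of the group means around the overall mean goes in the explained column, and variation of individuals around their own group mean stays in the error column. It stops short of any actual significance test.

```python
from statistics import mean

# Hypothetical scores by condition, invented for illustration only
groups = {
    "caffeine": [85, 90, 88, 95],
    "no_caffeine": [72, 64, 78, 70],
}

all_scores = [s for g in groups.values() for s in g]
grand_mean = mean(all_scores)

# Total variation: every score's squared distance from the overall mean
total = sum((s - grand_mean) ** 2 for s in all_scores)

# Explained column: how far each group's mean sits from the overall mean,
# counted once per person in that group
explained = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

# Error column: how far each person sits from their own group's mean
error = sum((s - mean(g)) ** 2 for g in groups.values() for s in g)

share_explained = explained / total

print("total:", total)
print("explained:", explained)
print("error:", error)
print("share explained:", share_explained)
```

The explained and error columns add back up to the total, so anything the caffeine condition accounts for is variation that no longer has to be called error.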

How exactly do we go about partitioning out this variance? Stay tuned! I'll get to that in a future post.