And then, of course, there's the Cubs, who have been shown in previous statistical analyses to be the unluckiest team in baseball. There was a lot of evidence that this year would be the change Cubs fans could believe in. And last night, they made history by winning the World Series.
You may remember, though, that a few days ago I blogged that the Cubs had a 24% chance of winning it all. So I'm not too surprised that people have messaged me to say, "I guess the statistics were wrong. I mean, they won!" And yes, yes they did, and words can't express how awesome I think that is. But that doesn't mean the prediction was wrong.
Let's use another example, one we have all encountered. The weatherman tells you there is a 10% chance it will rain that day. It rains. You say, "Guess the weatherman was wrong." But technically, he was not. He said there was a 10% chance, which means it's unlikely, but it could happen. In fact, the way forecasters arrive at those figures is through statistical models showing that, given these conditions in other places and on other days, it rained about 10% of the time and did not rain about 90% of the time. So somewhere, sometime, those specific conditions resulted in rain. If there's a non-zero chance of something happening, it could happen.
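To make that concrete, here's a minimal sketch (hypothetical numbers, using only Python's standard library) of what a well-calibrated "10% chance of rain" forecast means: across many days that all got that same forecast, it rains on roughly one day in ten.

```python
import random

random.seed(42)

# Hypothetical sketch: simulate 10,000 days that all received a
# "10% chance of rain" forecast. On each day, rain occurs with
# probability 0.10.
days = 10_000
rainy_days = sum(random.random() < 0.10 for _ in range(days))

print(f"Rained on {rainy_days} of {days} days ({rainy_days / days:.1%})")
# Close to 10% of the days are rainy. The forecast was "right" in the
# aggregate even though, on each of those rainy days, the unlikely
# outcome is the one that happened.
```

On any single rainy day, the forecast looks wrong; over the whole collection of days, it's almost exactly right.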
Whenever we conduct statistical analyses, we are interested in estimating the true population value. I say estimate because we can never really know the true population value, unless we measure every instance in the population. (And even then, there is the strong possibility of measurement error. So really, we need a psychic to divine the true population value.) We accept the fact that there will be some error - that is, some difference between the true value and the value we obtain in our analyses. As long as we conduct our analyses correctly and rigorously, and as long as we base those analyses on data that has been collected correctly and rigorously, we maximize the chance that our estimate will be close to the true population value. But there is a non-zero chance that it is wrong.
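A quick illustration of estimation error, with made-up numbers: below, we pretend we can see an entire population (which in real research we can't), draw a sample from it, and compare the sample estimate to the true population value.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of 100,000 values. In real research we never
# observe all of these; here we generate them so we can compare our
# sample estimate to the "true" population value.
population = [random.gauss(100, 15) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Our study measures only a sample of 500.
sample = random.sample(population, 500)
estimate = statistics.mean(sample)

print(f"True population mean: {true_mean:.2f}")
print(f"Sample estimate:      {estimate:.2f}")
print(f"Error:                {estimate - true_mean:+.2f}")
# The estimate lands close to the truth, but there is almost always
# some difference -- that difference is the error we accept.
```

Run it a few times with different seeds and the estimate bounces around the true value: close when the sampling is done well, but essentially never exactly equal to it.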
Predicting an outcome like rain, or who will win the election, or whether the Cubs will win the World Series requires complex models, because there are so many variables at play. We use a combination of theory, good guesses, and empirical evidence to determine which variables are good (and should be included) and which are irrelevant (and should be dropped). Empirical evidence comes in the form of explained variance. How much of the movement in an outcome (say, between win and loss) can be explained by this variable? Some variables explain more variance than others. The batting averages of Cubs team members explain more variance than, say, what flavor Gatorade they have that day. So we choose the variables that explain large proportions of variance, to try to get the total explained variance as close to 100% as possible. The more variance we can explain, the more accurate our model will be at predicting outcomes.
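Here's a small sketch of that idea using entirely made-up data: a "runs scored" outcome driven mostly by team batting average, plus an irrelevant variable standing in for Gatorade flavor. Computing R-squared (the proportion of variance explained by a simple linear fit) for each shows why one earns a place in the model and the other doesn't.

```python
import random

random.seed(7)

def r_squared(xs, ys):
    """Proportion of variance in ys explained by a simple linear fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy ** 2) / (sxx * syy)

# Made-up data: "runs" depends strongly on batting average (plus noise);
# "gatorade" is just a random number with no relationship to runs.
batting_avg = [random.uniform(0.220, 0.300) for _ in range(200)]
runs = [500 * b + random.gauss(0, 5) for b in batting_avg]
gatorade = [random.random() for _ in runs]

print(f"R^2 for batting average: {r_squared(batting_avg, runs):.2f}")
print(f"R^2 for Gatorade flavor: {r_squared(gatorade, runs):.2f}")
# Batting average explains most of the variance in runs; Gatorade flavor
# explains essentially none, so it would be dropped from the model.
```

The exact numbers depend on the noise we built in, but the pattern is the point: high explained variance argues for keeping a variable, near-zero explained variance argues for dropping it.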
But we can never know which variables will be the absolute best predictors, nor can we say anything about variables we don't or can't measure. So it's probable that the 24% estimate was wrong, and possible that it was very wrong. But simple knowledge of the outcome (they won) is not enough evidence to say the probability estimate was wrong. 24% isn't great odds, but it isn't zero. Unlikely things happen all the time. You might flip a coin and get 10 heads in a row. That's unlikely, but it could (and does) happen. Last night's game may have been one of the best demonstrations of probability in action I can think of.
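Just how unlikely is 10 heads in a row? The arithmetic is (1/2) to the 10th power, about 1 in 1,024. And yet, as the sketch below (a simple simulation) shows, if you attempt enough 10-flip runs, it happens again and again.

```python
import random

random.seed(3)

# The exact probability of 10 heads in a row on a fair coin.
p_ten_heads = 0.5 ** 10
print(f"P(10 heads in a row) = {p_ten_heads:.6f}")  # about 1 in 1,024

# Simulate 100,000 runs of 10 flips each and count how often
# all ten come up heads.
runs = 100_000
ten_head_runs = sum(
    all(random.random() < 0.5 for _ in range(10)) for _ in range(runs)
)
print(f"All-heads runs: {ten_head_runs} out of {runs:,}")
# Roughly 100 of the runs come up all heads -- an "unlikely" event
# that nonetheless happens routinely, given enough chances.
```

A 24% chance is far better odds than 1 in 1,024; an outcome like that should surprise no one.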
At least, until next Tuesday. I'm still nervous about the election. But for now, I'm just going to enjoy the Cubs win a little longer.