Wednesday, April 5, 2017

D is for Descriptive Statistics

You're out for coffee with a friend. You start talking about a new book you just finished. Your friend asks you to tell her about the book. Do you a) pull the book out of your purse and start reading it to her or b) give her a brief summary?

Yes, I know that's a silly question. Obviously it's a).

Kidding. Of course you would give her a summary. No friend - no matter how patient - is going to sit there and listen to you read her the entire book, even a short one. That's simply too much information to answer her question.

That's essentially what statistics are: a way of summarizing large amounts of information, to give people the most important pieces. In fact, many people divide statistics up into two types. The first, which I'll talk about in more detail today, is descriptive statistics. The second is inferential statistics, which I'll talk more about later; its purpose is to explore relationships and find explanations for different effects.

Descriptive statistics, as the name implies, describe your data. That is, they are used to quickly summarize the most important pieces of information about the variables you measured. You've probably encountered many of these statistics.

In fact, because we statisticians love counting and subdividing things, we tend to divide descriptive statistics up into two types. First are measures of central tendency, a fancy way of saying that you're describing the typical case. The measure of central tendency you know best is the average, or the mean as it's called in statistics. You get it by adding all of the scores up and dividing by the number of scores. In our ongoing caffeine study, we would probably report the average score for our sample, as well as by group (experimental or control). That tells us much of what we need to know about our sample.
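If you'd like to see that arithmetic spelled out, here's a minimal Python sketch. The scores and group sizes are made up purely for illustration.

```python
# Hypothetical test scores (percent) for the caffeine study
experimental = [88, 92, 79, 85, 91]
control = [75, 83, 80, 78, 84]

def mean(scores):
    """Add all the scores up and divide by the number of scores."""
    return sum(scores) / len(scores)

print(mean(experimental))            # 87.0
print(mean(control))                 # 80.0
print(mean(experimental + control))  # 83.5 -- the whole sample
```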

The average isn't always the best measure of central tendency to report. What if your data are in the form of categories, like gender? There isn't an average gender, but you could still report the proportions of men and women. And you could tell us the typical gender of your sample by reporting which gender is more frequent. This is called the mode: the most frequently occurring category.
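In Python, a quick way to find the mode is to count each category. The participant list below is invented for the example.

```python
from collections import Counter

# Hypothetical participant genders
genders = ["woman", "man", "woman", "woman", "man", "woman", "man"]

counts = Counter(genders)
print(counts)                       # Counter({'woman': 4, 'man': 3})
print(counts.most_common(1)[0][0])  # 'woman' -- the mode
```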

The last measure of central tendency is the median, which divides your distribution in half. Basically, if you were to line up all your scores in numerical order, it would be the score in the very middle. This measure is best used when your data have outliers, scores that are so far away from the rest of your group that including them in the average would skew your results. Think of it as trying to report the average salary of a group of people when you have one billionaire in the bunch. The average wouldn't make sense because it's not going to represent anyone in your group well. The median reduces the influence of very high or very low scores on the ends.
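Here's that billionaire example as a sketch, with invented salaries. (With an even number of scores there's no single middle score, so the usual convention is to average the middle two.)

```python
# Hypothetical salaries, with one billionaire in the bunch
salaries = [45_000, 52_000, 48_000, 61_000, 55_000, 1_000_000_000]

print(f"{sum(salaries) / len(salaries):,.0f}")  # 166,710,167 -- represents no one

def median(scores):
    """Line the scores up in order and take the middle one
    (averaging the middle two when the count is even)."""
    ordered = sorted(scores)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(f"{median(salaries):,.0f}")  # 53,500 -- a much more typical salary
```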

So let's say the average test score for our caffeine study sample is 85%. This tells you the typical person got a B, but doesn't tell you everything you need to know. It could be that everyone got a B. It could be that half of the people got an A+ and the other half got a C-. Or there could be grades ranging all the way from A+ to F. You want to know how spread out the scores are. For this information, we use measures of variability.

The first measure of variability is the easiest: the range, the difference between the highest and lowest scores. It's also the least useful. If you think back to the billionaire example above, one outlier can make your scores look far more spread out than they actually are.
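Using the same invented salaries, you can see how one outlier blows the range up:

```python
# Same hypothetical salaries as above
salaries = [45_000, 52_000, 48_000, 61_000, 55_000, 1_000_000_000]

print(max(salaries) - min(salaries))  # 999955000 -- the billionaire dominates

without_outlier = [45_000, 52_000, 48_000, 61_000, 55_000]
print(max(without_outlier) - min(without_outlier))  # 16000 -- the real spread
```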

That's why we have two other measures of variability, which I'll talk about together (and you'll see why in a moment). The first is variance. You need variance to compute the other measure, standard deviation. Standard deviation tells you how much your scores deviate from the mean on average. So the first step in computing standard deviation is to take each score in the sample and subtract the mean from it. But you can't just add all those deviations up and divide by the number of scores. The mean is a balancing point; it's designed to give you the approximate center. Some deviations will be positive (higher than the mean), and some will be negative (lower than the mean). Because the mean balances the scores, the deviations will always add up to 0 (give or take some rounding error). So after you compute your deviations, you square them so they're all positive. Those squared deviations are used to compute your variance, the average squared deviation from the mean. Then you take the square root of the variance to get the standard deviation.
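Here are those steps in order, with made-up scores. One caveat worth flagging: this sketch divides by the number of scores, the "average squared deviation" version described above; when estimating a population's variance from a sample, statisticians usually divide by n - 1 instead.

```python
import math

scores = [82, 85, 88, 90, 80]  # hypothetical test scores
m = sum(scores) / len(scores)  # mean: 85.0

deviations = [x - m for x in scores]   # [-3.0, 0.0, 3.0, 5.0, -5.0]
print(sum(deviations))                 # 0.0 -- they balance around the mean

squared = [d ** 2 for d in deviations]
variance = sum(squared) / len(scores)  # average squared deviation: 13.6
sd = math.sqrt(variance)               # back in the original units: ~3.69

print(variance, sd)
```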

Do you see why I talked about them together? Variance is important and it's used in a lot of analyses, but as a measure of variability, people tend to rely on standard deviation. Squared deviations are just a little harder to wrap our heads around.

So now, if I give you the average test score and a standard deviation, you've got a better idea of what the distribution of scores looks like. If the average score is 85 and the standard deviation is 2, you know that this is a moderately easy test for most people (just about everyone got the same score). But if the average score is 85 and the standard deviation is 20, you know the test ranges from super easy to super challenging for different people.
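To make that concrete, here's a quick simulation of two classes with the same average but very different spreads. It treats scores as roughly bell-shaped, an assumption made just for the illustration.

```python
import random

# Two simulated classes: same mean (85), very different standard deviations
tight = [random.gauss(85, 2) for _ in range(1000)]
spread = [random.gauss(85, 20) for _ in range(1000)]

print(round(min(tight)), round(max(tight)))    # scores cluster tightly around 85
print(round(min(spread)), round(max(spread)))  # scores run from F to A+ (and beyond)
```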

I'll talk about our good friend the normal distribution in one of these posts, and then you'll really see how beneficial standard deviation is. Because if something is normally distributed and you know the standard deviation, you can very easily figure out what percentage of scores falls within any given range.
