Sunday, December 24, 2017

Statistics Sunday: Introduction to Quantiles

During April A to Z, I devoted one of my posts to descriptive statistics - ways of summarizing your data, with statistics like the mean (a measure of central tendency) and standard deviation (a measure of variability). And I devoted another post to the histogram, which displays the distribution of a variable.

The histogram is built with frequencies: all of the values for a given variable in your dataset, and counts of the number of cases with that value. For instance, using the Caffeine study file (a randomly generated dataset to go with the fictional study I first discussed here), I could generate frequencies like this:

caffeine<-read.delim("caffeine_study.txt", header=TRUE, sep="\t")
table(caffeine$score)

64 69 71 72 73 75 76 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 94
 1  2  2  2  1  4  4  7  1  3  2  2  3  6  1  4  5  2  2  3  1  1  1

Not a very fancy-looking table, but it lets me see the frequency of each score in my dataset. (Note: There are ways of creating a much prettier table in R, but that's not the point of today's post.) As you can see, there are some possible scores missing, because the frequency of those values is 0. All of this information would be easier to see in a chart, of course. But frequencies are a good place to start to make sure you don't have any weird values.
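If you don't have the data file handy, the frequency table above has enough information to rebuild the scores. Here's a minimal sketch that does that with rep() and then asks table() to list the scores with a frequency of 0 as well. The object names (values, counts, score) are just for illustration, and I'm assuming that every whole number between the lowest and highest observed score is a possible score, which the dataset itself doesn't tell us.

```r
# Rebuild the 60 scores from the frequency table above
values <- c(64, 69, 71, 72, 73, 75, 76, 78, 79, 80, 81, 82,
            83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 94)
counts <- c(1, 2, 2, 2, 1, 4, 4, 7, 1, 3, 2, 2,
            3, 6, 1, 4, 5, 2, 2, 3, 1, 1, 1)
score <- rep(values, times = counts)
length(score)  # 60

# Assumption: every whole number from the minimum to the maximum is a possible score
all_scores <- factor(score, levels = min(score):max(score))
full_table <- table(all_scores)

# Scores that never occur in the data (frequency of 0)
names(full_table)[full_table == 0]
```

Turning the scores into a factor with explicit levels is what makes table() report the empty categories; without it, only scores that actually occur show up.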

But another way to describe data, which is based on frequencies, is in terms of percentiles: cut points that divide the scores into groups, each containing a certain percentage of the scores. The general term for these cut points is quantiles, and you can define them with whatever percentages you want. But those values will be based on the scores that are actually in your dataset. Scores with 0 frequency aren't counted toward the percentiles.

For example, let's say I want to divide my dataset up into 4, so each group encompasses 25% of the scores in my dataset. These are called quartiles. I can get these in R easily:

quantile(caffeine$score, c(.25,.50,.75,1)) 

 25%   50%   75%  100% 
76.00 82.00 86.25 94.00

What this tells me is that 25% of the scores are at or below 76, 50% are at or below 82, 75% are at or below 86.25, and 100% are at or below 94. Now, you might notice that the data are all whole numbers. So how can one of the results include a decimal? That's because the data didn't split perfectly at that point: the 75% cut point falls between two neighboring scores in the sorted data (86 and 87), and R's default method interpolates between them.

In fact, if you're really curious how this works, you can compute quartiles pretty easily by hand. All you'd need to do is write out every score in numerical order, then divide the list into 4 equal parts. The frequency table above would give you a start, but you'd need to repeat each score with a frequency greater than 1 as many times as it occurs. There are 60 scores, so each quartile would include 15 scores.
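For the curious, here's a rough sketch of that by-hand calculation in R. It rebuilds the scores from the frequency table (the object names are mine, just for illustration), splits the sorted scores into 4 groups of 15, and then reproduces the 75% value using the rule behind R's default quantile method (type = 7): find position 1 + (n - 1) * p in the sorted scores and interpolate between its two neighbors. Other quantile types use slightly different rules, so treat this as one way to do it, not the only one.

```r
# Rebuild the 60 scores from the frequency table, in sorted order
values <- c(64, 69, 71, 72, 73, 75, 76, 78, 79, 80, 81, 82,
            83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 94)
counts <- c(1, 2, 2, 2, 1, 4, 4, 7, 1, 3, 2, 2,
            3, 6, 1, 4, 5, 2, 2, 3, 1, 1, 1)
score <- sort(rep(values, times = counts))

# Split the sorted scores into 4 groups of 15
groups <- split(score, rep(1:4, each = 15))
sapply(groups, range)  # lowest and highest score in each group

# Default rule (type = 7): position 1 + (n - 1) * p in the sorted scores
n <- length(score)
pos <- 1 + (n - 1) * 0.75      # 45.25
lower <- score[floor(pos)]     # 45th score: 86
upper <- score[ceiling(pos)]   # 46th score: 87

# Go a quarter of the way from the 45th score to the 46th
by_hand <- lower + (pos - floor(pos)) * (upper - lower)
by_hand                # 86.25
quantile(score, 0.75)  # same value from R
```

The 25% and 50% positions land between two identical scores (76 and 76, 82 and 82), which is why those results come out as whole numbers.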

We deal with percentiles pretty regularly in our daily lives, from a child's height and weight to performance on a test. All of these percentiles tell us the same thing - the percentage of scores at or below a certain (usually your) score.
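You can also go in the other direction: start from a score and ask what percentage of scores fall at or below it. Here's a small sketch using the scores rebuilt from the frequency table; the score of 84 is just an example I picked, and pct_rank is a name I made up.

```r
# Rebuild the 60 scores from the frequency table
values <- c(64, 69, 71, 72, 73, 75, 76, 78, 79, 80, 81, 82,
            83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 94)
counts <- c(1, 2, 2, 2, 1, 4, 4, 7, 1, 3, 2, 2,
            3, 6, 1, 4, 5, 2, 2, 3, 1, 1, 1)
score <- rep(values, times = counts)

# Percentage of scores at or below 84
mean(score <= 84) * 100

# ecdf() returns a function that gives the same proportion for any score
pct_rank <- ecdf(score)
pct_rank(84) * 100
```

Forty of the 60 scores are 84 or lower, so both lines return about 66.7, meaning a score of 84 sits at roughly the 67th percentile of this dataset.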

Quantiles are a great way to summarize data, and they can be especially useful when summarizing data with a wide range. There's a great approach to linear regression that uses quantiles; look for a future blog post about quantile regression!
