## Sunday, February 4, 2018

### Statistics Sunday: Quantile Regression (Part 2)

Last Sunday, I wrote about quantile regression and gave the basics of how it works. I used the example of when I first analyzed data with quantile regression, working with data following a growth curve. Specifically, I was analyzing data from an individually administered adaptive language survey. The important thing to know about this kind of battery is that each test is one continuous unit, with items ranging from very easy to very difficult. But no one completes every item; the test administrator has the examinee take only as many items as are necessary to determine their ability level. (Anyone administering this battery, or others like it, has specific training in how to do this.)

That ability level, of course, is determined based on which specific items the examinee answers correctly. But ability level is partially correlated with age - the easiest questions at the front of the booklet would generally be administered to the very young, while the most difficult questions at the back are for much older examinees. There's obviously individual variation within age groups, such that an older individual might only receive easier items and tap out before getting to more difficult items, or a younger person might be able to breeze through difficult items.

We were using quantile regression to help determine recommended starting points for a new test added to the battery. Instead of simply using age, we decided to use the score on another test in the battery - one that would be administered before this test and that taps a very similar skillset. So we expected to have subgroups in our data, and those groups were strongly related to the score a given individual could achieve on the test; that is, one could only get a very high score by responding correctly to the difficult items that are beyond most examinees' levels. Factor in that the relationship between our main predictor and our outcome was quadratic rather than linear, plus the heteroskedasticity issue, and we had a situation in which quantile regression was a good approach. (As others have pointed out, there are other analysis methods we could have used. More on those in future posts.)

Today, I want to show you how to conduct quantile regression in R, using the quantreg package. I'll also be using the ggplot2 package, so if you want to follow along in R, you'll want to load (install if necessary) both of those packages. Also, the last time I used the quantreg package, I sometimes got my results in scientific notation, which I find difficult to read. If that's the case for you, I recommend suppressing scientific notation.

library(quantreg)
library(ggplot2)
options(scipen=999)

To make it easy for you to follow along, let's use an example dataset that comes with the quantreg package: engel. This dataset contains two variables: household income and annual food expenditures among Belgian working class families; both variables are in Belgian francs. Let's load this data and plot our two variables.

data(engel)
ggplot(engel, aes(income, foodexp)) + geom_point() + geom_smooth(method = "lm")

The points are clustered together at the lowest levels of income and food expenditure, and spread out more as values increase. (Note: If you look at the quantreg documentation for this dataset, you'll see it recommends displaying the data on a log scale, which is a useful skill in many situations. Look for a future post on that.) Because we're violating a key assumption of least-squares regression, quantile regression is a good option; it makes no assumption of homoskedasticity (constant error variance).
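If you'd like to see that violation directly, one quick check - a sketch using base R's lm(), not part of the quantreg workflow itself - is to fit an ordinary least-squares model and plot its residuals against the fitted values. The fan shape is the heteroskedasticity:

```r
ols <- lm(foodexp ~ income, data = engel)

# Residuals spread out as fitted values grow - the "fan" shape
# that signals non-constant error variance.
plot(fitted(ols), resid(ols),
     xlab = "Fitted food expenditure", ylab = "Residual")
abline(h = 0, lty = 2)
```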

To run the quantile regression, you first want to define what quantiles you'll be using - that is, how many subgroups exist in your outcome variable. For my own data quandary at HMH, I set my number of subgroups as the number of test blocks: groups of items that align with the potential starting points. But you can set those quantiles to be whatever you would like. For instance, let's say you were interested in looking at percentiles by tenths. You could create the following R object, which gives you these percentiles (e.g., 0.1, 0.2, 0.3), that you can call with your quantile regression code:

qs <- 1:9/10
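If the 1:9/10 shorthand reads as cryptic, seq() builds the same vector more explicitly - purely an equivalent alternative, not something quantreg requires:

```r
# Deciles from the 10th through the 90th percentile
qs <- seq(0.1, 0.9, by = 0.1)
qs
# 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
```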

The code for a quantile regression is very simple. You use the function rq, and identify the regression equation in the Outcome~Predictor format. Next, refer to your selected quantiles as tau. Finally, identify your dataset.

qreg <- rq(foodexp~income, tau=qs, data=engel)

View simple results by just calling for your quantile regression object, qreg, or simple results plus confidence intervals with summary(qreg). If you want significance tests (testing whether the constant and slopes are significantly different from 0), use summary(qreg, se = "nid"). This will provide you useful information on how your quantiles differ from each other, such as whether some have nonsignificant slopes or whether slopes become negative for certain groups.
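Spelled out, those three calls look like this (se = "nid" is one of several standard-error methods the quantreg package offers; see ?summary.rq for the others):

```r
qreg                       # coefficients only, one set per tau
summary(qreg)              # coefficients plus confidence intervals
summary(qreg, se = "nid")  # standard errors, t values, and p values
```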

For brevity, I'll only display the simple results, but I'll mention that all constants and slopes are significant:
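If you're following along in R, the simple results are just the coefficient matrix, which you can also pull out directly with base R's coef() extractor (it works on rq fits):

```r
# One column per tau: the intercept and income slope for each quantile
coef(qreg)
```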

As you can see, you have information for a regression equation at each level of tau (your quantiles). We can easily plot this information by using the plot code above, with a simple substitution for the geom_smooth:

ggplot(engel, aes(income, foodexp)) + geom_point() + geom_quantile(quantiles = qs)

The nice thing about defining your quantiles with an object, rather than hard-coding them in your quantile regression call, is that it's easy to swap in different values if you find they would work better. For these data, it may not make sense to use all nine quantiles: the relationship appears to be linear and quite consistent across quantiles. So it may make more sense to use a single regression line, with tau set to the median:

qreg_med <- rq(foodexp~income, tau=0.5, data=engel)
ggplot(engel, aes(income, foodexp)) + geom_point() + geom_quantile(quantiles = 0.5)
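If you'd rather not rely on eyeballing the plot, quantreg's anova() method for rq fits can formally test whether the slopes are equal across quantiles. A sketch, with the taus here chosen arbitrarily for illustration:

```r
# Fit the model at a few separate quantiles, then test whether
# the income slopes differ across them (a Wald-type test).
fit25 <- rq(foodexp ~ income, tau = 0.25, data = engel)
fit50 <- rq(foodexp ~ income, tau = 0.50, data = engel)
fit75 <- rq(foodexp ~ income, tau = 0.75, data = engel)
anova(fit25, fit50, fit75)
```

A significant result would suggest the quantile-specific lines really do diverge, while a nonsignificant one supports collapsing to a single line.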

Hopefully this is enough to get you started with quantile regression. As I've mentioned before, I'm planning to do more analysis of real-world data, so I'm certain quantile regression will show up again. And since March will be here before we know it, it won't be long before Selection Sunday and March Madness. Perhaps some statistical bracketology is in order?