Sunday, January 28, 2018

Statistics Sunday: Quantile Regression (Part 1)

About a month ago, I introduced the concept of quantiles - ways of dividing up the distribution of scores into certain percentile groups. The median is one of the most well-known quantiles - it is the score that divides the distribution of scores in half, so that half of the scores in the distribution are below it and half are above. Unlike mean and standard deviation, which are most useful when scores fall along a normal distribution, quantiles can be used to describe just about any distribution (although visualization of your data distribution is still very important).
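To make the idea concrete, here is a minimal Python sketch that computes the median and quartiles for a small set of scores. The scores are made up for illustration (they are not from any study mentioned here), and the "inclusive" interpolation method is just one of several common conventions for computing quantiles.

```python
import statistics

# Hypothetical scores, invented for illustration; note the one extreme value (30).
scores = [2, 3, 3, 4, 5, 6, 7, 9, 12, 30]

# The median splits the scores in half.
median = statistics.median(scores)

# n=4 asks for quartiles: the three cut points that divide the
# scores into four equally sized groups.
quartiles = statistics.quantiles(scores, n=4, method="inclusive")

# The mean, for comparison; it is pulled upward by the extreme score.
mean = statistics.mean(scores)

print("median:", median)
print("quartiles:", quartiles)
print("mean:", mean)
```

The middle quartile is the median, and in this skewed set of scores the mean sits well above it, which is one reason quantiles can describe a non-normal distribution better than the mean alone.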

Regression is a useful technique when you want to predict an outcome based on one or more variables thought to influence that outcome. The best-known type of regression is linear regression, where the relationship between two variables can be plotted on a scatterplot with a straight line. So a linear relationship is one important assumption when using linear regression. You wouldn't want to use it if you believe the relationship between your two variables follows a curve.
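As a rough illustration of what "fitting a straight line" means, the sketch below computes an ordinary least-squares line by hand in Python. The variable names (`hours`, `score`) and the data are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by variation in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: a predictor and an outcome with a roughly linear trend.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

intercept, slope = fit_line(hours, score)

# Use the fitted line to predict the outcome for a new predictor value.
predicted = intercept + slope * 6

print("intercept:", intercept, "slope:", slope)
print("predicted score at 6 hours:", predicted)
```

The slope is the predicted change in the outcome for each one-unit increase in the predictor, which is the trend the prediction line summarizes.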

But linear regression has other assumptions, one of which is that the variance of scores is consistent across the line. For instance, take a look at this scatterplot with a prediction line drawn through it:

This is the scatterplot I used in my scatterplot post, which was created using data from my study of Facebook use. Most of the dots don't fall directly on the line. And that's to be expected, because even strong predictors can't predict an outcome perfectly. When we say the relationship is linear, we are referring to the trend. But scores are going to vary and differ from our predicted score - sometimes because of other variables we didn't measure, and sometimes because of idiosyncrasies in the data. But to use linear regression, we have to meet an assumption called "homoscedasticity." That means that the variance of scores at one point on the line should be equal to the variance of scores at other points along the line. You would want the scatter of the scatterplot to be fairly consistent across the chart. If you have a plot that looks more like this:

you're likely violating an assumption. The scores are more spread out in the upper-right corner than they are in the lower-left corner.
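One informal way to see this in numbers is to fit a line, compute the residuals (each observed score minus its predicted score), and compare their spread at the low and high ends of the predictor. The Python sketch below does that on made-up data deliberately constructed so the scatter widens as x grows; it is an eyeball check under those assumptions, not a formal test of homoscedasticity.

```python
import statistics


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: the trend is y = 2x, but each point is pushed above or
# below it by an amount that grows with x, so the scatter fans out.
xs = list(range(1, 11))
ys = [2 * x + (0.2 * x if x % 2 == 1 else -0.2 * x) for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Compare the variance of the residuals in the lower and upper halves of x.
half = len(xs) // 2
spread_low = statistics.pvariance(residuals[:half])
spread_high = statistics.pvariance(residuals[half:])

print("residual variance, low x: ", spread_low)
print("residual variance, high x:", spread_high)
```

If the two variances were roughly equal, the scatter would look consistent across the chart; a much larger value at one end is the fan shape described above.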

When I was a psychometrician at Houghton Mifflin Harcourt, we regularly worked with data that had a curved relationship between variables. That is, rather than following a straight line, the best way to describe the relationship between two variables was with a curved line - specifically, a growth curve, where the outcome increases rapidly at first and then flattens out. There is a type of regression - polynomial regression - that allows us to model curvilinear relationships. But there was another interesting observation in these growth curves: the variance in scores at the low end was much greater than at the high end. Standard regression wouldn't do. We needed a regression approach that could handle heteroscedasticity.

Enter quantile regression. This approach does not assume that scores are spread evenly around the regression line, because it actually gives you multiple regression lines - one for each quantile you choose. You could use quartiles, for instance: the scores that divide the distribution into evenly sized quarters. Rather than predicting the mean of the outcome, each regression equation predicts the value of the outcome at a particular quantile (such as the 25th percentile, the median, or the 75th percentile) for any given value of the predictor. If the spread of scores changes along the line, the quantile lines will fan out or converge to reflect that. This approach is useful for situations in which the relationship between variables is not quite linear and/or when the variance is not quite consistent.
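To show the idea of "one line per quantile" in miniature, here is a small Python sketch. It picks the line that minimizes the so-called check (or pinball) loss, which penalizes scores above and below the line unequally depending on the chosen quantile. With a single predictor, a best-fitting line can always be found among the lines through two data points, so the sketch simply tries every pair. That brute-force search, the function names, and the data are all assumptions made for illustration; this is not meant to represent how any dedicated statistics package fits these models, and it would be far too slow for real data sets.

```python
from itertools import combinations


def pinball_loss(tau, residuals):
    """Check loss: scores above the line cost tau per unit, scores below cost 1 - tau."""
    return sum(tau * r if r >= 0 else (tau - 1) * r for r in residuals)


def fit_quantile_line(xs, ys, tau):
    """Return (intercept, slope) of a line minimizing the check loss for quantile tau.

    Brute force: tries the line through every pair of points with distinct x values.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as intercept + slope * x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        loss = pinball_loss(tau, residuals)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Hypothetical fan-shaped data: the scatter around y = 2x widens as x grows.
xs = list(range(1, 11))
ys = [2 * x + (0.2 * x if x % 2 == 1 else -0.2 * x) for x in xs]

# One regression line for each quartile of the outcome.
lines = {tau: fit_quantile_line(xs, ys, tau) for tau in (0.25, 0.5, 0.75)}

for tau, (intercept, slope) in lines.items():
    print(f"quantile {tau}: predicted score = {intercept:.2f} + {slope:.2f} * x")
```

Because the scatter in this made-up data widens with x, the 75th percentile line comes out steeper than the 25th percentile line. A single least-squares line would report only the trend in the middle and hide that widening.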

Next week, I'll provide a real example of quantile regression, and show how to conduct this analysis in R, using the quantreg package. Stay tuned!

1 comment:

  1. The assumptions of regression are actually concerning the residuals, not the observations themselves.