Sunday, February 18, 2018

Statistics Sunday: What Are Residuals?

Recently, I wrote two posts about quantile regression - find them here and here. In the first post, I talked about an assumption of linear regression: homoscedasticity, "the variance of scores at one point on the line should be equal to the variance of scores at other points along the line."


What I was talking about here has to do with the difference between the predicted value (which falls on the regression line) and the observed value (the points, which may or may not fall along the line). This difference is called a residual:

Residual = Observed - Predicted

There is one residual for each case used in your regression, and both the sum and mean of the residuals will be 0. In standard linear regression, known as ordinary least squares regression, the goal is to find the line that keeps the residuals as small as possible - but some residuals will be positive and some will be negative. Remember, when you're dealing with deviations from some measure of central tendency, you need to square them, or else they add up to 0. Your regression line, then, is the one that minimizes the sum of these squared residuals - hence, least squares.
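Here's a minimal R sketch of those two facts, using simulated data I made up just for illustration (the x and y below aren't from any real dataset):

set.seed(42)
x <- rnorm(100)
y <- 2 * x + 5 + rnorm(100)

fit <- lm(y ~ x)

res <- y - fitted(fit)   # Residual = Observed - Predicted
sum(res)                 # effectively 0 (up to rounding error)
mean(res)                # also effectively 0
sum(res^2)               # the quantity ordinary least squares minimizes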

But, you might ask, why does variance need to be consistent across the regression line? Sure, you want a line that minimizes your residuals, but obviously, your residuals are going to vary to some degree. 

When I first introduced linear regression to you, I gave this basic linear equation you've probably encountered before:

y = bx + a

where b is the slope (measure of the effect of x on y), and a is the constant (the predicted value of y when x=0). But I left out one more term you'd find in linear regression - the error term:

y = bx + a + e

For this error term to be valid, there has to be consistent error - you don't want the error to be wildly different for some points than others. Otherwise, you can't model it (not with a single term, anyway).
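To see what consistent (and inconsistent) error looks like, here's a quick simulation sketch - the variable names and values are mine, purely for illustration, not part of the example used later in this post:

library(ggplot2)

# Error term with the same spread everywhere (homoscedastic)
set.seed(123)
x <- runif(200, 0, 10)
y_constant <- 3 * x + 2 + rnorm(200, sd = 2)

# Error term that grows with x (heteroscedastic)
y_growing <- 3 * x + 2 + rnorm(200, sd = 0.5 * x)

# Plotting the residuals from each fit makes the difference obvious:
# the first cloud has roughly even spread, the second fans out
ggplot(data.frame(x, resid = resid(lm(y_constant ~ x))), aes(x, resid)) + geom_point()
ggplot(data.frame(x, resid = resid(lm(y_growing ~ x))), aes(x, resid)) + geom_point()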

When you conduct a linear regression, you should plot your residuals. This is easy to do in R. Let's use the same dataset I used for the quantile regression example, and we'll conduct a linear regression with these data (even though we're violating a key assumption - this is just to demonstrate the procedure). Specify the linear regression with lm, giving it a formula of the form outcome~predictor(s):

library(quantreg)       # provides the engel dataset
library(ggplot2)        # for plotting the residuals later
options(scipen = 999)   # turn off scientific notation in the output
model <- lm(foodexp ~ income, data = engel)
summary(model)

That gives me the following output:


I can reference my residuals with resid(model_name):

ggplot(engel, aes(foodexp, resid(model))) + geom_point()


You'll notice a lot of the residuals cluster around 0, meaning the predicted values are close to the observed values. But for higher values of y, the residuals are also larger, so this plot once again demonstrates that ordinary least squares regression isn't suited for this dataset. Yet another reason why it's important to look at your residuals.
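A common companion diagnostic (my addition here, not part of the original plot) is to put the fitted values, rather than the observed outcome, on the x-axis - and base R will generate a version of this plot for you automatically:

# Residuals vs. fitted values - the fan shape shows up here too
ggplot(engel, aes(fitted(model), resid(model))) + geom_point()

# Base R's built-in diagnostics (residuals vs. fitted, Q-Q plot, and more)
plot(model)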

I'm out of town at the moment, attending the Association of Test Publishers Innovations in Testing meeting. Hopefully I'll have some things to share from the conference over the next couple days. At the very least, it's a lot warmer here than it is in Chicago.
