## Sunday, June 4, 2017

### Statistics Sunday: Linear Regression

Back in Statistics in Action, I blogged about correlation, which measures the numerical strength of a linear relationship between two variables. Today, I'd like to talk about a similar statistic, that differs mainly in how you apply and interpret it: linear regression.

Recall that correlation ranges from -1 to +1 (with 0 indicating no relationship, and the sign indicating the direction: one goes up the other goes up is positive and one goes up the other goes down is negative). That's because correlation is standardized: to compute a correlation, you have to convert values to Z-scores. Regression is essentially correlation, with a few key differences.

First of all, here's the equation for linear regression, which I'm sure you've seen some version of before:

y = bx + a

You may have seen it instead as y = mx + b or y = ax + b. It's a linear equation:

A linear equation is used to describe a line, using two variables: x and y. That's all regression is. The difference is that the line is used as an approximation of the relationship between x and y. We recognize that not every case falls perfectly on the line. The equation is computed so that it gets as close to the original data as possible, minimizing the (squared) deviations between the actual score and the predicted score. (BTW, this approach is called least squares, because it minimizes the squared deviations - as usual, we square the deviations so they don't add up to 0 and cancel each other out.)

As with so many statistics, regression uses averages (means). To dissect this equation (using the first version I gave above), b is the slope, or the average amount y changes for each 1 unit change in x. a is the constant, or the average value of y when x is equal to 0. Because we have one value for slope, we assume there is a linear relationship between y and x, that is the relationship is the same across all possible values. So regardless of which values we choose for x and y (within our possible ranges), we expect the relationship to be the same. There are other regression approaches we use if and when we think the relationship is non-linear, which I'll blog about later on.

Because our slope is the amount of change we expect to see in y and our constant is the average value of y for x=0, these two values are in the same units as our y variable. So if we were predicting how tall a person is going to grow in inches, y, the slope (b), and the constant (a) would all be in inches. If we use standardized values, which is an option in most statistical programs, our b would be equal to the correlation between x and y.

But what if we want to use more than one x (or predictor) variable? We can do that, using a statistic called multiple linear regression. We would just add more b's and x's to the equation above, giving each a subscript number (1, 2, ...). There are many cases where more than one variable would predict our outcome.

For instance, it's rumored that many graduate schools have a prediction (regression) equation they use to predict grad school GPA of applicants, using some combination of test scores, undergraduate GPA, and strength of recommendation letters, to name a few. They're not sharing what that equation is, but we're all very sure they use them. The problem when we use multiple predictors is that they are probably also related to each other. That is, they share variance and may predict some of the same variance in our outcome. (Using the grad school example, it's highly likely that someone with a good undergraduate GPA will also have, say, good test scores, making these two predictors correlated with each other.)

So when you conduct multiple linear regression, you're not only taking into account the relationship between each predictor and the outcome; you're also correcting for the fact that the predictors are correlated with each other. So when you're conducting multiple regression, you want to check the relationship between your predictors. If two variables are highly related to each other, to the point that one could be used as a proxy for the other, your variables are collinear, meaning that they predict the same variance in your outcome. Weird things happen when you have collinear variables. If the shared variance is very high (almost full overlap in a Venn diagram), you might end having a variable that should have a positive relationship with the outcome showing a negative slope. This is because one variable is correcting for overprediction; if this happens, we call it suppression. The only way to deal with it is to drop one of the collinear variables.

Obviously, it's unlikely that your regression equation will perfectly describe the relationship between/among variables. The equation will always be an approximation. So we measure how good our regression equation is at predicting outcomes using various metrics, including the proportion of variance in the outcome variable (y) predicted by the x('s), as well as how far the predicted y's (using the equation) are from the actual y's - we call this metric residuals.

In a future post, I'll show you how to conduct a linear regression. It's actually really easy to do in R.