Wednesday, October 4, 2017

Statistical Sins: Stepwise Regression

This evening, I started wondering: what do other statisticians think are statistical sins? So I'm perusing message boards on a sleepless Tuesday night/Wednesday morning, and I've found one thing that pops up again and again: stepwise regression.

No stairway. Denied.
Why? Stepwise regression is an analysis process in which one adds or subtracts predictors in a regression equation based on whether they are significant or not. There are, then, two types of stepwise regression: forward and backward.

In either analysis, you would generally choose your predictors ahead of time. But then, there's nothing that says you can't include far more predictors than you should (that is, more than the data can support), or predictors that have no business being in a particular regression equation.

In forward stepwise regression, the program would select the variable among your identified predictors that is most highly related to the outcome variable. Then it adds the next most highly correlated predictor. It keeps doing this until additional predictors result in no significant improvement of the model (significant improvement being determined by change in R²).
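To make the forward procedure concrete, here is a minimal sketch in Python with NumPy. It is an illustration under assumptions, not what any particular stats package does: the function names are hypothetical, each step adds the predictor that raises R² the most, and "significant improvement" is approximated by requiring a partial F statistic of at least 4.0 (roughly the .05 cutoff when the sample is reasonably large) rather than computing an exact p-value.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

def forward_stepwise(X, y, f_to_enter=4.0):
    """Greedy forward selection; returns chosen column indices in entry order.

    f_to_enter is an assumed cutoff on the partial F statistic.
    """
    n, p = X.shape
    selected, current_r2 = [], 0.0
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        # R^2 of the model obtained by adding each remaining predictor.
        scores = {j: r_squared(X[:, selected + [j]], y) for j in candidates}
        best = max(scores, key=scores.get)
        # Partial F statistic for the R^2 gain from the best candidate.
        df_resid = n - len(selected) - 2
        f_stat = (scores[best] - current_r2) / ((1.0 - scores[best]) / df_resid)
        if f_stat < f_to_enter:
            break  # no significant improvement, so stop adding
        selected.append(best)
        current_r2 = scores[best]
    return selected

# Simulated data: only columns 0 and 3 truly relate to the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)
chosen = forward_stepwise(X, y)
print(chosen)
```

In this simulated example the two real predictors enter first, but nothing stops a pure-noise column from clearing the cutoff by chance afterward.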

In backward stepwise regression, the program includes all of your predictor variables, then begins removing variables with the smallest effect on the outcome variable. It stops when removing a variable results in a significant decrease in explained variance.
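The backward procedure can be sketched the same way. Again, this is a rough illustration under assumptions: the function names are hypothetical, "smallest effect" is taken to mean the predictor whose removal costs the least R², and a partial F cutoff of 4.0 stands in for a proper significance test.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

def backward_stepwise(X, y, f_to_remove=4.0):
    """Greedy backward elimination; returns the column indices that survive.

    f_to_remove is an assumed cutoff on the partial F statistic.
    """
    n, p = X.shape
    kept = list(range(p))
    while kept:
        full_r2 = r_squared(X[:, kept], y)
        # R^2 that would remain after dropping each predictor in turn.
        after_drop = {
            j: r_squared(X[:, [k for k in kept if k != j]], y) for j in kept
        }
        weakest = max(after_drop, key=after_drop.get)  # smallest loss in R^2
        df_resid = n - len(kept) - 1
        f_stat = (full_r2 - after_drop[weakest]) / ((1.0 - full_r2) / df_resid)
        if f_stat >= f_to_remove:
            break  # removing it would significantly reduce explained variance
        kept.remove(weakest)
    return kept

# Simulated data: only columns 0 and 3 truly relate to the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)
kept = backward_stepwise(X, y)
print(kept)
```

Note that this starts from the full model, so it needs more observations than predictors to run at all.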

As you can probably guess, this analysis approach is rife with the potential for false positives and chance relationships. Many of the message boards said, rightly, there is basically no situation where this approach is justified. It isn't even good exploratory data analysis; it's just lazy.
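A quick simulation shows how easily chance relationships get through. The sketch below is my own illustration with assumed settings (100 cases, 20 predictors, 500 replications): every predictor is pure noise, yet we check how often the best one would pass a nominal .05 test at the first forward step. The critical value of 3.94 is the approximate .05 cutoff for an F statistic with 1 and 98 degrees of freedom.

```python
import numpy as np

def first_step_false_positive_rate(n=100, p=20, reps=500, f_crit=3.94, seed=0):
    """Share of pure-noise datasets where the best single predictor looks significant."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = rng.normal(size=n)  # outcome is unrelated to every predictor
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        # Correlation of each predictor with the outcome.
        r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        best_r2 = np.max(r ** 2)
        # F statistic for a one-predictor regression using the best correlate.
        f_stat = best_r2 / (1.0 - best_r2) * (n - 2)
        hits += int(f_stat > f_crit)
    return hits / reps

rate = first_step_false_positive_rate()
print(f"Noise-only datasets with a 'significant' first predictor: {rate:.0%}")
```

If the 20 tests were independent, we would expect a rate near 1 - 0.95^20, or about 64%, rather than the advertised 5%.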

But is there a way this analysis technique could be salvaged? Possibly, if one took a page from the exploratory data analysis playbook and first plotted data, examined potential confounds and alternative explanations for relationships between variables, then made an informed choice about the variables to include in the analysis.

And, most importantly, the analyst should have a way of testing a stepwise regression procedure in another sample, to verify the findings. Let's be honest; to use a technique like this one, where you can add in any number of predictors, you should have a reasonably large sample size or else you should find a better statistic. Therefore, you could randomly split your sample into a development sample, where you determine best models, and a testing sample, where you confirm the models created through the development sample. This approach is often used in data science.
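The split-sample idea can be sketched as follows. Everything named here is hypothetical, and the selection step is a deliberately simple stand-in (keep the five strongest correlates) rather than a full stepwise routine; the point is only the workflow: choose and fit on the development half, then score the same model on the testing half.

```python
import numpy as np

def fit_ols(X, y):
    """OLS coefficients (intercept first) for y regressed on the columns of X."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def r2_with(coef, X, y):
    """R^2 of already-fitted coefficients applied to (possibly new) data."""
    pred = np.column_stack([np.ones(len(y)), X]) @ coef
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def top_k_by_correlation(X, y, k=5):
    """Stand-in for a stepwise routine: keep the k strongest correlates."""
    r = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(np.argsort(r)[-k:].tolist())

def split_sample_check(X, y, select, seed=0):
    """Select and fit on a random development half, then score on the testing half."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    dev, test = order[: len(y) // 2], order[len(y) // 2 :]
    chosen = select(X[dev], y[dev])
    coef = fit_ols(X[dev][:, chosen], y[dev])
    dev_r2 = r2_with(coef, X[dev][:, chosen], y[dev])
    test_r2 = r2_with(coef, X[test][:, chosen], y[test])
    return chosen, dev_r2, test_r2

# Pure-noise data: any "finding" in the development half is a chance relationship.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 30))
y = rng.normal(size=120)
chosen, dev_r2, test_r2 = split_sample_check(X, y, top_k_by_correlation)
print(f"development R^2 = {dev_r2:.2f}, testing R^2 = {test_r2:.2f}")
```

With noise-only data like this, the development sample tends to show a respectable-looking R² that does not hold up in the testing sample, which is exactly the warning sign the split is meant to catch.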

BTW, I've had some online conversations with people about the term data science and I've had the chance to really think about what it is and what it means. Look for more on that in my next Statistics Sunday post!

What do you think are the biggest statistical sins?


  1. Again, your insights are so helpful.

  2. What about stepwise regression using AIC/BIC? From what I've learned, this results in models that also include non-significant predictors.

  3. Seems that with p-hacking going on in fields like social psych, mechanical model selection might be reasonable to reduce researcher degrees of freedom, etc. What about newer techniques like LAR? Wouldn't they have similar problems? No doubt mechanical model selection has some issues in terms of how predictors are related, but in the absence of a theoretically informed model, these mechanical selection techniques seem refreshingly transparent. That ought to be worth something. Rob

  4. Sara, your concerns are real, but the real problem is atheoretical use of any statistic--the "kitchen sink" approach--that is really catching on in non-academic circles. I use backward stepwise regression when predicting a final model from variables that are linked by theory to the outcome. I like backward because the process compares R² change at each step and no variable is guaranteed a place in the equation, something forward stepwise doesn't do--once a variable is in the model, it is there regardless of the impact of other variables down the line. I've used both approaches to test models and had some interesting outcomes.

  5. The main problem is misuse of stepwise regression, not stepwise regression in itself. The classic misuse is to compute inferential statistics on the regression coefficients as if the final selected variables were selected a priori by the researcher, as opposed to being selected on the basis of model fit. This inflates the Type I error rate. Nonetheless, as an exploratory data analysis technique, you could do worse than stepwise regression.

  6. All excellent points. I suppose it's important to keep in mind that statistics are tools. Tools aren't inherently good or bad; it's in how they're used. A hammer can build a house (good) or smash your finger (bad), but only because of how the tool was used. Stepwise regression could be perfectly justifiable if used in a certain way. My experience has been that it's used as a way to weed through dozens of predictor variables, some of which were collected for dubious or completely unrelated purposes. (E.g., a researcher at a center adds a measure of X to another person's study, because the researcher studies X and wants more data or because the granting agencies are interested in X, not because X even makes sense in the context of the study. I've seen this happen a lot.)

  7. If we think of stepwise regression as a tool, it is never a good choice because with today's computing power there is a better tool--all subsets regression. Stepwise regression, contrary to the above comments, does both forward adding and backward deleting. After several variables are added, it considers whether any can be dropped. Its branching algorithm is not optimal and so it is not guaranteed to find the best model for a given number of predictors. All subsets regression is guaranteed to find the best model. So if one wants to turn an algorithm loose on the data, the dominating choice is all subsets. The only value of stepwise regression was in the old days when computers were slow and all subsets regression was not an option. It is an obsolete tool that should never be used. Whether any statistical tool should be used in an atheoretical manner like this is another matter. But if one wants to use a tool to do that, don't use stepwise.

  8. All subsets doesn't really solve the problem. It may find the best model, but it's still subject to the problem of capitalizing on the idiosyncrasies of the sample. Approaches like penalized ML and lasso are far better alternatives. See, for example

  9. How about the (possibly?) related issue of Predictive Analysis (by various names), wherein the practitioners disclaim interest in interpretive meaning?

    From page 4 of "Applied Predictive Modeling" - Kuhn/Johnson:
    "Furthermore the foremost objective of these examples is not to understand why something will (or will not) occur. Instead, we are primarily interested in accurately predicting the chances that something will (or will not) happen."