No stairway. Denied.
In either analysis, you would generally choose your predictors ahead of time. But then, there's nothing that says you can't include far more predictors than you should (that is, more than the data can support), or predictors that have no business being in a particular regression equation.
In forward stepwise regression, the program selects the variable among your identified predictors that is most highly related to the outcome variable. Then it adds the predictor that is next most strongly related to the outcome once the first is accounted for. It keeps doing this until additional predictors result in no significant improvement of the model (significant improvement being determined by the change in R²).
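To make the forward procedure concrete, here is a minimal sketch in Python with NumPy. It is not how any particular stats package implements it. The function names, the simulated data, and the `f_to_enter=4.0` cut-off (roughly the conventional 5% critical value for an F test with one numerator degree of freedom in a moderate sample) are all my own assumptions for illustration.

```python
import numpy as np

def r_squared(X, y, cols):
    """R^2 of an OLS fit of y on an intercept plus the chosen columns of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(X, y, f_to_enter=4.0):
    """Add predictors one at a time until the F for the R^2 change is too small."""
    n, p = X.shape
    selected, current_r2 = [], 0.0
    while len(selected) < p:
        remaining = [c for c in range(p) if c not in selected]
        # Candidate that raises R^2 the most when added to the current model.
        best = max(remaining, key=lambda c: r_squared(X, y, selected + [c]))
        new_r2 = r_squared(X, y, selected + [best])
        # Residual df: n minus intercept, current predictors, and the candidate.
        df_resid = n - len(selected) - 2
        f_change = (new_r2 - current_r2) / ((1.0 - new_r2) / df_resid)
        if f_change < f_to_enter:  # assumed cut-off, not a package default
            break
        selected.append(best)
        current_r2 = new_r2
    return selected, current_r2

# Simulated data: only columns 0 and 3 are truly related to the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)

selected, r2 = forward_stepwise(X, y)
print("selected columns:", selected, "R^2:", round(r2, 3))
```

Notice that nothing in the loop knows or cares why a predictor is related to the outcome. It only sees the change in R².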
In backward stepwise regression, the program includes all of your predictor variables, then begins removing variables with the smallest effect on the outcome variable. It stops when removing a variable results in a significant decrease in explained variance.
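The backward version can be sketched the same way. Again, this is an illustration under my own assumptions rather than any package's actual routine: the function names, the simulated data, and the `f_to_remove=4.0` threshold are hypothetical choices.

```python
import numpy as np

def r_squared(X, y, cols):
    """R^2 of an OLS fit of y on an intercept plus the chosen columns of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def backward_stepwise(X, y, f_to_remove=4.0):
    """Start with every predictor; drop the weakest until a drop costs too much."""
    n, p = X.shape
    selected = list(range(p))
    current_r2 = r_squared(X, y, selected)
    while selected:
        # Predictor whose removal leaves the highest R^2 (the smallest loss).
        weakest = max(
            selected,
            key=lambda c: r_squared(X, y, [k for k in selected if k != c]),
        )
        reduced = [k for k in selected if k != weakest]
        reduced_r2 = r_squared(X, y, reduced)
        # Residual df of the current (larger) model: n - predictors - intercept.
        df_resid = n - len(selected) - 1
        f_change = (current_r2 - reduced_r2) / ((1.0 - current_r2) / df_resid)
        if f_change >= f_to_remove:  # assumed cut-off: the loss is "significant"
            break
        selected, current_r2 = reduced, reduced_r2
    return selected, current_r2

# Simulated data: only columns 0 and 3 are truly related to the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)

selected, r2 = backward_stepwise(X, y)
print("kept columns:", selected, "R^2:", round(r2, 3))
```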
As you can probably guess, this analysis approach is rife with potential for false positives and chance relationships. Many of the message boards said, rightly, that there is basically no situation where this approach is justified. It isn't even good exploratory data analysis; it's just lazy.
But is there a way this analysis technique could be salvaged? Possibly, if one took a page from the exploratory data analysis playbook and first plotted data, examined potential confounds and alternative explanations for relationships between variables, then made an informed choice about the variables to include in the analysis.
And, most importantly, the analyst should have a way of testing a stepwise regression procedure in another sample, to verify the findings. Let's be honest: to use a technique like this one, where you can add in any number of predictors, you should have a reasonably large sample size, or else you should find a better statistic. With a large enough sample, you could randomly split it into a development sample, where you determine the best models, and a testing sample, where you confirm the models created through the development sample. This approach is often used in data science.
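Here is a rough sketch of that split-sample check in Python with NumPy. Everything in it is assumed for illustration: the simulated data (20 predictors, only one of which matters), the 50/50 split, the helper names, and especially the stopping rule, which is a plain minimum gain in R² (`min_gain=0.01`) standing in for a proper significance test to keep the example short.

```python
import numpy as np

def design_matrix(X, cols):
    """Intercept column followed by the chosen predictor columns."""
    return np.column_stack([np.ones(X.shape[0])] + [X[:, c] for c in cols])

def fit_ols(X, y, cols):
    """Least-squares coefficients for y on the chosen columns."""
    coef, *_ = np.linalg.lstsq(design_matrix(X, cols), y, rcond=None)
    return coef

def r_squared(X, y, cols, coef):
    """R^2 of given coefficients on (possibly new) data."""
    resid = y - design_matrix(X, cols) @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, min_gain=0.01):
    """Greedy forward selection; stops when the best gain in R^2 is < min_gain."""
    selected, current = [], 0.0
    while len(selected) < X.shape[1]:
        remaining = [c for c in range(X.shape[1]) if c not in selected]
        scores = {
            c: r_squared(X, y, selected + [c], fit_ols(X, y, selected + [c]))
            for c in remaining
        }
        best = max(scores, key=scores.get)
        if scores[best] - current < min_gain:
            break
        selected.append(best)
        current = scores[best]
    return selected

# Simulated data: 20 candidate predictors, but only column 0 matters.
rng = np.random.default_rng(1)
n, p = 400, 20
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)

# Random split into development and testing halves.
idx = rng.permutation(n)
dev, test = idx[: n // 2], idx[n // 2:]

# Build the model on the development half only, then score it on both halves.
selected = forward_select(X[dev], y[dev])
coef = fit_ols(X[dev], y[dev], selected)
dev_r2 = r_squared(X[dev], y[dev], selected, coef)
test_r2 = r_squared(X[test], y[test], selected, coef)
print("selected:", selected)
print("development R^2:", round(dev_r2, 3), "testing R^2:", round(test_r2, 3))
```

The thing to look at is which selected predictors hold up in the testing half. Any column other than 0 that sneaks into `selected` is a chance relationship, and the testing R² is the more honest estimate of how well the model works.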
BTW, I've had some online conversations with people about the term data science and I've had the chance to really think about what it is and what it means. Look for more on that in my next Statistics Sunday post!
What do you think are the biggest statistical sins?