QUESTION

I have a dataset with around 30 independent variables and would like to construct a GLM to explore the relationship between them and the dependent variable.

I am aware that the method I was taught for this situation, stepwise regression, is now considered a statistical sin. What modern methods of model selection should be used in this situation?

{ asked by fmark }

ANSWER

There are several alternatives to stepwise regression. Among the most used that I have seen are Partial Least Squares (PLS) regression and the LASSO (least absolute shrinkage and selection operator).

Both are implemented in R packages:

PLS (pls package): http://cran.r-project.org/web/packages/pls/

LASSO (lars package): http://cran.r-project.org/web/packages/lars/index.html
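
To make concrete why the LASSO is useful for variable selection, here is a minimal sketch in plain Python (standard library only, so it runs without any packages). It is not the algorithm used by the lars package, which is based on least angle regression; it uses simple cyclic coordinate descent instead. The simulated data (30 predictors, of which only the first three matter), the penalty value lam=0.3, and all function names are assumptions made for illustration. In practice you would choose the penalty by cross-validation.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes the columns of X and y are centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, starting from b = 0
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes predictor j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_ss[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

def centre(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

# Hypothetical data: 30 predictors, only the first three affect the response.
random.seed(0)
n, p = 200, 30
true_beta = [3.0, -2.0, 1.5] + [0.0] * (p - 3)
cols = [centre([random.gauss(0, 1) for _ in range(n)]) for _ in range(p)]
X = [[cols[j][i] for j in range(p)] for i in range(n)]
y = centre([sum(X[i][j] * true_beta[j] for j in range(p)) + random.gauss(0, 0.5)
            for i in range(n)])

beta = lasso_cd(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected predictors:", selected)
print("their coefficients:", [round(beta[j], 2) for j in selected])
```

The point of the sketch is the soft-thresholding step: coefficients whose association with the residual is weaker than the penalty are set to exactly zero, so selection and estimation happen in a single fit rather than through a sequence of significance tests.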

If you only want to explore the relationship between your dependent variable and the independent variables (i.e. you do not need statistical significance tests), I would also recommend machine learning methods such as Random Forests or Classification and Regression Trees. Random Forests can also approximate complex non-linear relationships between your dependent and independent variables, which linear techniques (like linear regression) might not reveal.
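
As a rough illustration of that last point, the sketch below fits a straight line and a small regression tree to a U-shaped relationship and compares their R-squared on held-out data. It is plain Python with a single hand-rolled tree on one hypothetical predictor, not a Random Forest (a forest averages many such trees grown on bootstrap samples). The simulated data, the depth and leaf-size settings, and all function names are assumptions for illustration only.

```python
import random

def sse(vals):
    """Sum of squared deviations from the mean."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def fit_tree(xs, ys, depth=4, min_leaf=10):
    """Grow a one-variable regression tree by greedy SSE-minimising splits.

    A leaf is a float (the mean response); an inner node is a tuple
    (threshold, left_subtree, right_subtree).
    """
    if depth == 0 or len(xs) < 2 * min_leaf:
        return sum(ys) / len(ys)
    pairs = sorted(zip(xs, ys))
    best_cost, best_k = None, None
    for k in range(min_leaf, len(pairs) - min_leaf + 1):
        cost = sse([y for _, y in pairs[:k]]) + sse([y for _, y in pairs[k:]])
        if best_cost is None or cost < best_cost:
            best_cost, best_k = cost, k
    thr = (pairs[best_k - 1][0] + pairs[best_k][0]) / 2
    left, right = pairs[:best_k], pairs[best_k:]
    return (thr,
            fit_tree([x for x, _ in left], [y for _, y in left], depth - 1, min_leaf),
            fit_tree([x for x, _ in right], [y for _, y in right], depth - 1, min_leaf))

def predict_tree(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        thr, left, right = tree
        tree = left if x <= thr else right
    return tree

def r_squared(ys, preds):
    m = sum(ys) / len(ys)
    ss_res = sum((y - q) ** 2 for y, q in zip(ys, preds))
    ss_tot = sum((y - m) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def simulate(n):
    """Hypothetical U-shaped relationship: y = x^2 plus noise."""
    xs = [random.uniform(-3, 3) for _ in range(n)]
    ys = [x ** 2 + random.gauss(0, 0.5) for x in xs]
    return xs, ys

random.seed(1)
x_train, y_train = simulate(300)
x_test, y_test = simulate(300)

a, b = fit_line(x_train, y_train)
tree = fit_tree(x_train, y_train)

r2_line = r_squared(y_test, [a + b * x for x in x_test])
r2_tree = r_squared(y_test, [predict_tree(tree, x) for x in x_test])
print("held-out R^2, straight line:", round(r2_line, 3))
print("held-out R^2, regression tree:", round(r2_tree, 3))
```

Because the relationship is symmetric around zero, the straight line finds almost no slope, while the tree recovers most of the structure from a handful of splits. This is the kind of pattern that exploratory tree-based methods can surface before you commit to a parametric form.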

A good starting point for machine learning in R is the Machine Learning task view on CRAN:

Machine Learning Task View: http://cran.r-project.org/web/views/MachineLearning.html

{ answered by Johannes }