I have a dataset with around 30 independent variables and would like to construct a GLM to explore the relationship between them and the dependent variable.
I am aware that the method I was taught for this situation, stepwise regression, is now considered a statistical sin. What modern methods of model selection should be used in this situation?
There are several alternatives to stepwise regression. The ones I have seen used most often are:
- Expert opinion to decide which variables to include in the model.
- Partial Least Squares (PLS) regression. You essentially extract latent variables and regress on them. You could also run PCA yourself and then regress on the principal components.
- Least Absolute Shrinkage and Selection Operator (LASSO), which penalises the absolute size of the coefficients so that some of them shrink to exactly zero.
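To make the LASSO idea concrete, here is a minimal sketch in plain Python of cyclic coordinate descent for the penalised least-squares objective. It is only an illustration under stated assumptions: the function names, the toy data, and the penalty value `lam=0.3` are all made up for this example, no intercept is fitted, and it handles the Gaussian (ordinary regression) case rather than a general GLM. For real work you would use an established implementation and choose the penalty by cross-validation.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it lies within it."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows and y a list of responses. No intercept is fitted,
    so the data are assumed to be (roughly) centred.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that leaves column j out.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * coef[j]) for i in range(n)) / n
            new_coef = soft_threshold(rho, lam) / col_sq[j]
            delta = new_coef - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                coef[j] = new_coef
    return coef

def make_toy_data(n=200, p=10, seed=0):
    """Hypothetical data: only the first two of p predictors affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.5) for row in X]
    return X, y

X, y = make_toy_data()
coef = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print("selected predictors:", selected)
```

The soft-thresholding step is what performs the selection: a predictor whose covariance with the current residual stays below the penalty gets a coefficient of exactly zero. The surviving coefficients are biased toward zero, so they should not be read as unpenalised estimates.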
Both PLS Regression and LASSO are implemented in R packages like
If you only want to explore the relationship between your dependent variable and the independent variables (i.e. you do not need statistical significance tests), I would also recommend machine learning methods such as Random Forests or Classification and Regression Trees (CART). Random Forests can also approximate complex non-linear relationships between the dependent and independent variables that linear techniques (such as linear regression) might not reveal.
A good starting point for machine learning in R might be the Machine Learning task view on CRAN:
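As a rough illustration of why tree-based methods can pick up non-linear structure, the sketch below finds the single best split of one predictor, which is the basic building block of a regression tree. It is a toy in plain Python, not a Random Forest: the helper names and the step-shaped data are invented for this example, and a real forest grows many deeper trees on bootstrap samples and averages them.

```python
import random

def sse(values):
    """Sum of squared deviations of values from their mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, sse_after) for the one split on x that minimises
    the total within-group squared error of y."""
    pairs = sorted(zip(x, y))
    best_threshold, best_sse = None, sse(y)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical x values
        left = [v for _, v in pairs[:k]]
        right = [v for _, v in pairs[k:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_sse = total
    return best_threshold, best_sse

# Hypothetical data: y jumps from about 0 to about 1 when x crosses 0.5.
rng = random.Random(1)
x = [rng.random() for _ in range(300)]
y = [(1.0 if xi > 0.5 else 0.0) + rng.gauss(0, 0.1) for xi in x]

threshold, sse_after = best_split(x, y)
print("split at", round(threshold, 3), "reduces SSE from", round(sse(y), 2), "to", round(sse_after, 2))
```

A tree repeats this search recursively within each resulting group and across all predictors, which lets it approximate step-like or interacting effects without you having to specify their form in advance.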
Machine Learning Task View: http://cran.r-project.org/web/views/MachineLearning.html