QUESTION
I'm brand new to R and am unsure which model to select.
-
I ran a forward stepwise regression, adding each variable based on the lowest AIC. I came up with 3 models and am unsure which is the "best".
Model 1: Var1 (p=0.03), AIC = 14.978
Model 2: Var1 (p=0.09) + Var2 (p=0.199), AIC = 12.543
Model 3: Var1 (p=0.04) + Var2 (p=0.04) + Var3 (p=0.06), AIC = -17.09
I'm inclined to go with Model 3 because it has the lowest AIC (I heard negative is OK) and the p-values are still rather low.
I ran 8 variables as predictors of Hatchling Mass and found that these three variables are the best predictors.
-
In my next forward stepwise regression I chose Model 2 because, even though its AIC was slightly larger, the p-values were all smaller. Do you agree this is the best?
Model 1: Var1 (p=0.321) + Var2 (p=0.162) + Var3 (p=0.163) + Var4 (p=0.222), AIC = 25.63
Model 2: Var1 (p=0.131) + Var2 (p=0.009) + Var3 (p=0.0056), AIC = 26.518
Model 3: Var1 (p=0.258) + Var2 (p=0.0254), AIC = 36.905
thanks!
ANSWER
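Looking at individual p-values can be misleading. If some of your variables are collinear (highly correlated with each other), their standard errors are inflated and you will get large p-values for each of them, even when they are useful together. Large p-values therefore do not mean the variables are useless.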
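To see the collinearity effect for yourself, here is a minimal sketch on simulated data. The variables x1, x2 and y are made up for illustration and have nothing to do with your hatchling data. With two nearly identical predictors, the per-coefficient p-values will typically come out large while the overall F-test for the model stays clearly significant.

```r
# Simulated data: x1 and x2 are hypothetical, nearly identical predictors.
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # almost a copy of x1
y  <- x1 + x2 + rnorm(n)         # both predictors truly contribute to y

fit <- lm(y ~ x1 + x2)

# Per-coefficient p-values (t-tests), which collinearity tends to inflate
p_individual <- summary(fit)$coefficients[c("x1", "x2"), "Pr(>|t|)"]

# p-value of the overall F-test for the model as a whole
f <- summary(fit)$fstatistic
p_overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(cor(x1, x2))     # correlation between the two predictors
print(p_individual)    # judged one at a time
print(p_overall)       # judged jointly
```

Checking cor() between your own predictors is a quick way to tell whether this might be happening in your models.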
As a quick rule of thumb, selecting your model by AIC is better than looking at p-values.
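A rough sketch of comparing candidate models by AIC in R. The built-in mtcars data set and the predictors wt, hp and qsec are stand-ins for your own data, not a recommendation. Only differences in AIC between models fitted to the same data matter, so a negative AIC is fine. Note also that step() reports the value from extractAIC(), which for lm fits differs from AIC() by a constant, so the two rank models identically.

```r
# Candidate models on the built-in mtcars data (stand-in for your data)
fits <- list(
  m1 = lm(mpg ~ wt,             data = mtcars),
  m2 = lm(mpg ~ wt + hp,        data = mtcars),
  m3 = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

aic_full <- sapply(fits, AIC)                            # full-likelihood AIC
aic_step <- sapply(fits, function(f) extractAIC(f)[2])   # the scale step() prints
delta    <- aic_full - min(aic_full)                     # 0 marks the lowest-AIC model

print(cbind(aic_full, aic_step, delta))

# Forward stepwise selection by AIC, starting from the intercept-only model
fwd <- step(lm(mpg ~ 1, data = mtcars),
            scope = ~ wt + hp + qsec,
            direction = "forward", trace = 0)
print(formula(fwd))
```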
One reason not to select the model with the lowest AIC is when the number of variables is large relative to the number of data points, because AIC then tends to favor overfitted models.
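One common remedy in that situation (my addition, not something stepwise does for you) is the small-sample corrected AIC, AICc = AIC + 2k(k+1)/(n-k-1), where k is the number of estimated parameters and n the number of observations. A minimal sketch, with aicc as a hypothetical helper name and mtcars again standing in for your data:

```r
# Small-sample corrected AIC; `aicc` is a helper defined here, not a base R function
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")   # estimated parameters (for lm this includes the residual variance)
  n <- nobs(fit)                 # number of observations used in the fit
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}

fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# The penalty added on top of AIC grows with the number of parameters
print(c(small = aicc(fit_small) - AIC(fit_small),
        large = aicc(fit_large) - AIC(fit_large)))
```

The correction shrinks toward zero as n grows, so it only changes the ranking when the data set is small relative to the model.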
Note that model selection and prediction accuracy are somewhat distinct problems. If your goal is to get accurate predictions, I'd suggest cross-validating your model by splitting your data into a training set and a test set.
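A minimal sketch of such a split, assuming a 70/30 division and using mtcars with arbitrary predictors in place of your data. Each candidate model is fitted on the training rows only and scored by its root mean squared error (RMSE) on the held-out rows. The helper name holdout_rmse is made up for this example.

```r
set.seed(42)
n        <- nrow(mtcars)
test_idx <- sample(n, size = round(0.3 * n))   # hold out roughly 30% of the rows
train    <- mtcars[-test_idx, ]
test     <- mtcars[test_idx, ]

# Fit on the training set, then measure prediction error on the test set
holdout_rmse <- function(formula) {
  fit  <- lm(formula, data = train)
  pred <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))
}

candidates <- list(
  m1 = mpg ~ wt,
  m2 = mpg ~ wt + hp,
  m3 = mpg ~ wt + hp + qsec
)

rmse <- sapply(candidates, holdout_rmse)
print(rmse)   # lower means better out-of-sample predictions
```

With only a handful of observations a single split is noisy, so repeating it over several random splits (or using k-fold cross-validation) gives a steadier picture.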
A paper on variable selection: Stochastic Stepwise Ensembles for Variable Selection