I have a logistic mixed model (lme4 package in R). I want to assess whether participants' scores on the measures 'sumspq', 'sumpdi', and 'sumcaps' significantly affect the difference in performance between two conditions.
I first run the model:
performance ~ Condition*(sumspq+sumpdi+sumcaps)+ (1|participant)
The results show that none of the interactions is significant (all ps > .05).
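For concreteness, the first model can be fit like this (a runnable sketch on simulated stand-in data; the variable names match my dataset, everything else is made up):

```r
library(lme4)

set.seed(1)
# Simulated stand-in data: 40 participants x 2 conditions x 10 trials,
# with one questionnaire score per participant for each measure
pp  <- data.frame(participant = factor(1:40),
                  sumspq = rnorm(40), sumpdi = rnorm(40), sumcaps = rnorm(40))
dat <- merge(expand.grid(participant = factor(1:40),
                         Condition   = factor(c("A", "B")),
                         trial       = 1:10),
             pp, by = "participant")
dat$performance <- rbinom(nrow(dat), 1, 0.5)  # binary outcome

# Logistic mixed model with Condition-by-score interactions
m1 <- glmer(performance ~ Condition * (sumspq + sumpdi + sumcaps) +
              (1 | participant),
            data = dat, family = binomial)
summary(m1)  # interaction rows of the fixed-effects table hold the p-values
```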
I check variance inflation factors and confirm that there is no multicollinearity. As a double check, separate simple models with each predictor on its own also produce no significant results (e.g., performance ~ Condition*sumspq + (1|participant)).
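The collinearity check was done roughly like this (a sketch on simulated data; check_collinearity() comes from the performance package, and treating low VIFs as unproblematic is my own working rule):

```r
library(lme4)
library(performance)

set.seed(1)
# Same simulated structure: per-participant scores, trial-level binary outcome
pp  <- data.frame(participant = factor(1:40),
                  sumspq = rnorm(40), sumpdi = rnorm(40), sumcaps = rnorm(40))
dat <- merge(expand.grid(participant = factor(1:40),
                         Condition   = factor(c("A", "B")),
                         trial       = 1:10),
             pp, by = "participant")
dat$performance <- rbinom(nrow(dat), 1, 0.5)

m1 <- glmer(performance ~ Condition * (sumspq + sumpdi + sumcaps) +
              (1 | participant), data = dat, family = binomial)
check_collinearity(m1)  # VIFs near 1 = no evidence of multicollinearity

# Double check: each predictor on its own
m_spq <- glmer(performance ~ Condition * sumspq + (1 | participant),
               data = dat, family = binomial)
summary(m_spq)
```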
I also want to add covariates to the model to see whether they influence the results (an exploratory analysis with limited hypotheses). The covariates are Age, IQ, sumsens1, and sumsens2.
I use an automated stepwise procedure for mixed models (the buildmer package) to find the optimal model with the covariates included. In this procedure, the interactions of sumspq, sumpdi, and sumcaps are forced to stay in the model as the variables of interest. The model produced is:
performance ~ Condition*(sumspq+sumpdi+sumcaps+sumsens1)+ Age + IQ + (1|participant)
and has significant effects of Condition*sumsens1 (p = .007), Age (p = .02), and IQ (p = .04).
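The buildmer step, as I set it up, looks roughly like this (a sketch on simulated data; I use the include argument of buildmerControl() to force the interactions of interest to be kept, which is my understanding of how that option works):

```r
library(buildmer)

set.seed(1)
# Simulated stand-in data including the candidate covariates
pp  <- data.frame(participant = factor(1:40),
                  sumspq = rnorm(40), sumpdi = rnorm(40), sumcaps = rnorm(40),
                  sumsens1 = rnorm(40), sumsens2 = rnorm(40),
                  Age = rnorm(40, 30, 8), IQ = rnorm(40, 100, 15))
dat <- merge(expand.grid(participant = factor(1:40),
                         Condition   = factor(c("A", "B")),
                         trial       = 1:10),
             pp, by = "participant")
dat$performance <- rbinom(nrow(dat), 1, 0.5)

# Maximal candidate model for the stepwise procedure
f <- performance ~ Condition * (sumspq + sumpdi + sumcaps +
                                sumsens1 + sumsens2) +
  Age + IQ + (1 | participant)

b <- buildmer(f, data = dat, family = binomial,
              buildmerControl = buildmerControl(
                include = ~ Condition:sumspq + Condition:sumpdi +
                  Condition:sumcaps))
summary(b@model)  # the selected glmer fit lives in the @model slot
```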
THE PROBLEM:
My goal is to ensure that I have selected the optimal model to report. I have noticed that the p-values of the covariates vary greatly depending on which combination of covariates is included in the model, despite there being no evidence of multicollinearity. How do I select the optimal model in terms of confidence in the stability of the estimates/p-values? I have tried train/test cross-validation (caret package in R; 0.62 accuracy), but I realise this assesses the predictive power of the model (generalisation to a new dataset) rather than identifying the optimal model (even among poor-performing models).
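To make the stability concern concrete, this is the kind of check I have in mind (a sketch on simulated data, not a procedure from a package: refit the model on bootstrap resamples of participants and look at the spread of a covariate's estimate; a reduced model with one score and one covariate keeps it fast):

```r
library(lme4)

set.seed(1)
# Simulated stand-in data: per-participant score and covariate
pp  <- data.frame(participant = factor(1:40),
                  sumspq = rnorm(40),
                  Age    = rnorm(40, 30, 8))
dat <- merge(expand.grid(participant = factor(1:40),
                         Condition   = factor(c("A", "B")),
                         trial       = 1:10),
             pp, by = "participant")
dat$performance <- rbinom(nrow(dat), 1, 0.5)

# Resample whole participants with replacement, refit, keep the Age estimate
age_boot <- replicate(30, {
  ids <- sample(levels(dat$participant), replace = TRUE)
  res <- do.call(rbind, lapply(seq_along(ids), function(i) {
    d <- dat[dat$participant == ids[i], ]
    d$participant <- factor(i)  # give each resampled participant a fresh ID
    d
  }))
  fit <- suppressWarnings(suppressMessages(
    glmer(performance ~ Condition * sumspq + Age + (1 | participant),
          data = res, family = binomial)))
  unname(fixef(fit)["Age"])
})
sd(age_boot)  # a large spread relative to the estimate signals instability
```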