10

Which of the two sets of steps below is the correct one when creating a predictive model?

Option 1:

First eliminate the most obviously bad predictors and preprocess the remaining ones if needed. Then train various models with cross-validation, pick the few best ones, and identify the top predictors each one has used. Retrain those models with those predictors only and evaluate accuracy again with cross-validation. Finally, pick the best model, train it on the full training set using its key predictors, and use it to predict the test set.

Option 2:

First eliminate the most obviously bad predictors and preprocess the remaining ones if needed. Then use a feature-selection technique like recursive feature elimination (e.g. RFE with a random forest) with cross-validation to identify the ideal number of key predictors and which predictors they are. Next, train different model types with cross-validation and see which one gives the best accuracy with those top predictors identified earlier. Finally, retrain the best of those models with those predictors on the full training set and use it to predict the test set.
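A minimal sketch of Option 2's feature-selection step, using sklearn's `RFECV` (recursive feature elimination with cross-validation) wrapped around a random forest; the dataset here is synthetic, just to show the mechanics:

```python
# Recursive feature elimination with cross-validation: RFECV drops
# features one at a time and picks the subset size that scores best.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,  # eliminate one feature per round
    cv=5,    # 5-fold cross-validation to score each subset
)
selector.fit(X, y)

print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
```

`selector.support_` is a boolean mask over the original columns, so the identified key predictors can then be fed to whatever model types you want to compare.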

cego

2 Answers

17

I find both of your options slightly faulty. This is, very broadly, how a predictive modelling workflow generally looks:

  • Data Cleaning: Takes the most time, but every second spent here is worth it. The cleaner your data is after this step, the less total time you will spend overall.
  • Splitting the data set: The data set is split into training and testing sets, used for modelling and prediction respectively. A further split for a cross-validation set may also be needed.
  • Transformation and Reduction: Involves processes like transformations and scaling (e.g. mean or median scaling).
  • Feature Selection: This can be done in many ways, like threshold selection, subset selection, etc.
  • Designing the predictive model: Design the predictive model on the training data, depending on the features you have at hand.
  • Cross-Validation: Evaluate the candidate models on the held-out cross-validation folds.
  • Final Prediction and Validation
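The steps above can be sketched with a sklearn `Pipeline`, so that the transformation and feature-selection steps are fit only on the training folds during cross-validation (the specific scaler, selector, and model here are arbitrary placeholders, not a recommendation):

```python
# split -> transform -> feature selection -> model -> CV -> final prediction
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# Splitting the data set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # transformation
    ("select", SelectKBest(f_classif, k=10)),     # feature selection
    ("model", LogisticRegression(max_iter=1000)), # predictive model
])

# Cross-validation on the training set only
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("CV accuracy: %.3f" % scores.mean())

# Final prediction / validation on the untouched test set
pipe.fit(X_train, y_train)
print("test accuracy: %.3f" % pipe.score(X_test, y_test))
```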
Dawny33
3

Where feature selection finds a place in your pipeline depends on the problem. If you know your data well, you can select features manually based on that knowledge. If you don't, experimentation with the models using cross-validation may be best. Reducing the number of features a priori with some additional technique like chi2 or PCA may actually reduce model accuracy.
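The mechanics of that comparison look something like this (using the sklearn digits data purely for illustration; whether an a-priori chi2 reduction hurts or helps depends entirely on your data):

```python
# Compare cross-validated accuracy with all features vs. after an
# a-priori chi2 reduction. chi2 requires non-negative features,
# which the digits pixel intensities are.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64 pixel features

full = LogisticRegression(max_iter=5000)
reduced = make_pipeline(SelectKBest(chi2, k=16),
                        LogisticRegression(max_iter=5000))

acc_full = cross_val_score(full, X, y, cv=5).mean()
acc_reduced = cross_val_score(reduced, X, y, cv=5).mean()
print("all 64 features: %.3f" % acc_full)
print("chi2 top 16:     %.3f" % acc_reduced)
```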

In my experience with text classification using an SGD classifier, for example, keeping all several hundred thousand words encoded as binary features gave better results than reducing them to a few thousand or a few hundred. Training time was actually faster with all features, because feature selection is rather slow in my toolset (sklearn), as it is not stochastic like SGD.
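That setup, all words kept as binary presence features and fed to an SGD classifier, can be sketched as follows (the toy corpus and labels here are made up; the real case had hundreds of thousands of features):

```python
# Binary bag-of-words features (no pruning) into a stochastic
# gradient descent linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

docs = ["good movie great plot", "terrible movie bad acting",
        "great film loved it", "bad plot awful film"]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    CountVectorizer(binary=True),   # word presence/absence, all words kept
    SGDClassifier(random_state=0),  # fast even with huge feature counts
)
clf.fit(docs, labels)
pred = clf.predict(["great acting good film"])
print(pred)
```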

Multicollinearity is something to watch out for, but feature interpretability might be equally important.

People also report getting the best results with ensembles of models, each model capturing a particular part of the information space better than the others. That would also preclude you from selecting features before fitting all the models you'd include in your ensemble.
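A minimal sketch of that ensemble point, using sklearn's `VotingClassifier` with a few dissimilar base models, each seeing all the features (so no per-model feature selection has been done up front):

```python
# Soft-voting ensemble of three different model families.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted probabilities
)

acc = cross_val_score(ensemble, X, y, cv=5).mean()
print("ensemble CV accuracy: %.3f" % acc)
```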

Diego