
I came here from this great answer. I have come across many approaches to cross-validation, and the answer to the linked question explains it best to me. My dilemma is that I am still not able to figure out what K-fold cross-validation should be used for:

  1. Testing overfitting?
  2. Hyperparam tuning?
  3. Any other use case?

And in each case, how? I am unable to figure out what to do with the average score that comes out of K-fold cross-validation, what to do with the individual folds, and what to do with a model trained on k-1 folds of the training data.


2 Answers


Cross-validation is primarily used for hyper-parameter tuning.

You evaluate each candidate hyper-parameter setting with K-fold cross-validation and take the average score across the folds as an estimate of that setting's performance. The hyper-parameter setting with the highest average score then becomes your model configuration; it can be treated as the best you can get from this kind of model.

Later, you can refit the model with this setting on the full training set and use your test dataset to estimate its general performance.
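A minimal sketch of that workflow, using `cross_val_score`; the breast-cancer dataset, the `LogisticRegression` model, and the candidate `C` values are illustrative assumptions, not something from the answer itself:

```python
# Compare hyper-parameter settings via 5-fold CV, pick the best average score,
# then refit on the whole training set and evaluate once on the test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidate_C = [0.01, 0.1, 1.0, 10.0]   # assumed candidate settings
mean_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=5000)
    # each fold's score comes from a model trained on k-1 folds, scored on the held-out fold
    scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_scores[C] = scores.mean()

best_C = max(mean_scores, key=mean_scores.get)

# Refit the chosen setting on the whole training set, then check generalization on the test set.
final_model = LogisticRegression(C=best_C, max_iter=5000).fit(X_train, y_train)
print("best C:", best_C, "test accuracy:", final_model.score(X_test, y_test))
```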

— 1tan Wang

Answering the "what to do" point: if you use the scikit-learn GridSearchCV class (from sklearn.model_selection), you can get the following from it:

  • the best params found among the ones you enter with the 'param_grid' input, based on the 'scoring' metric you want (e.g. roc_auc, recall, ...)
  • and, most importantly, direct access to the best estimator (i.e. the model instantiated with the best hyper-parameters found in the CV process), already refit on the whole training dataset.

I have seen some sources that do a "manual" retrain on the whole training set, but it is not necessary, as scikit-learn already lets you access the model refit on the whole training set (see the sketch below) :)
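A minimal sketch of what the answer describes; the RandomForestClassifier, the param_grid values, and the roc_auc scoring are illustrative assumptions:

```python
# GridSearchCV runs K-fold CV over the param_grid, exposes the best params,
# and (with refit=True, the default) refits the best setting on the whole training set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",   # any scoring metric you care about, e.g. recall
    cv=5,
    refit=True,          # default: best setting is refit on all of X_train
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
best_model = search.best_estimator_   # already refit on the whole training set
print("test accuracy:", best_model.score(X_test, y_test))
```

No manual retraining step is needed: `search.best_estimator_` is the refit model, ready to be evaluated on the held-out test set.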

— German C M