
Is it OK to split a GridSearchCV?

At first, I would try n_estimators from 100 to 300 (in steps of 100) for a random forest regressor, along with some other parameters. After that, I would start another GridSearchCV with the same parameters and just change n_estimators to 400–600.

Is there any aspect that would disagree with that logic?

Shayan Shafiq
ml_learner

2 Answers


First, my understanding of your problem: you want to find the best hyperparameters for a random forest.

To do that, you want to first tune the n_estimators parameter and then the rest of the parameters in separate runs.

Before answering your question: a thorough search of hyperparameters is usually only worth it when you are chasing an improvement of around 1%, so it will be a small gain. If you want to improve your model, feature engineering or data engineering will probably give you a bigger improvement, or even a different algorithm.

You can inspect the results of your GridSearchCV with:

import pandas as pd

# clf is a fitted GridSearchCV instance
pd.DataFrame(clf.cv_results_)

The answer to your question:

No, you shouldn't run a GridSearchCV in separate runs; you have to explore all the parameters jointly if you want to find the global optimum. A small change in one parameter can affect the others. In the end, you are exploring a joint search space.
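To make the point concrete, here is a minimal sketch of a joint search, where every n_estimators value is tried in combination with every other parameter value (the dataset and the max_depth values are illustrative, not from the question):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# One grid covering the whole range at once: each n_estimators value
# is evaluated together with each max_depth value, so interactions
# between the parameters are captured.
param_grid = {
    "n_estimators": [100, 300, 600],
    "max_depth": [None, 5],
}
clf = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
clf.fit(X, y)
print(clf.best_params_)
```

Splitting the n_estimators range across two separate searches would miss any combination where, say, a deeper tree works best only with more estimators.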

Carlos Mougan

Edit: oh, now I think I see why @CarlosMougan said no. You said

...start the same GridsearchCV with the same parameter and just change...

If you mean use the optimal values for all hyperparameters except n_estimators and then search only over that one hyperparameter, then Carlos is right, and for the right reason. Below, I interpreted your suggestion as searching over the whole space again, except with a new range for n_estimators.


I don't see any reason that you can't do this. You might want to fix the cv splits ahead of time and use the same ones for both runs of the grid search, to keep the comparison completely fair. (In sklearn, this means passing cv either as one of their CV splitters or as an iterable of index arrays.)
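A minimal sketch of fixing the splits, assuming a toy dataset; materializing the KFold splits as a list means both searches evaluate every candidate on identical folds, so their scores are directly comparable:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Materialize the splits once so both searches see the exact same folds.
cv = list(KFold(n_splits=3, shuffle=True, random_state=42).split(X, y))

search_1 = GridSearchCV(RandomForestRegressor(random_state=0),
                        {"n_estimators": [100, 200, 300]}, cv=cv)
search_2 = GridSearchCV(RandomForestRegressor(random_state=0),
                        {"n_estimators": [400, 500, 600]}, cv=cv)
search_1.fit(X, y)
search_2.fit(X, y)

# Scores from the two runs can now be compared fairly.
print(search_1.best_score_, search_2.best_score_)
```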

This approach makes sense particularly when:

  • you want to examine some results right away, so you run a smaller grid first and look at it while the next grid is running. (This sort of matches your case, where run times(?) are high.)

  • you expected the first grid to be all you needed, but find that one hyperparameter always performs best at the edge of your grid, so now you want to extend its range.


Finally, please note that the number of trees in a random forest has little to do with peak performance; rather, more trees just stabilize some of the randomness in the tree construction. So generally, you want to set it "high enough," while not so high that computation is needlessly long.
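One way to see this stabilization, assuming an illustrative synthetic dataset: grow the same forest incrementally with warm_start and watch the out-of-bag score settle as trees are added.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0,
                       random_state=0)

# warm_start=True keeps the already-built trees, so each fit() call
# only adds the new ones; oob_score_ shows the estimate settling.
rf = RandomForestRegressor(warm_start=True, oob_score=True, random_state=0)
for n in [50, 100, 200, 400]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(n, round(rf.oob_score_, 4))
```

Once the OOB score stops moving between steps, adding more trees mostly just adds compute time.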

Ben Reiniger