
I am using GridSearchCV to optimise my predictions.

I am working with a fairly large dataset and I am afraid I have not optimised the parameters enough.

df_train.describe():    
         Unnamed: 0           col1           col2           col3           col4          col5
count  8.886500e+05  888650.000000  888650.000000  888650.000000  888650.000000  888650.000000
mean   5.130409e+05       2.636784       3.845549       4.105381       1.554918       1.221922
std    2.998785e+05       2.296243       1.366518       3.285802       1.375791       1.233717
min    4.000000e+00       1.010000       1.010000       1.010000       0.000000       0.000000
25%    2.484332e+05       1.660000       3.230000       2.390000       1.000000       0.000000
50%    5.233705e+05       2.110000       3.480000       3.210000       1.000000       1.000000
75%    7.692788e+05       2.740000       3.950000       4.670000       2.000000       2.000000
max    1.097490e+06      90.580000      43.420000      99.250000      22.000000      24.000000

df_test.describe():
       Unnamed: 0        col1        col2        col3  col4  col5
count  390.000000  390.000000  390.000000  390.000000   0.0   0.0
mean   194.500000    3.393359    4.016821    3.761385   NaN   NaN
std    112.727548    4.504227    1.720292    3.479109   NaN   NaN
min      0.000000    1.020000    2.320000    1.020000   NaN   NaN
25%     97.250000    1.792500    3.272500    2.220000   NaN   NaN
50%    194.500000    2.270000    3.555000    3.055000   NaN   NaN
75%    291.750000    3.172500    4.060000    4.217500   NaN   NaN
max    389.000000   50.000000   18.200000   51.000000   NaN   NaN

The way I am using GridSearchCV is as follows:

rf_h = RandomForestRegressor()
rf_a = RandomForestRegressor()

Using GridSearch for Optimisation

param_grid = { 'max_features': ['auto', 'sqrt', 'log2'] }

rf_g_h = GridSearchCV(estimator=rf_h, param_grid=param_grid, cv=3, n_jobs=-1)
rf_g_a = GridSearchCV(estimator=rf_a, param_grid=param_grid, cv=3, n_jobs=-1)

Fitting dataframe to prediction engine

rf_g_h.fit(X_h, y_h)
rf_g_a.fit(X_a, y_a)

How can I improve param_grid, and how do I then determine the best_params_ it produces?

What would be a sensible range of values for n_estimators for this dataset?

PyNoob

2 Answers


In general there is no way to know in advance which values are best to try for a parameter. The only thing one can do is try many possible values, but:

  • trying more values directly costs more computing time (see this question about how GridSearchCV works); a rough cost sketch follows after this list
  • there is a risk of overfitting the hyperparameters, i.e. selecting a value which is optimal only by chance on the validation set.
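
To make the cost point concrete, here is a minimal sketch of how the number of fits grows. It assumes the cv=3 setup from the question; the wider grid is purely illustrative, not a recommendation.

# GridSearchCV fits one model per parameter combination per CV fold,
# so total fits = (number of combinations) x cv, plus one refit on the full data.
from sklearn.model_selection import ParameterGrid

cv = 3  # as in the question

small_grid = {'max_features': ['sqrt', 'log2']}
wide_grid = {                      # illustrative values only
    'max_features': ['sqrt', 'log2'],
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20, 40],
}

for name, grid in [('small', small_grid), ('wide', wide_grid)]:
    n_combos = len(ParameterGrid(grid))
    print(f"{name} grid: {n_combos} combinations -> {n_combos * cv} fits")
# small grid: 2 combinations -> 6 fits
# wide grid: 24 combinations -> 72 fits
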
Erwan

Instead of GridSearchCV, you should try Optuna. It is much faster than GridSearchCV.
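
For example, a minimal Optuna sketch for the question's setup could look like this (the parameter ranges, the scoring metric and the number of trials are placeholders, and X_h, y_h are the training data from the question):

import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search ranges below are placeholders, not recommendations
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 30),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
    }
    model = RandomForestRegressor(**params, n_jobs=-1, random_state=0)
    # Mean CV score (negative MSE, so higher is better)
    return cross_val_score(model, X_h, y_h, cv=3, scoring='neg_mean_squared_error').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)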

Apart from that, coming to your question: there is no single best value for a hyperparameter per se. It depends on your data; a value that works for one dataset might not work for another.

Another point to keep in mind: a model like a Random Forest has many hyperparameters. Including all of them with a wide range of values in your grid search will take forever. Instead, include only the parameters that give the biggest improvement in your results, i.e. the ones that matter most; a sketch of such a narrowed grid follows below. Here is a link to a blog that might help you: https://blog.dataiku.com/narrowing-the-search-which-hyperparameters-really-matter
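
As an illustration, a narrowed grid along these lines (the values are placeholders, and X_h, y_h are the question's training data) keeps the number of combinations small while still exposing best_params_ and best_score_:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small grid over the parameters that usually matter most for a random forest;
# the values themselves are illustrative only.
param_grid = {
    'n_estimators': [100, 300],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 5, 20],
}

rf = RandomForestRegressor(n_jobs=-1, random_state=0)
search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1)
search.fit(X_h, y_h)

print(search.best_params_)   # best combination found under cross-validation
print(search.best_score_)    # its mean cross-validated score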

Hope it helps!

spectre