
I am using GridSearchCV to optimise my predictions.

I am working with a fairly large dataset and I am afraid I have not optimised the parameters enough.

df_train.describe():    
         Unnamed: 0           col1           col2           col3           col4          col5
count  8.886500e+05  888650.000000  888650.000000  888650.000000  888650.000000  888650.000000
mean   5.130409e+05       2.636784       3.845549       4.105381       1.554918       1.221922
std    2.998785e+05       2.296243       1.366518       3.285802       1.375791       1.233717
min    4.000000e+00       1.010000       1.010000       1.010000       0.000000       0.000000
25%    2.484332e+05       1.660000       3.230000       2.390000       1.000000       0.000000
50%    5.233705e+05       2.110000       3.480000       3.210000       1.000000       1.000000
75%    7.692788e+05       2.740000       3.950000       4.670000       2.000000       2.000000
max    1.097490e+06      90.580000      43.420000      99.250000      22.000000      24.000000

df_test.describe():
       Unnamed: 0        col1        col2        col3  col4  col5
count  390.000000  390.000000  390.000000  390.000000   0.0   0.0
mean   194.500000    3.393359    4.016821    3.761385   NaN   NaN
std    112.727548    4.504227    1.720292    3.479109   NaN   NaN
min      0.000000    1.020000    2.320000    1.020000   NaN   NaN
25%     97.250000    1.792500    3.272500    2.220000   NaN   NaN
50%    194.500000    2.270000    3.555000    3.055000   NaN   NaN
75%    291.750000    3.172500    4.060000    4.217500   NaN   NaN
max    389.000000   50.000000   18.200000   51.000000   NaN   NaN

The way I am using GridSearchCV is as follows:

rf_h = RandomForestRegressor()
rf_a = RandomForestRegressor()

Using GridSearch for Optimisation

param_grid = { 'max_features': ['auto', 'sqrt', 'log2'] }

rf_g_h = GridSearchCV(estimator=rf_h, param_grid=param_grid, cv=3, n_jobs=-1)
rf_g_a = GridSearchCV(estimator=rf_a, param_grid=param_grid, cv=3, n_jobs=-1)

Fitting dataframe to prediction engine

rf_g_h.fit(X_h, y_h)
rf_g_a.fit(X_a, y_a)

How can I improve param_grid, and how do I then determine the best_params_ it produces?

What would be a sensible range of values for n_estimators for this dataset?

PyNoob

2 Answers


In general there is no way to know in advance which values are best to try for a parameter. The only thing one can do is try many possible values, but:

  • trying more values directly costs more computing time (see this question about how GridSearchCV works); a rough cost sketch follows after this list
  • there is a risk of overfitting the hyperparameters, i.e. selecting a value which is optimal only by chance on the validation set.
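
To make the cost point concrete, here is a minimal sketch of how the number of fits grows. It assumes the cv=3 setup from the question; the wider grid is purely illustrative, not a recommendation.

# GridSearchCV fits one model per parameter combination per CV fold,
# so total fits = (number of combinations) x cv, plus one refit on the full data.
from sklearn.model_selection import ParameterGrid

cv = 3  # as in the question

small_grid = {'max_features': ['sqrt', 'log2']}
wide_grid = {                      # illustrative values only
    'max_features': ['sqrt', 'log2'],
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20, 40],
}

for name, grid in [('small', small_grid), ('wide', wide_grid)]:
    n_combos = len(ParameterGrid(grid))
    print(f"{name} grid: {n_combos} combinations -> {n_combos * cv} fits")
# small grid: 2 combinations -> 6 fits
# wide grid: 24 combinations -> 72 fits
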
Erwan

Instead of GridSearchCV, you should try Optuna. It is much faster than GridSearchCV.
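
For example, a minimal Optuna sketch for the question's setup could look like this (the parameter ranges, the scoring metric and the number of trials are placeholders, and X_h, y_h are the training data from the question):

import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search ranges below are placeholders, not recommendations
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 30),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
    }
    model = RandomForestRegressor(**params, n_jobs=-1, random_state=0)
    # Mean CV score (negative MSE, so higher is better)
    return cross_val_score(model, X_h, y_h, cv=3, scoring='neg_mean_squared_error').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)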

Apart from that, coming to your question: there is no single best value for a hyperparameter per se. It depends on your data; a value that works for one dataset might not work for another.

Another point to keep in mind: a model like a Random Forest has many hyperparameters. Including all of them with a wide range of values in your grid search will take forever. Instead, include only the parameters that give the biggest improvement in your results, i.e. the ones that matter most; a sketch of such a narrowed grid follows below. Here is a link to a blog that might help you: https://blog.dataiku.com/narrowing-the-search-which-hyperparameters-really-matter
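
As an illustration, a narrowed grid along these lines (the values are placeholders, and X_h, y_h are the question's training data) keeps the number of combinations small while still exposing best_params_ and best_score_:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small grid over the parameters that usually matter most for a random forest;
# the values themselves are illustrative only.
param_grid = {
    'n_estimators': [100, 300],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 5, 20],
}

rf = RandomForestRegressor(n_jobs=-1, random_state=0)
search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1)
search.fit(X_h, y_h)

print(search.best_params_)   # best combination found under cross-validation
print(search.best_score_)    # its mean cross-validated score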

Hope it helps!

spectre