
I'm working on a regression problem with 400 samples and 7 features, predicting job durations of machinery from historical data. I'm using XGBoost, and a (90,10) split works better than an (80,20) split. Is this normal? I think I'm overfitting, but I don't know how to check this properly.

(90,10) split: Train R²: 0.99, Test R²: 0.96

(80,20) split: Train R²: 0.91, Test R²: 0.76

I also performed k-fold cross-validation (RandomizedSearchCV), and the train and test results were as follows:

Test set performance: MSE: 219.16, RMSE: 14.80, MAE: 4.31, R²: 0.78

Training set performance: MSE: 11.18, RMSE: 3.34, MAE: 1.87, R²: 0.99

(I should mention that total duration varies a lot between batches (each batch is a unique production), but the average total duration is 33 hours.)
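For reference, a minimal sketch of the kind of comparison I'm describing (simplified; the placeholder data and hyperparameters below are illustrative, not my actual pipeline):

```python
# Minimal sketch of the split comparison described above (assumptions:
# scikit-learn and xgboost are installed; X, y stand in for the real
# 400x7 feature matrix and duration target; hyperparameters are placeholders).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

X, y = np.random.rand(400, 7), np.random.rand(400) * 66  # placeholder data

for test_size in (0.10, 0.20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    model = XGBRegressor(n_estimators=300, max_depth=4, random_state=42)
    model.fit(X_tr, y_tr)
    print(
        f"test_size={test_size:.2f}  "
        f"train R2={r2_score(y_tr, model.predict(X_tr)):.2f}  "
        f"test R2={r2_score(y_te, model.predict(X_te)):.2f}"
    )
```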

— barcamela

1 Answer

There are two effects being traded off with different test split sizes: a larger training set should improve the model, generally speaking, but a smaller test set means the R² score you're reporting is probably a less precise estimate of the model's true performance on future data. Because of the latter, you can't "test" different splits and pick the best score; the split size needs to come from a less data-driven decision: in your context, are 40 data points enough to estimate performance reasonably accurately?
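One rough way to see how noisy a 40-point test score is: repeat the same 90/10 split with different random seeds and look at the spread of test R². A quick sketch, with placeholder data and hyperparameters standing in for your tuned model:

```python
# Sketch: how much does test R2 move when only the random split changes?
# Assumes scikit-learn and xgboost; X, y and hyperparameters are placeholders.
# A wide spread across seeds illustrates that a 40-sample test set gives a
# noisy estimate of true performance.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

X, y = np.random.rand(400, 7), np.random.rand(400) * 66  # placeholder data

scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.10, random_state=seed
    )
    model = XGBRegressor(n_estimators=300, max_depth=4, random_state=0)
    model.fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))

print(f"test R2 over 20 splits: mean={np.mean(scores):.2f}, std={np.std(scores):.2f}")
```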

Probably the best thing to do, especially with a small sample size like this, is to perform repeated testing, e.g. with (repeated) k-fold cross-validation or bootstrapping; see the sketch below.
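A minimal sketch of repeated k-fold cross-validation with scikit-learn (again, the data and hyperparameters are placeholders, not tuned values):

```python
# Sketch of repeated k-fold CV to get a more stable performance estimate
# on ~400 samples. Assumes scikit-learn and xgboost are installed.
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from xgboost import XGBRegressor

X, y = np.random.rand(400, 7), np.random.rand(400) * 66  # placeholder data

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(
    XGBRegressor(n_estimators=300, max_depth=4, random_state=0),
    X, y, scoring="r2", cv=cv,
)
# 5 folds x 10 repeats = 50 out-of-fold R2 values; report mean and spread.
print(f"R2: {scores.mean():.2f} +/- {scores.std():.2f}")
```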

As for overfitting, in the sense of "the models have learned information specific to the training data (noise that does not generalize)": yes, both models are overfit, as evidenced by the drop in scores from train to test. But that is almost always true of GBMs, and it isn't always a problem.
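If you want to quantify that train/validation gap rather than eyeball a single split, one option is cross_validate with return_train_score (sketch with placeholder data and hyperparameters):

```python
# Sketch: quantify the train/validation gap, one way to "check overfitting".
# Assumes scikit-learn and xgboost; X, y and hyperparameters are placeholders.
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from xgboost import XGBRegressor

X, y = np.random.rand(400, 7), np.random.rand(400) * 66  # placeholder data

res = cross_validate(
    XGBRegressor(n_estimators=300, max_depth=4, random_state=0),
    X, y, scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    return_train_score=True,
)
print(f"train R2: {res['train_score'].mean():.2f}  "
      f"validation R2: {res['test_score'].mean():.2f}")
# A large gap means the model memorizes the training data; whether that
# matters depends on whether the validation score itself is good enough.
```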

Lastly, I just wanted to share an issue with sklearn's measurement of test-set R²: https://stats.stackexchange.com/q/590199/232706

— Ben Reiniger