
My training data is extremely class-imbalanced ({0: 872525, 1: 3335}) with 100 features. I use XGBoost to build the classification model and Bayesian optimisation to tune the hyperparameters over the ranges {learning_rate: (0.001, 0.1), min_split_loss: (0, 10), max_depth: (3, 70), min_child_weight: (1, 20), max_delta_step: (1, 20), subsample: (0, 1), colsample_bytree: (0.5, 1), lambda: (0, 10), alpha: (0, 10), scale_pos_weight: (1, 262), n_estimators: (1, 20)}.

I also use binary:logistic as the objective and ROC AUC as the evaluation metric, with the gbtree booster. The cross-validation score is 82.5%. However, when I applied the model to the test data I only got ROC AUC: 75.2%, PR AUC: 15%, log loss: 0.046, and the confusion matrix [[19300, 7], [103, 14]]. I need help finding the best way to increase the true positives, with a tolerance for false positives of up to 3 times the number of actual positives.
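For reference, a minimal sketch of this kind of setup, assuming scikit-optimize's `BayesSearchCV` as the Bayesian optimisation library and synthetic data in place of the real 100-feature training set (the actual tuning library and data loading may differ):

```python
# Sketch only: XGBoost + Bayesian optimisation over the ranges listed above.
# Assumes scikit-optimize (skopt); synthetic data stands in for the real training set.
from sklearn.datasets import make_classification
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

# Placeholder imbalanced data (the real set has 872,525 negatives and 3,335 positives).
X_train, y_train = make_classification(
    n_samples=5000, n_features=100, weights=[0.99], random_state=0
)

search_space = {
    "learning_rate":    Real(0.001, 0.1, prior="log-uniform"),
    "gamma":            Real(0.0, 10.0),      # gamma == min_split_loss
    "max_depth":        Integer(3, 70),
    "min_child_weight": Integer(1, 20),
    "max_delta_step":   Integer(1, 20),
    "subsample":        Real(0.1, 1.0),       # lower bound nudged above 0 (must be > 0)
    "colsample_bytree": Real(0.5, 1.0),
    "reg_lambda":       Real(0.0, 10.0),      # lambda
    "reg_alpha":        Real(0.0, 10.0),      # alpha
    "scale_pos_weight": Real(1.0, 262.0),
    "n_estimators":     Integer(1, 20),
}

opt = BayesSearchCV(
    XGBClassifier(objective="binary:logistic", booster="gbtree"),
    search_space,
    n_iter=20,            # number of Bayesian optimisation rounds
    scoring="roc_auc",    # the metric described in the question
    cv=3,
    random_state=42,
)
opt.fit(X_train, y_train)
print(opt.best_params_, opt.best_score_)
```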


1 Answer


Given an imbalanced dataset and the goal of increasing your true positive rate, it is important to use the right evaluation metric (the one used to validate the model on each evaluation round during training).

In this case, I recommend you use Precision-Recall AUC instead of ROC AUC, so that you force the model to focus on the minority class. A nice post about it can be found here.
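A minimal sketch of how this can be wired in, assuming the scikit-learn XGBoost wrapper; `"aucpr"` is XGBoost's built-in PR-AUC metric, and `"average_precision"` is the closest scikit-learn scorer if you also want the hyperparameter search itself to optimise for it:

```python
# Sketch: evaluate on Precision-Recall AUC instead of ROC AUC.
from xgboost import XGBClassifier

model = XGBClassifier(
    objective="binary:logistic",
    booster="gbtree",
    eval_metric="aucpr",   # PR AUC instead of "auc" (ROC AUC)
)

# If you tune with a scikit-learn style search, change the scorer too, e.g.:
# BayesSearchCV(model, search_space, scoring="average_precision", ...)
```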

Other points to take into account:

  • increase the range of the number of tree estimators (n_estimators) explored in your hyperparameter tuning process
  • set scale_pos_weight to roughly (number of majority-class samples) / (number of minority-class samples), as in the sketch after this list
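A short sketch of both points, using the class counts given in the question (872,525 negatives vs 3,335 positives); the widened n_estimators range is an illustrative choice, not a prescribed value:

```python
# scale_pos_weight is simply the ratio of negative to positive samples.
n_negative, n_positive = 872_525, 3_335
scale_pos_weight = n_negative / n_positive
print(round(scale_pos_weight, 1))  # ~261.6, close to the 262 upper bound already in use

# A wider estimator range for the search, e.g. with skopt:
# "n_estimators": Integer(50, 1000)   # the original (1, 20) caps the ensemble very early
```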
German C M