
I am building a binary classification model on imbalanced data (e.g., 90% positive class vs. 10% negative class).

I already balanced my training dataset to a 50/50 class split, while my holdout (test) dataset was kept similar to the original data distribution (i.e., 90% vs. 10%). My question is about the validation data used during CV hyperparameter tuning. During each CV iteration, should:

1) Both the training and validation folds be balanced,

or

2) The training fold be kept balanced while the validation fold is left imbalanced, reflecting the original data distribution and the holdout dataset.

I am currently using the first option to tune my model; however, is this approach valid given that the holdout and validation datasets have different distributions?
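For concreteness, here is a minimal sketch of what option 2 would look like with scikit-learn and imbalanced-learn (the estimator and parameter grid are just placeholders): an imblearn Pipeline applies the resampling step only when a training fold is fit, so each validation fold keeps the original 90/10 distribution.

    from imblearn.over_sampling import RandomOverSampler
    from imblearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    pipe = Pipeline([
        ("balance", RandomOverSampler(random_state=0)),   # applied to training folds only
        ("clf", RandomForestClassifier(random_state=0)),  # placeholder estimator
    ])

    param_grid = {"clf__max_depth": [3, 5, None]}  # placeholder grid
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    search = GridSearchCV(pipe, param_grid, cv=cv, scoring="average_precision")
    # search.fit(X_train, y_train)  # X_train, y_train: the original, imbalanced training data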

thereandhere1

3 Answers


Both the test and validation datasets should have the same distribution. In that case, the performance metrics on the validation dataset are a good approximation of the performance metrics on the test dataset. The training dataset, however, can have a different distribution, and it is fine, and sometimes helpful, to balance it. On the other hand, balancing the test dataset would lead to a biased estimate of the model's performance, because the test dataset should reflect the original class imbalance. As mentioned at the beginning, the test and validation datasets should have the same distribution; since balancing the test dataset is not allowed, the validation dataset cannot be balanced either.

Additionally, I should mention that when you balance the test dataset, you will get better apparent performance than when testing on an imbalanced dataset, but as explained above a balanced test set does not make sense. The resulting performance figures are not reliable unless you test on an imbalanced dataset with the same class distribution as the actual data.
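As a rough illustration of that last point (hypothetical per-class error rates, not any particular model): precision on the minority class depends on how prevalent that class is in the test set, so the same classifier looks much better on an artificially balanced test set than on the original 90/10 data.

    def minority_precision(minority_share, recall=0.8, fpr=0.1):
        # Hypothetical classifier: 80% recall on the minority class and a 10%
        # false positive rate on the majority class, regardless of test distribution.
        tp = recall * minority_share           # true positives per unit of test data
        fp = fpr * (1.0 - minority_share)      # false positives per unit of test data
        return tp / (tp + fp)

    print(minority_precision(0.10))  # ~0.47 on the original, imbalanced distribution
    print(minority_precision(0.50))  # ~0.89 on an artificially balanced test set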

nimar

In my opinion, the validation set should follow the original imbalanced distribution: the goal is ultimately to apply the model to the real distribution, so the hyperparameters should be chosen to maximize performance on that distribution.

But since I'm not completely sure, I'd suggest trying both options and adopting the one that gives the best performance on the test set.
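A sketch of that comparison, assuming scikit-learn and imbalanced-learn, with synthetic 90/10 data and a logistic regression standing in for the real problem: tune once with everything balanced up front (option 1), once with only the training folds balanced (option 2), and score both on the untouched, imbalanced holdout.

    from imblearn.over_sampling import RandomOverSampler
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=20000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_test, y_tr, y_test = train_test_split(X, y, stratify=y, random_state=0)
    grid = {"clf__C": [0.1, 1, 10]}

    # Option 1: balance everything up front, so training *and* validation folds are 50/50.
    X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)
    opt1 = GridSearchCV(Pipeline([("clf", LogisticRegression(max_iter=1000))]),
                        grid, scoring="average_precision").fit(X_bal, y_bal)

    # Option 2: balance only inside each training fold; validation folds stay imbalanced.
    opt2 = GridSearchCV(Pipeline([("balance", RandomOverSampler(random_state=0)),
                                  ("clf", LogisticRegression(max_iter=1000))]),
                        grid, scoring="average_precision").fit(X_tr, y_tr)

    # Compare both tuned models on the untouched, imbalanced holdout.
    for name, model in [("option 1", opt1), ("option 2", opt2)]:
        print(name, average_precision_score(y_test, model.predict_proba(X_test)[:, 1]))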

Erwan

Just sharing what I believe is the reasoning behind the need for a balanced training dataset (assuming we are talking about supervised classification :) ).

The training dataset is the only piece of data that "teaches" the model how to perform the classification. If you train the model on an unbalanced dataset (A: 90%; B: 10%), the model could be lazy enough to classify everything as A: accuracy will be 90% even though it cannot distinguish A from B, and the loss function won't guide the training steps towards a real ability to generalize. Balancing the training dataset forces the model to learn the underlying reasons for classifying something as A or B.
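A quick sketch of that "lazy" behaviour, with synthetic 90/10 data and scikit-learn's DummyClassifier standing in for the model that always predicts A:

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score

    # Synthetic dataset: ~90% class 0 ("A") and ~10% class 1 ("B").
    X, y = make_classification(n_samples=10000, weights=[0.9, 0.1], random_state=0)

    lazy = DummyClassifier(strategy="most_frequent").fit(X, y)
    pred = lazy.predict(X)

    print(accuracy_score(y, pred))             # ~0.90 accuracy without learning anything
    print(recall_score(y, pred, pos_label=1))  # 0.0 recall on the minority class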

But the validation and test sets should reflect the real distribution of the data. Validation shows when the model is getting too specialized to the training dataset and losing the ability to generalize; it is the bridge between what training produces and the test set, which will hopefully serve as a crystal ball once the model is defined.

The test set is used after model selection, and it is the first time the data give you no hint of how the model should be built. You can see this part of the data as "the future" from the model's perspective.

It's worth mentioning that if the problem is focused on outlier detection, it may require more sophisticated approaches.