
I have an imbalanced dataset with 88 positive samples and 128575 negative samples. I was reluctant to over/undersample the data since it's a biological dataset and I didn't want to introduce synthetic data. I built a Random Forest Classifier with this original dataset and got an F1 score of 0 for the positive class: zero precision, zero recall. I cross-checked the predictions against the test data; the model predicts some positives, none of which are actually positive. Worst possible performance.

So, I tried to oversample the positive class. I upsampled the positives to 1000 samples. To my surprise, the F1 score for the positive class was then 0.97. Then I tried fewer samples: I was able to achieve an F1 score of 0.83 with 200 positive samples, which is just 2.25 times the original number of positive samples.

I would like to know why this occurs. For 88 samples, the F1 score is 0.00 (rounded to two digits); for 200 samples it's 0.83. There is no data leakage, and all the features are engineered. I used the imbalanced-learn module for oversampling. Can someone explain why there is this difference in performance?

3 Answers


As you mentioned in a comment, you are upsampling before splitting off the test set, which leads to data leakage; your scores are not to be trusted. The problem is that a given positive sample may be duplicated and then end up in both the training and the test set. Especially with tree models, the model is then very likely to predict that sample correctly in the test set. The story with SMOTE is similar but, as you pointed out, not quite as severe: SMOTE interpolates between positive samples (see the illustration in the imbalanced-learn docs), so if some of those points land in the training set and some in the test set, you are still more likely to identify those points correctly.

Instead, you should split first and upsample the training set second. Alternatively, set class weights (this has the benefit of being independent of the split). Either way, your test set now has a different class distribution than the training set, so you'll need to adjust the class prediction threshold or adjust the probability predictions; see e.g. "Convert predicted probabilities after downsampling to actual probabilities in classification?". Part of the question here is whether you want actual estimates of the probabilities, or just care about the class predictions.
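A rough sketch of that ordering, assuming `X` and `y` are placeholder names for your feature matrix and labels and the parameter values are arbitrary:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler

# Split first, so the test set is never touched by resampling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Upsample the positive class in the training data only;
# the test set keeps the original, imbalanced distribution.
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_res, y_train_res)

# Alternative: no resampling at all, just reweight the classes.
rf_weighted = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_weighted.fit(X_train, y_train)
```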

There's a serious question about whether resampling techniques are helpful at all. See e.g.
"What is the root cause of the class imbalance problem?"
"When is unbalanced data really a problem in Machine Learning?"
As a first attempt, I would stick with the original data, fit the random forest, and have a look at different thresholds.
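A minimal sketch of that threshold sweep, assuming `X_train`, `X_test`, `y_train`, `y_test` come from a split made before any resampling (the threshold values here are arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Fit on the original, imbalanced training data (no resampling).
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Look at the positive-class probabilities at several thresholds
# instead of the default 0.5 cutoff used by predict().
proba = rf.predict_proba(X_test)[:, 1]
for t in [0.05, 0.10, 0.25, 0.50]:
    preds = (proba >= t).astype(int)
    print(f"threshold={t:.2f}  F1={f1_score(y_test, preds, zero_division=0):.3f}")
```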

In your case, I would worry that 88 positive samples may just not be enough to see a meaningful pattern. (It might be; it depends on how separated the classes are.)

Ben Reiniger

When you train your model without resampling, keeping the imbalanced classes, it learns that the easiest way to classify the data is to label everything negative. From an accuracy perspective (the total number of correctly classified instances divided by the total number of instances), your model then achieves an accuracy of $\frac{128487}{128575}$, or about 99.9%. Essentially, it severely underfits your data by collapsing everything into one class.

Oversampling corrects the imbalance and makes your algorithm work a little harder to figure out the true shape of the data; lumping everything into one category no longer works. You could also have corrected the imbalance by undersampling the negative class (see the sketch below). A typical rule of thumb is to undersample when you have tens of thousands to hundreds of thousands of rows, and to oversample when your data is smaller (tens of thousands of rows or fewer).
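As a rough sketch of undersampling with imbalanced-learn, assuming `X_train` and `y_train` come from a train/test split made before any resampling:

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

# Randomly drop negative rows from the training data until the classes balance;
# sampling_strategy can also be a ratio (e.g. 0.1) instead of a full balance.
rus = RandomUnderSampler(sampling_strategy="auto", random_state=0)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)

rf = RandomForestClassifier(random_state=0)
rf.fit(X_train_under, y_train_under)
```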

Here is a good reference for dealing with class imbalances in machine learning.

Kate Weeks

If I understand the situation described by the OP, the answer is already in the question: the training set is imbalanced. With only a double-digit number of positive samples out of more than 128,000, the model has the most statistical success by simply always predicting negative.

It is not incorrect to over-sample or under-sample data (biological or otherwise), so re-sampling is a perfectly legitimate solution if done carefully. Simple repetition (random oversampling with replacement) works if you really want to avoid synthetic data; otherwise there are a number of techniques for generating synthetic samples as well (e.g. SMOTE).
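For illustration, both options are one-liners in imbalanced-learn; `X_train` and `y_train` are placeholders, and either sampler should only be applied to the training split:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Repetition: duplicate existing positive rows, no synthetic points.
X_rep, y_rep = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# SMOTE: create synthetic positives by interpolating between neighbours.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X_train, y_train)
```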

Jason K Lai