
I have a highly imbalanced dataset: class 0 has 412 samples while class 1 has 67,215. I am classifying it with an MLP. When I use a class weight of 165 for class 0 and 1 for class 1 (in Keras), I get extremely bad results. However, if I oversample the dataset, I get really good results. What is the reason behind this?

girl101

1 Answer


It could be your sampling strategy

If you are oversampling by simply duplicating data from class 0, then it is likely that you are overfitting: the model sees the same data points over and over.
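To make the duplication point concrete, here is a minimal sketch using only the Python standard library. The helper `oversample_by_duplication` and the toy points are hypothetical and are not taken from the question; the sketch only shows that the row count grows while the number of distinct points does not.

```python
import random

def oversample_by_duplication(minority, target_size, seed=0):
    """Grow `minority` to `target_size` rows by re-drawing existing rows."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

# Toy minority class with 4 distinct points (hypothetical values).
minority = [(0.1, 0.2), (0.4, 0.1), (0.3, 0.9), (0.8, 0.5)]
oversampled = oversample_by_duplication(minority, target_size=100)

n_rows = len(oversampled)           # number of rows after oversampling
n_distinct = len(set(oversampled))  # number of distinct points among them
print(n_rows, n_distinct)
```

The model is shown many more rows, but no new information about class 0 is added.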

You could try another oversampling strategy, for example SMOTE or ADASYN. These techniques create new synthetic data points instead of duplicating existing ones, and ADASYN in particular concentrates them close to the decision boundary, so you are less inclined to overfit on "easy" data points.

[Figure: SMOTE and ADASYN]
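The core idea behind SMOTE-style oversampling can be sketched in a few lines of plain Python. This is a simplified illustration, not the imbalanced-learn implementation: the function `smote_like`, its parameters, and the toy points are assumptions made for the example. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic points by interpolating between a random
    minority point and one of its `k` nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy minority class (hypothetical values).
minority = [(0.1, 0.2), (0.4, 0.1), (0.3, 0.9), (0.8, 0.5)]
synthetic = smote_like(minority, n_new=50)
print(len(synthetic), len(set(synthetic)))
```

Unlike plain duplication, the new points are not copies of existing rows. In practice you would use the implementations in imbalanced-learn rather than a hand-written version like this one.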

Another thing you can try is oversampling the minority class and undersampling the majority class at the same time. When choosing a method to do this, pick one that can oversample near decision boundaries and undersample away from decision boundaries. For example, here is SMOTETomek. Notice how the purple and green classes are mainly oversampled and the yellow class is mainly undersampled.

[Figure: SMOTETomek]

These images come from imbalanced-learn which is a Python package you can use for all these sampling strategies.

It could be your pipeline

If you use your oversampled data to test your model's performance, you could be (unintentionally) inflating your results. Make sure that you use the augmented data only for training, and not for validation or testing.

          +-> training set ---> data augmentation --+
          |                                         |
          |                                         +-> model training --+
          |                                         |                    |
all data -+-> validation set -----------------------+                    |
          |                                                              +-> model testing
          |                                                              |
          |                                                              |
          +-> test set --------------------------------------------------+
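The flow in the diagram can be sketched as follows, using only the standard library. The helpers `stratified_split` and `oversample_minority`, the 60/20/20 split, and the toy data are assumptions made for illustration. The point is the order of operations: split first, then augment the training set only.

```python
import random

def stratified_split(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (features, label) rows into train/val/test, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[1], []).append(row)
    train, val, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]
    return train, val, test

def oversample_minority(train, minority_label, seed=0):
    """Duplicate minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority = [r for r in train if r[1] == minority_label]
    majority = [r for r in train if r[1] != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra

# Toy data: each row carries a unique id as its only feature,
# so leakage between the splits is easy to check.
rows = [((i,), 0) for i in range(20)] + [((i,), 1) for i in range(20, 520)]

train, val, test = stratified_split(rows)               # 1. split first
train_aug = oversample_minority(train, minority_label=0)  # 2. augment train only

# Validation and test sets keep the original, imbalanced distribution.
leaked = {r[0] for r in train_aug} & {r[0] for r in val + test}
print(len(train_aug), len(val), len(test), len(leaked))
```

If the oversampling were applied before the split, copies of the same minority sample could end up in both the training and the test set, which makes the scores look better than they are.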
Bruno Lubascher