
Let's suppose that my dataset in a classification problem looks like this:

  1. class A: 50000 observations
  2. class B: 2000 observations
  3. class C: 800 observations
  4. class D: 200 observations

These are some ways which I considered to deal with this imbalanced dataset:

  1. I reject oversampling straight away because it usually makes the model overfit heavily on the minority classes.

  2. Secondly, if I run the classifier on the data as-is, it will over-classify documents into class A, so I reject this method too.

  3. Another approach is undersampling: reduce class A to, say, 4000 documents (I tested this and it gives the best results so far).

  4. However, this way I am losing quite a lot of information. So I am wondering whether building multiple classifiers, each trained on a different set of 4000 class-A documents, is a better solution (although I think this approach resembles the oversampling approach I rejected).

What do you think of method (4) compared to method (3)?
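Method (4) amounts to an ensemble of classifiers, each trained on all minority samples plus a fresh random subset of class A, with their probabilities averaged at prediction time. Here is a minimal sketch on synthetic data; the class sizes (scaled down for speed), the 2-D features, and the choice of `LogisticRegression` are illustrative assumptions, not part of the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for the dataset: 2 features, 4 classes with the
# question's imbalance, scaled down by a factor of 10.
sizes = {"A": 5000, "B": 200, "C": 80, "D": 20}
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(n, 2))
               for i, n in enumerate(sizes.values())])
y = np.concatenate([np.full(n, label) for label, n in sizes.items()])

maj = np.flatnonzero(y == "A")
minority = np.flatnonzero(y != "A")

models = []
for _ in range(5):  # 5 ensemble members, each seeing a different slice of A
    sub = rng.choice(maj, size=400, replace=False)
    idx = np.concatenate([sub, minority])
    models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))

# Combine: average class probabilities across the ensemble.
classes = models[0].classes_
proba = np.mean([m.predict_proba(X) for m in models], axis=0)
pred = classes[np.argmax(proba, axis=1)]
print("ensemble accuracy:", (pred == y).mean())
```

Unlike oversampling, no minority document is duplicated within any single model's training set, which is one argument for why (4) is not quite the same as the approach rejected in (1).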

Outcast
2 Answers


Maybe your (3) could be complemented with oversampling of the other classes. I don't think every oversampling approach leads to overfitting. I agree that undersampling will make you lose information, but the only way to know how bad that is is to check it.
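The combination suggested here (undersample A, oversample the rest toward a common size) can be sketched on the label counts alone. The target of 4000 comes from the question; sampling with replacement is plain duplication, not SMOTE-style synthesis:

```python
import numpy as np

rng = np.random.default_rng(42)
counts = {"A": 50000, "B": 2000, "C": 800, "D": 200}
y = np.concatenate([np.full(n, c) for c, n in counts.items()])

target = 4000
balanced_idx = []
for c in counts:
    idx = np.flatnonzero(y == c)
    if len(idx) >= target:
        # Majority class: undersample without replacement.
        balanced_idx.append(rng.choice(idx, target, replace=False))
    else:
        # Minority class: oversample by duplicating rows.
        balanced_idx.append(rng.choice(idx, target, replace=True))
balanced_idx = np.concatenate(balanced_idx)

for c in counts:
    print(c, (y[balanced_idx] == c).sum())  # 4000 each
```

In practice you would index the feature matrix with `balanced_idx` as well; only the labels are shown here to keep the sketch self-contained.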

Regarding (4), I don't see how you would combine the predictions of these multiple classifiers. You could instead test one binary classifier per class, e.g. classifier1 for A/not-A, classifier2 for B/not-B, and so on. In that case, undersampling could be applied within each binary problem.
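The one-vs-rest idea does have a standard way of combining predictions: pick the class whose binary model is most confident. A minimal sketch on synthetic data, with the negative side undersampled to the positive size as suggested above (sizes and `LogisticRegression` are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
sizes = {"A": 1000, "B": 200, "C": 80, "D": 40}
X = np.vstack([rng.normal(i, 1.0, (n, 2))
               for i, n in enumerate(sizes.values())])
y = np.concatenate([np.full(n, c) for c, n in sizes.items()])

binary = {}
for c in sizes:
    pos = np.flatnonzero(y == c)
    neg = np.flatnonzero(y != c)
    # Undersample the "rest" side so each binary problem is balanced.
    neg = rng.choice(neg, size=min(len(neg), len(pos)), replace=False)
    idx = np.concatenate([pos, neg])
    binary[c] = LogisticRegression(max_iter=1000).fit(X[idx], y[idx] == c)

# Combine: the class whose binary model gives the highest P(positive) wins.
scores = np.column_stack([binary[c].predict_proba(X)[:, 1] for c in sizes])
pred = np.array(list(sizes))[scores.argmax(axis=1)]
print("one-vs-rest accuracy:", (pred == y).mean())
```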

However, all this advice is just speculation. You must test it on your data and inspect the results with more evaluation tools, such as learning curves and feature-importance analysis.

Adelson Araújo

So your second-largest class (B) makes up 4% of your largest class (A). The dataset is highly imbalanced for those two classes alone, never mind the others.

For (1): since you don't have many samples in those classes, especially in C and D, oversampling might still work for B, but only you can know that by trying it.

None of the suggestions above appeals to me. What I suggest is transforming the problem into three binary classifications with the combinations A/B, A/C, A/D, and solving them one by one.

Note that each model would differ from the others, since the B, C and D samples are different.
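The three-model idea above can be sketched as follows. Each model is trained only on its own pair of classes; how to merge the three into one final label is left open, as in the answer. The data is synthetic and `LogisticRegression` with `class_weight="balanced"` (to offset the A-versus-minority imbalance within each pair) is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
sizes = {"A": 2000, "B": 200, "C": 80, "D": 20}
X = np.vstack([rng.normal(i, 1.0, (n, 2))
               for i, n in enumerate(sizes.values())])
y = np.concatenate([np.full(n, c) for c, n in sizes.items()])

pair_models, pair_acc = {}, {}
for other in ("B", "C", "D"):
    # Train only on class A plus the one paired minority class.
    idx = np.flatnonzero((y == "A") | (y == other))
    m = LogisticRegression(max_iter=1000, class_weight="balanced")
    pair_models[other] = m.fit(X[idx], y[idx])
    pair_acc[other] = m.score(X[idx], y[idx])
print(pair_acc)
```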

EDIT: you can treat each of these classifiers like an anomaly detection system, where the minority class plays the role of the anomaly.

Blenz