
I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set, so the data set is quite unbalanced. A plain random forest just tries to mark all test samples as the majority class.

Some good answers about sub-sampling and weighted random forest are given here: What are the implications for training a Tree Ensemble with highly biased datasets?

Which classification methods besides RF handle this problem best?

IgorS

4 Answers

  • Max Kuhn covers this well in Ch. 16 of Applied Predictive Modeling.
  • As mentioned in the linked thread, imbalanced data is essentially a cost-sensitive training problem, so any cost-sensitive approach is applicable to imbalanced data.
  • There are a large number of such approaches, though not all are implemented in R: C5.0 and weighted SVMs are options (see the sketch after this list), as is JOUS-Boost; RUSBoost, I believe, is only available as Matlab code.
  • I don't use Weka, but I believe it has a large number of cost-sensitive classifiers.
  • Kotsiantis, S., Kanellopoulos, D., and Pintelas, P., "Handling imbalanced datasets: A review".
  • Guo, X., Yin, Y., Dong, C., Yang, G., and Zhou, G., "On the Class Imbalance Problem".
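
A minimal sketch of the weighted-SVM option, assuming scikit-learn rather than R; class_weight="balanced" is one way to encode the misclassification costs, and the synthetic data merely mimics the question's 10:1 ratio:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Synthetic ~10:1 imbalanced problem, similar in shape to the question's data
X, y = make_classification(n_samples=11000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# class_weight="balanced" scales the error penalty inversely to class
# frequency, so mistakes on the rare positive class cost roughly 10x more
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```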
charles

Undersampling the majority class is usually the way to go in such situations.

If you think that you have too few instances of the positive class, you may instead oversample it, for example by drawing 5n instances with replacement from a positive set of size n.
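
A minimal sketch of both resampling schemes in plain numpy (assuming a feature matrix X and label vector y as arrays; this is index-based resampling, not any particular library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
pos = np.where(y == 1)[0]   # ~1000 minority indices
neg = np.where(y == 0)[0]   # ~10000 majority indices

# Undersampling: keep only a random subset of the majority class
neg_down = rng.choice(neg, size=len(pos), replace=False)
idx_under = np.concatenate([pos, neg_down])

# Oversampling: draw minority indices with replacement until classes balance
pos_up = rng.choice(pos, size=len(neg), replace=True)
idx_over = np.concatenate([pos_up, neg])

X_under, y_under = X[idx_under], y[idx_under]
X_over, y_over = X[idx_over], y[idx_over]
```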

Caveats:

  • Some methods are sensitive to changes in the class distribution; for Naive Bayes, for example, resampling distorts the estimated prior probabilities.
  • Oversampling may lead to overfitting, because the same minority instances are repeated many times.
Alexey Grigorev

Gradient boosting is also a good choice here; you can use the gradient boosting classifier in scikit-learn, for example. Gradient boosting handles class imbalance in a fairly principled way, since each successive tree concentrates on the examples the current ensemble classifies poorly.
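
For example, a minimal sketch (X_train, y_train, X_test are assumed to exist; the sample_weight reweighting is an assumption on top of the answer, one common way to compensate further for the imbalance):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=0)

# Optionally upweight the rare positives so early trees don't ignore them
w = compute_sample_weight(class_weight="balanced", y=y_train)
gbc.fit(X_train, y_train, sample_weight=w)

proba = gbc.predict_proba(X_test)[:, 1]  # scores for the positive class
```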

cwharland

In addition to the answers posted here: if the number of positive examples is far too small compared to the negative examples, the task comes close to an anomaly detection problem in which the positive examples are the anomalies.

There is a whole range of methods for detecting anomalies, for example modeling all the points with a multivariate Gaussian distribution and then flagging those that lie 2 or 3 standard deviations from the mean.
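
A rough sketch of that idea, fitting the Gaussian to the majority class only and using Mahalanobis distance as the multivariate analogue of "standard deviations from the mean"; the 3.0 cutoff is an assumed value you would tune on validation data:

```python
import numpy as np

# Fit a Gaussian to the majority ("normal") class only
X_neg = X_train[y_train == 0]
mu = X_neg.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_neg, rowvar=False))

def mahalanobis(x):
    """Distance from the mean in units of the fitted covariance."""
    d = x - mu
    return np.sqrt(d @ cov_inv @ d)

# Flag points more than 3 "standard deviations" (Mahalanobis units) out
dists = np.array([mahalanobis(x) for x in X_test])
is_anomaly = dists > 3.0
```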

One more thought: I have seen quite a few people randomly undersample the negative examples so that the two classes end up equal in size. Whether you want them balanced at all depends entirely on the problem at hand.

Ram