
I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set, so the data set is quite unbalanced. A plain random forest just tries to mark all test samples as the majority class.

Some good answers about sub-sampling and weighted random forest are given here: What are the implications for training a Tree Ensemble with highly biased datasets?

Which classification methods besides RF handle this problem best?

IgorS

4 Answers

  • Max Kuhn covers this well in Ch. 16 of Applied Predictive Modeling.
  • As mentioned in the linked thread, imbalanced data is essentially a cost-sensitive training problem, so any cost-sensitive approach is applicable to imbalanced data.
  • There are a large number of such approaches, though not all are implemented in R: C5.0 and weighted SVMs are options (see the sketch after this list), as is JOUS-Boost; RUSBoost, I believe, is only available as Matlab code.
  • I don't use Weka, but I believe it has a large number of cost-sensitive classifiers.
  • Kotsiantis, S., Kanellopoulos, D., and Pintelas, P., "Handling imbalanced datasets: A review".
  • Guo, X., Yin, Y., Dong, C., Yang, G., and Zhou, G., "On the Class Imbalance Problem".
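
A minimal sketch of the weighted-SVM option, assuming scikit-learn rather than R; class_weight="balanced" is one way to encode the misclassification costs, and the synthetic data merely mimics the question's 10:1 ratio:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Synthetic ~10:1 imbalanced problem, similar in shape to the question's data
X, y = make_classification(n_samples=11000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# class_weight="balanced" scales the error penalty inversely to class
# frequency, so mistakes on the rare positive class cost roughly 10x more
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```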
charles

Undersampling the majority class is usually the way to go in such situations.

If you think that you have too few instances of the positive class, you may instead oversample it, for example by drawing 5n instances with replacement from a positive set of size n.
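
A minimal sketch of both resampling schemes in plain numpy (assuming a feature matrix X and label vector y as arrays; this is index-based resampling, not any particular library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
pos = np.where(y == 1)[0]   # ~1000 minority indices
neg = np.where(y == 0)[0]   # ~10000 majority indices

# Undersampling: keep only a random subset of the majority class
neg_down = rng.choice(neg, size=len(pos), replace=False)
idx_under = np.concatenate([pos, neg_down])

# Oversampling: draw minority indices with replacement until classes balance
pos_up = rng.choice(pos, size=len(neg), replace=True)
idx_over = np.concatenate([pos_up, neg])

X_under, y_under = X[idx_under], y[idx_under]
X_over, y_over = X[idx_over], y[idx_over]
```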

Caveats:

  • Some methods are sensitive to changes in the class distribution; for Naive Bayes, for example, resampling distorts the estimated prior probabilities.
  • Oversampling may lead to overfitting, because the same minority instances are repeated many times.
Alexey Grigorev

Gradient boosting is also a good choice here; you can use the gradient boosting classifier in scikit-learn, for example. Gradient boosting handles class imbalance in a fairly principled way, since each successive tree concentrates on the examples the current ensemble classifies poorly.
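
For example, a minimal sketch (X_train, y_train, X_test are assumed to exist; the sample_weight reweighting is an assumption on top of the answer, one common way to compensate further for the imbalance):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=0)

# Optionally upweight the rare positives so early trees don't ignore them
w = compute_sample_weight(class_weight="balanced", y=y_train)
gbc.fit(X_train, y_train, sample_weight=w)

proba = gbc.predict_proba(X_test)[:, 1]  # scores for the positive class
```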

cwharland

In addition to the answers posted here: if the number of positive examples is far too small compared to the negative examples, the task comes close to an anomaly detection problem in which the positive examples are the anomalies.

There is a whole range of methods for detecting anomalies, for example modeling all the points with a multivariate Gaussian distribution and then flagging those that lie 2 or 3 standard deviations from the mean.
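
A rough sketch of that idea, fitting the Gaussian to the majority class only and using Mahalanobis distance as the multivariate analogue of "standard deviations from the mean"; the 3.0 cutoff is an assumed value you would tune on validation data:

```python
import numpy as np

# Fit a Gaussian to the majority ("normal") class only
X_neg = X_train[y_train == 0]
mu = X_neg.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_neg, rowvar=False))

def mahalanobis(x):
    """Distance from the mean in units of the fitted covariance."""
    d = x - mu
    return np.sqrt(d @ cov_inv @ d)

# Flag points more than 3 "standard deviations" (Mahalanobis units) out
dists = np.array([mahalanobis(x) for x in X_test])
is_anomaly = dists > 3.0
```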

One more thought: I have seen quite a few people randomly undersample the negative examples so that the two classes end up equal in size. Whether you want them balanced at all depends entirely on the problem at hand.

Ram