
When using machine learning models like gradient boosted trees and CNNs, is it required (or considered an always-do good practice) to balance the number of positive/negative examples when training a binary classifier?

Given P positive examples and N negative examples, where P << N, I can think of several choices (let's forget about the validation and test sets):

Choice A) No balancing at all: put all examples (P+N in total) into the training set without any weighting with respect to the class ratio.

Choice B) Put all examples (P+N in total) into the training set, but weight each positive example 1/(2P) and each negative example 1/(2N), so that the total weight of the positive examples equals the total weight of the negative examples.

Choice C) Take all P positive examples, then sample P negative examples (out of the N), and train on these 2P examples with uniform weighting.

What are the pros/cons of each approach, and which one(s) do we usually go with? A small sketch of how I picture each choice is below.
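For concreteness, here is how I imagine the weights/samples for each choice being set up (NumPy only; the counts and names are just placeholders, not real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder counts with P << N.
P, N = 100, 10_000
y = np.concatenate([np.ones(P, dtype=int), np.zeros(N, dtype=int)])

# Choice A: every example gets the same weight.
w_a = np.ones(y.size)

# Choice B: positives get 1/(2P), negatives 1/(2N); each class sums to 1/2.
w_b = np.where(y == 1, 1.0 / (2 * P), 1.0 / (2 * N))

# Choice C: all P positives plus P randomly sampled negatives, uniform weights.
keep = np.concatenate([np.flatnonzero(y == 1),
                       rng.choice(np.flatnonzero(y == 0), size=P, replace=False)])
y_c = y[keep]
```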

Roy

1 Answer


Let's start by answering your first question. Is it required to balance the dataset?

Absolutely. The reason is simple: if you fail to do so, you end up with algorithmic bias. If you train your classifier without balancing, it has a high chance of favoring the class with the most examples. This is especially the case with boosted trees, and even ordinary decision trees generally show the same effect. So it is always important to balance the dataset.
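If you want to see this effect on your own data, a quick check is to train on the imbalanced set and look at per-class recall rather than overall accuracy. A minimal sketch, assuming scikit-learn; the dataset and parameter values here are made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: roughly 99% of samples in class 0, 1% in class 1.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier().fit(X_tr, y_tr)

# Per-class precision/recall makes any majority-class bias visible
# even when overall accuracy looks high.
print(classification_report(y_te, clf.predict(X_te), digits=3))
```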

Now let's discuss the three scenarios you laid out.

Choice A): This is the situation I described above. I'm not saying you will necessarily get a bias; it depends on the dataset itself. If the classes are separated by a very clear boundary, the chance of misclassification is reduced and you might get a decent result, but it is still not recommended. If the data does not have good boundaries, the misclassification rate rises a lot.

Choice B): By placing weights on each sample you are trying to overcome the bias with a penalty. This is also called an asymmetric method. Normally these methods increase the accuracy of a model by a slight margin, but that mostly depends on the learning algorithm you are using. With AdaBoost, for example, this kind of weighting improves the model's effectiveness; the method is known as asymmetric AdaBoost. But it might not work equally well with all algorithms.
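As a rough sketch of Choice B in practice, assuming scikit-learn's GradientBoostingClassifier (which accepts a sample_weight argument at fit time; many other libraries expose something similar, such as a class-weight or scale-pos-weight setting), with a toy dataset standing in for your own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def balanced_sample_weights(y):
    """Choice B: weight positives 1/(2P) and negatives 1/(2N)."""
    y = np.asarray(y)
    P, N = (y == 1).sum(), (y == 0).sum()
    return np.where(y == 1, 1.0 / (2 * P), 1.0 / (2 * N))

# Toy imbalanced data for illustration; replace with your own X, y.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)

clf = GradientBoostingClassifier()
clf.fit(X, y, sample_weight=balanced_sample_weights(y))
```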

Choice C): Assuming you have sampled the classes to equal size, this should end up doing much the same as Choice A or Choice B. I'll leave this for you to extrapolate based on my previous explanations.
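A minimal sketch of Choice C, i.e. randomly undersampling the negatives down to the number of positives (plain NumPy; the toy data at the end is only for illustration):

```python
import numpy as np

def undersample_negatives(X, y, seed=0):
    """Choice C: keep all positives plus an equal-sized random subset of negatives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep = np.concatenate([pos, rng.choice(neg, size=pos.size, replace=False)])
    rng.shuffle(keep)  # avoid all positives appearing first in the training set
    return X[keep], y[keep]

# Toy usage: 100 positives and 10,000 negatives reduce to 200 balanced examples.
y = np.concatenate([np.ones(100, dtype=int), np.zeros(10_000, dtype=int)])
X = np.random.default_rng(1).normal(size=(y.size, 5))
X_bal, y_bal = undersample_negatives(X, y)
```

The obvious trade-off of this route is that you throw away most of the negative examples, which is why the weighting route of Choice B is often preferred when the negatives carry useful variety.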

user-116