
I have a model that does binary classification.

My dataset is highly imbalanced, so I thought I should balance it by undersampling before training the model: balance the whole dataset first, then split it randomly. Is this the right way, or should I balance the train and test sets separately?

I tried balancing only the whole dataset and got 80% train accuracy, but only 30% accuracy on the test set. That doesn't seem right.

But I also don't think I should balance the test set, since that could be considered a form of bias.

What is the right way to do this?

Thanks

UPDATE: I have 400,000 samples; 10% are 1s and 90% are 0s. I cannot get more data. I tried keeping the whole dataset, but I don't know how to split it into train and test sets. Do I need the same class distribution in the train and test sets?

lads

4 Answers


The best way is to collect more data, if you can.

Sampling should only ever be done on the training set. If you are using Python, scikit-learn has useful utilities to help you with this. A purely random split is a bad option here; try stratified sampling instead, which splits each class proportionally between the training and test sets.
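
A minimal sketch of a stratified split with scikit-learn's `train_test_split`, on a synthetic dataset with roughly the 10%/90% ratio from the question (the data itself is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.10).astype(int)  # ~10% ones, like the question

# stratify=y keeps the class ratio (nearly) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(y_train.mean(), y_test.mean())  # both close to the overall positive rate
```

With `stratify=y` the positive rate in the train and test sets matches the overall rate up to rounding, which a purely random split does not guarantee.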

Run oversampling, undersampling, or hybrid techniques on the training set only. Also, if you are using scikit-learn's logistic regression, there is a parameter called `class_weight`; set it to `"balanced"`.
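
A short sketch of the `class_weight="balanced"` option, which reweights the loss inversely to class frequency instead of resampling the data. The dataset here is synthetic and only illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
y = (rng.random(n) < 0.10).astype(int)      # ~10% positives
X = rng.normal(size=(n, 2)) + y[:, None] * 1.5  # shift positives so the task is learnable

# "balanced" weights each class inversely to its frequency in y
clf = LogisticRegression(class_weight="balanced").fit(X, y)

pred = clf.predict(X)
minority_recall = pred[y == 1].mean()  # fraction of 1s actually caught
print(minority_recall)
```

Without the weighting, the classifier tends to favour the majority class and minority recall drops noticeably.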

The choice of evaluation metric also plays a very important role in model selection. Accuracy is rarely informative on an imbalanced dataset. Try area under the ROC curve, or precision and recall, depending on your needs: do you want to give more weight to the false positive rate or to the false negative rate?
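
A quick sketch of why accuracy misleads on a 90/10 split: a classifier that always predicts the majority class scores ~90% accuracy while its recall is zero. The labels and scores below are synthetic:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(2)
y_true = (rng.random(1000) < 0.10).astype(int)  # ~10% positives

# A useless classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)
print(accuracy_score(y_true, y_pred))                  # ~0.90, looks great
print(recall_score(y_true, y_pred, zero_division=0))   # 0.0, reveals the problem
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 as well

# ROC AUC needs scores; random scores give ~0.5 (no skill)
scores = rng.random(1000)
print(roc_auc_score(y_true, scores))
```

Recall, precision, and ROC AUC all expose the degenerate classifier that accuracy hides.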

aathiraks

Your problem is very common, and many data scientists struggle with this kind of issue.

In this blog post, the author explains very nicely what to do. These are the main points:

1. Can You Collect More Data?

2. Try Changing Your Performance Metric:

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.

There are metrics that have been designed to tell you a more truthful story when working with imbalanced classes.

  • Precision: a measure of a classifier's exactness.

  • Recall: a measure of a classifier's completeness.

  • F1 Score (or F-score): a weighted average of precision and recall.

3. Resampling Your Dataset

You can change the dataset that you use to build your predictive model to have more balanced data.

This change is called sampling your dataset and there are two main methods that you can use to even-up the classes:

  • You can add copies of instances from the under-represented class called over-sampling (or more formally sampling with replacement), or

  • You can delete instances from the over-represented class, called under-sampling.
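
Both methods can be sketched with scikit-learn's `resample` utility. Here the labels are synthetic and the indices stand in for full rows of a dataset:

```python
import numpy as np
from sklearn.utils import resample

y = np.array([0] * 900 + [1] * 100)  # 90/10 imbalance, as in the question
idx = np.arange(len(y))
minority, majority = idx[y == 1], idx[y == 0]

# Over-sampling: draw minority rows with replacement up to the majority size
over = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_over = np.concatenate([majority, over])

# Under-sampling: drop majority rows down to the minority size (no replacement)
under = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_under = np.concatenate([under, minority])

print(len(balanced_over), len(balanced_under))  # 1800, 200
```

Over-sampling keeps all the data but duplicates minority rows; under-sampling discards majority rows, which is why it should only touch the training set.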

4. Generate Synthetic Samples

A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.
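
A numpy sketch of that naive idea: each new minority row takes each attribute independently from a random existing minority row. The `synth_samples` helper and the data are made up for illustration (SMOTE, discussed later, is a more principled variant):

```python
import numpy as np

rng = np.random.default_rng(4)
minority = rng.normal(size=(50, 4))  # 50 minority-class rows, 4 features

def synth_samples(X, n, rng):
    # For each new row, pick each column value from a random existing row
    rows = rng.integers(0, len(X), size=(n, X.shape[1]))
    cols = np.arange(X.shape[1])
    return X[rows, cols]  # element [i, j] = X[rows[i, j], j]

new = synth_samples(minority, 200, rng)
print(new.shape)  # (200, 4)
```

Note the caveat: sampling attributes independently ignores correlations between features, so the synthetic rows may not look like real minority instances.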

5. Try Different Algorithms

As always, I strongly advise you not to use your favorite algorithm on every problem. You should at least spot-check a variety of different types of algorithms on a given problem.

Gal Dreiman

It all depends on what your objective is. Do you aim for precision or recall?

You are right that the distribution of your training data (depending, as always, on the model and its hyper-parameters) will bias your model accordingly. Supplying a training set where most of the instances (e.g. 90%) are labelled 0 will probably lead the model to label most of the test set as 0 as well. Hence, if you want to detect the 1s, you should bias the sample to contain more of them. There are many ways of doing that beyond simply changing your training distribution. Firstly, oversampling, undersampling, or, even better, using ensemble models where each model sees all the 1s and a different subset of the 0s. Secondly, depending on the classifier of choice, you can tune various hyper-parameters that constrain the majority class from taking over.
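
The ensemble idea above (sometimes called "under-bagging") can be sketched as follows: each base model trains on every positive example plus a different random subset of negatives, and predicted probabilities are averaged. The data and model choice here are illustrative assumptions, not the answerer's exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000
y = (rng.random(n) < 0.10).astype(int)          # ~10% positives
X = rng.normal(size=(n, 2)) + y[:, None] * 1.5  # learnable synthetic task

pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
models = []
for seed in range(5):
    # Each model: all the 1s plus an equal-sized random subset of the 0s
    sub = np.random.default_rng(seed).choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, sub])
    models.append(LogisticRegression().fit(X[idx], y[idx]))

# Average predicted probabilities across the ensemble
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
pred = (proba >= 0.5).astype(int)
print(pred[y == 1].mean())  # minority-class recall
```

Because every base model sees a balanced sample, none of them collapses to the majority class, yet no majority data is discarded by the ensemble as a whole.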

20-roso

As mentioned in most of the answers, there are various ways of dealing with skewed data. I would just like to highlight that SMOTE is one of the recommended ways to overcome this skewness.
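
A minimal numpy sketch of SMOTE's core idea: create a synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbours. A full implementation (with proper neighbour handling and edge cases) is provided by the imbalanced-learn package; the `smote_like` helper below is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
minority = rng.normal(size=(30, 3))  # toy minority-class rows

def smote_like(X, n_new, k=5, rng=rng):
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1 : k + 1]  # k nearest neighbours, excluding the point itself
        j = rng.choice(nn)
        lam = rng.random()             # interpolation factor in [0, 1)
        out.append(X[i] + lam * (X[j] - X[i]))
    return np.array(out)

new = smote_like(minority, 60)
print(new.shape)  # (60, 3)
```

Unlike naive duplication, the interpolated points are new but stay inside the region occupied by the minority class.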

Rahul Sharma