
I have a labelled training dataset DS1 with 1000 entries. The targets (True/False) are nearly balanced. With sklearn, I have tried several algorithms, of which the GradientBoostingClassifier works best with F-Score ~0.83.

Now, I have to apply the trained classifier to an unlabelled dataset DS2 with ~5 million entries (and the same features). However, for DS2, the target distribution is expected to be highly unbalanced.

Is this a problem? Will the model reproduce the trained target distribution from DS1 when applied on DS2?

If yes, would another algorithm be more robust?

Stephen Rauch

3 Answers


Is this a problem?

No, not at all.

Will the model reproduce the trained target distribution from DS1 when applied on DS2?

No, not necessarily. If

  1. the balanced set DS1 is a good representative of the imbalanced (target) set DS2, and
  2. the classes are well-separated (pointed out by @BenReiniger), which tends to hold more easily in higher dimensions,

then the model will generate labels with a ratio close to that of the imbalanced DS2, not the balanced DS1. Here is a visual example (drawn by myself):

[figure: a sketch comparing a balanced training set with an imbalanced real-world set]

As you can see, with a good training set the predictions resemble the real-world ratio even though the model was trained on a balanced set, provided, of course, that the classifier does a good job of finding the decision boundaries.
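The claim above can be checked with a quick experiment (a minimal sketch on synthetic data; the `make_classification` parameters, such as `class_sep=2.0`, are assumptions chosen to make the classes well-separated):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# One large, well-separated, imbalanced pool (DS2-like: ~5% positives).
X, y = make_classification(n_samples=200_000, n_features=10,
                           n_informative=5, class_sep=2.0, flip_y=0.0,
                           weights=[0.95, 0.05], random_state=0)

# Carve out a balanced DS1-like training set: 500 of each class.
rng = np.random.RandomState(0)
pos = rng.choice(np.where(y == 1)[0], 500, replace=False)
neg = rng.choice(np.where(y == 0)[0], 500, replace=False)
train_idx = np.concatenate([pos, neg])
test_mask = np.ones(len(y), dtype=bool)
test_mask[train_idx] = False

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X[train_idx], y[train_idx])
pred = clf.predict(X[test_mask])

print(f"positive rate in the held-out pool: {y[test_mask].mean():.3f}")
print(f"positive rate in predictions:       {pred.mean():.3f}")
```

With well-separated classes, the predicted positive rate stays near the pool's ~5% rather than the 50% of the training set; as the classes overlap more (smaller `class_sep`), the predictions drift back toward the balanced training ratio.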

Esmailian

For prediction, the GradientBoostingClassifier only takes into account the features you fed it during training, and it then classifies each observation on its own. That means you usually don't have to worry about the target distribution of your prediction dataset, as long as you trained your model on a sufficiently extensive training dataset.
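Since each observation is scored independently, a DS2 of millions of rows can also be predicted in chunks that fit in memory. A small illustration (the datasets here are hypothetical stand-ins generated with `make_classification`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-ins for DS1 (labelled) and DS2 (unlabelled).
X1, y1 = make_classification(n_samples=1000, random_state=0)
X2, _ = make_classification(n_samples=50_000, random_state=1)

clf = GradientBoostingClassifier(random_state=0).fit(X1, y1)

# Each row is scored on its own, so processing DS2 chunk by chunk
# gives exactly the same result as one big predict() call.
chunked = np.concatenate(
    [clf.predict(chunk) for chunk in np.array_split(X2, 10)])
assert np.array_equal(chunked, clf.predict(X2))
```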

Georg Unterholzner

A GBM will ultimately try to split your data into rectangular regions and assign each one a constant predicted probability, the proportion of positive training examples in that region. So yes, on the whole the model has baked in the training sample's average response rate.

I think that effect will be lessened if your data is particularly cleanly separable: if each rectangular region is pure, and your test data just happens to be more heavily inclined toward the negative regions, then it will naturally get closer to "the right" answer.

I'm not sure which other models would be more robust in this way; an SVM, probably, since it is not naturally probabilistic in the first place.

If your context is downsampling, logistic regression has a well-known adjustment for exactly this problem. The same adjustment (to log-odds) seems likely to help in the GBM context as well, though I'm not aware of any analysis to back it up.
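A sketch of that adjustment, under the assumption that the negative class was downsampled by keeping only a fraction `beta` of it (the function name is mine): shifting the predicted log-odds by log(beta) undoes the inflation of the positive-class odds, for a reasonably calibrated model.

```python
import numpy as np

def correct_downsampled_probs(p, beta):
    """Prior-correction for probabilities from a model trained on data
    where the negative class was downsampled, keeping a fraction
    `beta` of it. Downsampling inflates the positive-class odds by
    1/beta, so adding log(beta) to the log-odds reverses it."""
    log_odds = np.log(p) - np.log1p(-p)
    corrected = log_odds + np.log(beta)
    return 1.0 / (1.0 + np.exp(-corrected))

# A score of 0.5 from a model trained with 1-in-10 negatives
# corresponds to odds of 0.1, i.e. a probability of 1/11.
print(correct_downsampled_probs(np.array([0.5]), beta=0.1))
```

With `beta=1` (no downsampling) the function is the identity, as expected.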

Ben Reiniger