Inspired by @Dave's question "Why does data science see class imbalance as a problem for supervised learning when statistics does not?", I am re-posting a question I posed on the stats SE to see if there is a useful answer from the Data Science community. If class imbalance is seen as a problem in Data Science, shouldn't there be a diagnostic procedure to detect it?

In cases where there is a substantial difference in relative class frequencies, it could be that the density of the minority class is never higher than the density of the majority class anywhere in the attribute space. Here is a simple example using univariate Gaussian classes, with an imbalance ratio of 1:9.

[Figure: univariate Gaussian example with an imbalance ratio of 1:9]

In this case, if my classifier assigns all patterns to the majority class, it is doing exactly the right thing, and there is no problem to solve.
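To make the example concrete, here is a minimal sketch of that situation. The means and standard deviations are my own assumptions for illustration (the parameters behind the figure are not stated); only the 1:9 ratio comes from the example. The minority class is deliberately given the smaller variance, because with equal variances and different means the minority would still win somewhere in a far tail. The sketch compares the prior-weighted densities on a grid and counts how often the Bayes-optimal rule would pick the minority class.

```python
import math

# Hypothetical parameters, chosen for illustration only.
PRIOR_MAJ, PRIOR_MIN = 0.9, 0.1   # 1:9 imbalance ratio, as in the example
MU_MAJ, SD_MAJ = 0.0, 1.0         # assumed majority-class Gaussian
MU_MIN, SD_MIN = 0.5, 0.8         # assumed minority-class Gaussian (narrower)


def normal_pdf(x, mu, sd):
    """Density of a univariate Gaussian N(mu, sd^2) at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def weighted_densities(x):
    """Prior-weighted class densities (majority, minority) at x."""
    return (PRIOR_MAJ * normal_pdf(x, MU_MAJ, SD_MAJ),
            PRIOR_MIN * normal_pdf(x, MU_MIN, SD_MIN))


def bayes_label(x):
    """Bayes-optimal label under 0-1 loss: 1 = minority, 0 = majority."""
    maj, mnr = weighted_densities(x)
    return 1 if mnr > maj else 0


# Evaluate on a grid over [-10, 10] in steps of 0.01.
grid = [i / 100.0 for i in range(-1000, 1001)]
max_ratio = max(mnr / maj for maj, mnr in map(weighted_densities, grid))
minority_predictions = sum(bayes_label(x) for x in grid)

print(f"largest minority/majority ratio on grid: {max_ratio:.3f}")
print(f"grid points assigned to the minority class: {minority_predictions}")
```

Under these assumed parameters the ratio stays below 1 across the grid, so a classifier that always predicts the majority class agrees with the Bayes-optimal rule.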

Here we know the true data-generating process, so we know that the classifier is doing the right thing. In general, however, we don't know the true distributions of the positive and negative classes, so we can't tell whether the classifier is doing the right thing or not.

So my question is: in practical applications, how do we decide whether we have a class imbalance problem, or whether the classifier is just giving the correct answer to the question as posed?

Full disclosure: my intuition is that in most cases, especially when the data are not unduly scarce, the classifier is doing exactly what it should do and there is no class imbalance problem. I am primarily interested in hearing how other practitioners and researchers diagnose class imbalance problems.

Dikran Marsupial