
I see everywhere that when the dataset is imbalanced, PR-AUC is a better performance indicator than ROC-AUC. In my experience, though, if the positive class is the important one and it makes up a higher percentage of the dataset than the negative class, then PR-AUC seems to be very biased: the higher the percentage of the positive class, the higher (more inflated) the PR-AUC score. Would it make sense to say that PR-AUC is a good metric for imbalanced datasets when the positive class is a small fraction compared to the negative class, and that ROC-AUC is the better performance indicator when the positive class is much larger than the negative class?

For example, I have a model tested on a dataset where the positive class is 88%, i.e. higher than the negative class: PR-AUC is 99% and ROC-AUC is 87%. When the same model is tested on a dataset where the negative class is the larger one, PR-AUC drops to 67% and ROC-AUC to 76%. This second case aligns with the literature (see my first comment). I have tested many test sets, and I agree that PR-AUC is less biased when the negative class outweighs the positive class, but it looks biased when the positive class outweighs the negative class (it gives me 99% performance). Please note that training is done with an undersampling technique to deal with the imbalance.

Thank you in advance.

Vicky

3 Answers


The way to go is to take costs into account during learning: https://scikit-learn.org/dev/auto_examples/model_selection/plot_cost_sensitive_learning.html

For a start you can use target imbalance as a proxy for cost imbalance and weight classes accordingly.

Then choose a metric that is relevant for your problem. Notably, most of the time you will ultimately have to take a decision, and the curves you mention average over all binary thresholds; you are probably better off selecting one threshold.
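
A minimal sketch of what class weighting could look like in scikit-learn, assuming a hypothetical LogisticRegression model and synthetic data with roughly the 88% positive rate from the question (the dataset and model choice are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical dataset with ~88% positive class, like in the question.
X, y = make_classification(n_samples=5000, weights=[0.12, 0.88], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" uses inverse class frequencies as weights, i.e. the
# target imbalance stands in for the (unknown) cost imbalance. If you can
# estimate real costs, pass an explicit dict such as class_weight={0: 5.0, 1: 1.0}.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
```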

Lucas Morin

Is it a good idea to use AUPR when the positive class is the majority class?

To simplify, let's say your dataset is 90% positive class and has 1000 samples in total. If we consider a naive classifier that classifies everything as class 1 (the positive class), then:

TP = 900, FP = 100 and FN = TN = 0,
so Recall = TP / (TP + FN) = 1.0
and Precision = TP / (TP + FP) = 0.9.
We see that Precision & Recall look very good but obviously this is not a good classifier.
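
As a quick check of those numbers (a hypothetical 1000-sample array with 900 positives and a classifier that always predicts the positive class):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 900 positives, 100 negatives; the naive model predicts the positive class everywhere.
y_true = np.array([1] * 900 + [0] * 100)
y_pred = np.ones_like(y_true)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 900 100 0 0
print(recall_score(y_true, y_pred))     # 1.0
print(precision_score(y_true, y_pred))  # 0.9
```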

In the same way, AUPR (aka PR-AUC, the area under the Precision/Recall curve) is not very useful here, because AUPR > 0.9: the curve plots Precision against Recall, and when Recall = 0 we have Precision = 1, while when Recall = 1 we have Precision = the ratio of the positive class (0.9), which corresponds to the naive model in the previous paragraph. To better visualize this, let's take an example.

To illustrate the point, the plot below is taken from the scikit-learn website. For the present case you have to imagine the curve translated upward so that the right-most point sits at coordinate (1, 0.9) instead of (1, 0.52): in the scikit-learn example the data is roughly balanced, so the proportion of positives is almost 0.5, whereas here it is about 0.9.

[Precision-Recall curve example plot from the scikit-learn documentation]

So we see that for all models (even the bad ones) the AUPR will be quite high (>0.9).
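
To make the same point numerically (with hypothetical data again): with 90% positives, even completely random scores give an average precision close to 0.9, while ROC-AUC, anticipating the next section, stays around 0.5 on the same scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([1] * 900 + [0] * 100)   # 90% positive class
random_scores = rng.random(1000)           # a "model" with no skill at all

print(average_precision_score(y_true, random_scores))  # ~0.9, i.e. the positive prevalence
print(roc_auc_score(y_true, random_scores))            # ~0.5, unaffected by the prevalence
```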

What about ROC-AUC?

The ROC curve plots TPR (True Positive Rate) against FPR (False Positive Rate). For the naive model, TPR = TP/(TP+FN) = 1 and FPR = FP/(FP+TN) = 1, so we have a False Positive Rate of 1, which shows that this is not a good classifier. As a consequence, in this case ROC-AUC is more useful than AUPR.

But it is not that useful either, because it treats both classes as equally important, which in most cases is not the reality.

Moreover, AUC and AUPR are essentially an average over all decision thresholds (between 0 and 1) for a given algorithm. That can be useful for comparing different algorithms, but when you make predictions you only care about one threshold (one point on the curve), which you will need to choose (and if you don't, scikit-learn uses 0.5 by default).
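
A small illustration with hypothetical labels and probabilities: evaluate a single operating point at a chosen threshold instead of summarising the whole curve with one area number:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities.
y_true  = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
y_score = np.array([0.9, 0.8, 0.7, 0.65, 0.3, 0.85, 0.6, 0.45, 0.75, 0.55])

# One chosen operating point (threshold) rather than an area over all of them.
threshold = 0.6
y_pred = (y_score >= threshold).astype(int)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # ~0.857 ~0.857
```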

Better metrics

Among classical metrics, when data is imbalanced, the F1-score is often used as it strikes a compromise between Precision and Recall while concentrating on the positive class. But most often we want to avoid either FP or FN more strongly, as the cost of these two types of errors is usually quite different. In your case, you want to minimize FP, so F_beta with beta < 1 (like 0.5) is more interesting than the F1-score, as it gives more weight to Precision than to Recall.

But it would certainly be better, if possible, to create your own metric based on your knowledge of the cost of FN and FP: for example 5 * FP + FN, if you can estimate that one FP is 5 times as costly as one FN.
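
A sketch of both ideas, using scikit-learn's fbeta_score and a hand-rolled cost function (the 5:1 cost ratio and the weighted_cost helper are only illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, fbeta_score

def weighted_cost(y_true, y_pred, fp_cost=5, fn_cost=1):
    """Total misclassification cost, with one FP assumed 5x as costly as one FN (illustrative)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp_cost * fp + fn_cost * fn

# Toy predictions, just to exercise the two metrics.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1])

print(fbeta_score(y_true, y_pred, beta=0.5))  # F0.5: weights Precision more than Recall
print(weighted_cost(y_true, y_pred))          # 5 * FP + FN = 5*1 + 1 = 6
```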

Dealing with imbalance

There are 2 families of methods: cost-sensitive and resampling.
There is no strict limit, but we usually start talking about an imbalanced dataset when the majority class reaches 80-90%. The most important factor, though, is the absolute quantity rather than the percentage: a 60/40 split with only 60 samples in the majority class and 40 in the minority class is much worse than a 99/1 split with 1 million samples in the minority class.

Actually your dataset is not that imbalanced, and you are more interested in the majority class than in the minority one. In any case, the scores without any imbalance handling should be computed first as a baseline; resampling doesn't always improve the score (and it should never be applied to the validation/test sets).

Between the two, I have had more success with cost-sensitive methods, but many people seem to like resampling.

Many libraries (scikit-learn, keras, xgboost, ...) offer a class_weight parameter or similar, which applies different coefficients to the positive and negative samples when the loss (or impurity, for tree-based algorithms) is computed, in order to give more importance to the minority samples.

Or you can tune the decision threshold (the cut-off on the predicted probability between the positive and negative class, 0.5 by default in scikit-learn) to optimize your own custom metric, as discussed in the previous part.
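
A minimal sketch of such a threshold search, reusing the hypothetical weighted_cost function from above (the grid, the helper name, and the clf/X_valid/y_valid names are illustrative; recent scikit-learn versions also ship TunedThresholdClassifierCV for this purpose):

```python
import numpy as np

def tune_threshold(y_true, y_proba, cost_fn, thresholds=np.linspace(0.01, 0.99, 99)):
    """Return the threshold minimising `cost_fn` over a simple grid (illustrative helper)."""
    costs = [cost_fn(y_true, (y_proba >= t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmin(costs))]

# Usage sketch (clf, X_valid and y_valid are assumed to exist):
# proba = clf.predict_proba(X_valid)[:, 1]
# best_t = tune_threshold(y_valid, proba, weighted_cost)
# y_pred = (proba >= best_t).astype(int)
```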

rehaqds

I got a very good answer to this from James Carmichael in the comments section (January 8, 2025, 2:22 am). Please find it at the link below: https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/#comment-729653

As a rule of thumb he suggests:
– PR-AUC is a strong choice for imbalanced datasets where the positive class is rare.
– For datasets where the positive class is dominant but still important, consider Balanced Accuracy, F1-Score, MCC, or ROC-AUC depending on your priorities:
  – F1-Score: if precision and recall for the positive class are the focus.
  – Balanced Accuracy/MCC: if both classes are important but the positive class is the primary target.
  – ROC-AUC: if you need a general assessment of separability.
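
For reference, all of those metrics are available in scikit-learn; a minimal sketch with toy, positive-dominant labels just to show the calls:

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

# Toy labels, predictions, and scores (illustrative only).
y_true  = np.array([1, 1, 1, 1, 0, 1, 0, 1])
y_pred  = np.array([1, 1, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.95, 0.6, 0.85])

print(balanced_accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(matthews_corrcoef(y_true, y_pred))
print(roc_auc_score(y_true, y_score))
```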

Vicky