
Hi, I'm currently writing my bachelor's thesis and am stuck at a few steps.
I've developed a few ML models (XGBoost, (Balanced) Random Forest, ElasticNet, ...) on an extremely imbalanced data set (only about 0.2% of the data set belongs to the positive class). Almost all of my models have roughly the same performance on the metrics I chose:

  • ROC AUC: 0.77-0.80
  • Recall for positive class: 0.80-0.88
  • PR AUC: 0.04-0.06
  • Precision for positive class: 0.01-0.02
  • Matthews Correlation Coefficient (MCC): 0.13-0.15
  • Brier Score: 0.11-0.13

I'm quite stressed that the metrics which are normally sensitive to class imbalance are really bad. I have tried several sampling methods, including some variations of SMOTE and undersampling (for which I even implemented a cross-validation script to find the best undersampling rate), and I also tried class weights, but the results don't seem to get better. If anyone has any suggestions, it would mean the world to me!

One piece of background information: the model should be a classifier for credits, and there are only two classes: good and bad credits. I've read in some forums that this kind of result is acceptable if recall is more important and false positives (which are normally numerous due to the imbalance) are not so "expensive". But classifying good credits as bad credits is, in fact, bad, isn't it?
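For reference, here is a minimal sketch of the class-weighting setup I tried, assuming xgboost's scikit-learn API; the data below is a synthetic stand-in for my real credit data:

```python
# Minimal sketch: class weighting via scale_pos_weight, evaluated with
# cross-validated PR AUC. The data is a synthetic stand-in (~0.2% positives).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 10))
y = (rng.random(50_000) < 0.002).astype(int)

# The usual starting point: scale_pos_weight ~ (# negatives) / (# positives).
pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)
model = XGBClassifier(scale_pos_weight=pos_weight, eval_metric="aucpr")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print("PR AUC per fold:", scores)
```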
Thank you for reading; I would appreciate any help!
-----------------------------------------------------------------
P.S.: I also want to try out some new metrics for this imbalanced classification problem. The suggested metrics are: Cohen's kappa; weighted-averaged accuracy and F1 score; and macro-averaged accuracy and F1 score.
If anyone has a suggestion for metrics that I could use, I'd also appreciate it!
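If it helps, here is a minimal sketch of how I would compute those suggested metrics with scikit-learn (toy labels; I'm treating scikit-learn's balanced accuracy as the binary case of macro-averaged accuracy):

```python
# Minimal sketch: the suggested metrics on toy labels, via scikit-learn.
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 1, 0]

print("Cohen's kappa:    ", cohen_kappa_score(y_true, y_pred))
print("Macro F1:         ", f1_score(y_true, y_pred, average="macro"))
print("Weighted F1:      ", f1_score(y_true, y_pred, average="weighted"))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("MCC:              ", matthews_corrcoef(y_true, y_pred))
```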

user159373

1 Answer


I strongly recommend reading Frank Harrell's two blog posts about machine learning "classification" problems: Classification vs. Prediction and Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules.

Remember that your classification metrics like accuracy and $F_1$ score apply to the original model followed by a decision rule about how to assign a predicted value to a category. It might be that the original model is good but the decision rule is not. Indeed, Harrell argues in those two links that decision rules are usually a bad idea, or at least a premature step, as going from a prediction (e.g., $0.2$ predicted probability of default) to a decision (e.g., a category or a course of action) should incorporate knowledge of the consequences of correct and incorrect decisions. For example, it might be unlikely to get a traffic ticket if you run a red light late at night, but most people will sit at the light because of how furious they will be if they do get ticketed (let alone get into a collision). That is, despite the odds not favoring being ticketed or crashing your car, you stay at the light.

(In fact, despite there being just two categories, there might be more than two courses of action.)
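To make the cost argument concrete, here is a minimal sketch of an expected-loss decision rule; the cost figures are invented for illustration only:

```python
# Minimal sketch of a cost-based decision rule. The costs are invented:
# rejecting a good credit forgoes some profit (cost_fp), while approving a
# bad credit loses much more (cost_fn).
def decide(p_bad: float, cost_fp: float = 100.0, cost_fn: float = 5000.0) -> str:
    # Reject only when the expected loss of approving, p_bad * cost_fn,
    # exceeds the expected loss of rejecting, (1 - p_bad) * cost_fp.
    # Equivalently: reject when p_bad > cost_fp / (cost_fp + cost_fn).
    threshold = cost_fp / (cost_fp + cost_fn)  # ~0.0196 with these costs
    return "reject (bad)" if p_bad > threshold else "approve (good)"

print(decide(0.01))  # approve: expected default loss is below the rejection cost
print(decide(0.05))  # reject: even a 5% default risk is too expensive here
```

Note how asymmetric costs push the optimal threshold far below $0.5$; with a different cost structure, the very same predicted probabilities lead to different actions.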

I can see proceeding in two ways.

  1. Do not do any classification. Only work with the predicted probabilities. Evaluate them with log loss, Brier score, McFadden's pseudo $R^2$, Efron's pseudo $R^2$, or any number of other measures of performance, such as those discussed by UCLA. Look into the calibration of those predictions, such as with the rms::val.prob function in R software (a rough Python analogue is sketched after this list). Even if these are not emphasized or discussed in your program (I can imagine various reasons why, some better than others), someone with good statistical acumen should like this. "Sweet, this student already knows some of the good stuff from Harrell's RMS textbook," your grader might think. "This student is already working at a postgraduate level."
  2. Particularly if your field is finance or business instead of statistics, it might be that you have to consider what to do with the predictions, not just make them. Give context to the predictions, even if they are low. Interpret them as a financial economist or as the owner of a business would.
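Here is the probability-evaluation sketch promised above. It uses scikit-learn rather than rms::val.prob, and the labels and predictions are synthetic, so take it as an analogue rather than a translation:

```python
# Rough Python analogue of evaluating predicted probabilities: proper scoring
# rules plus reliability-diagram data. Labels and predictions are synthetic.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, log_loss

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.002).astype(int)               # ~0.2% positives
p_hat = np.clip(rng.beta(1, 200, size=10_000), 1e-6, 1 - 1e-6)  # toy predictions

print("Log loss:   ", log_loss(y_true, p_hat))
print("Brier score:", brier_score_loss(y_true, p_hat))

# Reliability diagram data: mean predicted probability vs. observed frequency.
prob_true, prob_pred = calibration_curve(y_true, p_hat, n_bins=10,
                                         strategy="quantile")
for pred, obs in zip(prob_pred, prob_true):
    print(f"mean predicted {pred:.4f} -> observed {obs:.4f}")
```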

I discuss related notions here.

Dave