I am trying to solve a problem with an imbalanced data set. I have two classes: patients with risk (1) and patients without risk (0). The patients without risk are the majority class.

For analysis, I used methods such as:

  • RandomOverSampler
  • SMOTE
  • ADASYN
  • Borderline-SMOTE
  • RandomUnderSampler
  • TomekLinks
  • NearMiss
  • SMOTEENN
  • SMOTETomek

In addition to oversampling and undersampling, I would also like to try other techniques for handling an imbalanced data set.

What other techniques can I try? For example, ensemble methods, class weighting, and what else?

nwaldo
Naty

1 Answer

I agree with @picky_porpoise. You should consider whether the imbalance is really a problem. The issue may be that you are evaluating with a metric such as accuracy, which is highly misleading when the classes are strongly imbalanced. For example, if there are 100 subjects and only 1 has the event, a model that always predicts "no event" will be 99% accurate while identifying no at-risk patients at all. Consider F1, precision, and/or recall instead.
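To make the 100-subject example concrete, here is a minimal sketch in plain Python. The cohort and the always-predict-0 "model" are made up for illustration; the metrics are computed by hand from the confusion counts.

```python
# Hypothetical cohort: 100 patients, exactly 1 with the event (label 1).
y_true = [1] + [0] * 99
# A trivial "model" that always predicts the majority class (no risk).
y_pred = [0] * 100

# Confusion-matrix counts.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Fall back to 0.0 when a denominator is zero (nothing predicted positive).
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(accuracy, precision, recall, f1)  # 0.99 0.0 0.0 0.0
```

Accuracy looks excellent, yet recall and F1 are zero because the single at-risk patient is missed.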

Additionally, since you are hoping to build a risk prediction model, you need to be concerned with model calibration. The sampling techniques you mention change the class prevalence the model sees during training, so its predicted probabilities will tend to be miscalibrated unless you recalibrate afterwards. There are multiple papers talking about this now, see here and here. Consider instead adjusting the model's decision threshold based on the problem you are trying to solve. This will likely require consulting with subject matter experts and stakeholders.
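As a rough sketch of threshold adjustment: if the predicted probabilities are calibrated and correct decisions carry no cost, the expected-cost-minimizing threshold is cost_fp / (cost_fp + cost_fn). The function names, the costs, and the probabilities below are all hypothetical; the real costs are exactly what you would need to agree on with the stakeholders.

```python
def cost_based_threshold(cost_fp, cost_fn):
    """Threshold minimizing expected cost, assuming calibrated probabilities
    and zero cost for correct decisions."""
    return cost_fp / (cost_fp + cost_fn)


def classify(probs, threshold):
    """Flag a patient as at risk (1) when the predicted risk reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


# Assumed costs: missing an at-risk patient is 9x worse than a false alarm.
threshold = cost_based_threshold(cost_fp=1.0, cost_fn=9.0)

# Made-up predicted risks from some model trained on the original, unresampled data.
probs = [0.02, 0.08, 0.15, 0.40, 0.70]
labels = classify(probs, threshold)

print(threshold, labels)  # 0.1 [0, 0, 1, 1, 1]
```

With the default 0.5 cut-off only the last patient would be flagged; lowering the threshold trades more false alarms for fewer missed cases without touching the training data.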

Additionally, you can consider other metrics such as net benefit, which comes from decision science and weighs benefits against costs in clinical prediction models.
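For reference, net benefit at a threshold probability pt is usually defined in decision curve analysis as TP/n - (FP/n) * pt / (1 - pt). Below is a minimal sketch of that formula; the labels and predicted risks are invented, and `net_benefit` is just an illustrative helper, not a library function.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit at a given threshold probability (0 < threshold < 1):
    TP/n - (FP/n) * threshold / (1 - threshold)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 0)
    return tp / n - (fp / n) * (threshold / (1 - threshold))


# Made-up data: 10 patients, 2 with the event.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
probs = [0.8, 0.3, 0.1, 0.6, 0.05, 0.25, 0.1, 0.02, 0.15, 0.4]

pt = 0.2  # assumed threshold probability chosen with clinicians
nb_model = net_benefit(y_true, probs, pt)
# Reference strategy: treat everyone (predicted risk 1.0 for all patients).
nb_treat_all = net_benefit(y_true, [1.0] * len(y_true), pt)

print(nb_model, nb_treat_all)
```

On this toy data the model's net benefit is about 0.125 versus about 0 for treating everyone; treating no one has a net benefit of 0 by definition. In practice you would compare these across a range of plausible thresholds.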

healthydata