
I am training prediction models on an imbalanced dataset (rare positive cases). The final AUC is a good 0.92, but the F1 score is very low: 0.2.

Is it possible to add some key features that change the class probability distributions, so that there is a threshold which yields a higher F1?

Here is an example:

In my original model, I get the class probability distributions shown below:

[figure: predicted probability distributions of the two classes, with a large overlap]

I can adjust the threshold to get better precision, but that cuts off some recall. This is due to the large overlapping area between the two distributions.
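To make the threshold trade-off concrete, here is a minimal sketch (using scikit-learn on synthetic imbalanced data; all names and numbers are illustrative) that sweeps classification thresholds over the predicted probabilities and keeps the one with the best F1:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy imbalanced data: roughly 5% positives.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]

# Sweep candidate thresholds and keep the one with the highest F1.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(best_t, max(scores))
```

On imbalanced data the F1-optimal threshold is usually well below the default 0.5, because a lower cut-off recovers recall on the rare class.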

Then I use an extreme dataset, i.e. I include the target itself as a feature. As a result, the two distributions become completely disjoint:

[figure: completely separated class probability distributions]

Does this mean that if I introduce a strong feature, I can separate the distributions to some extent and thereby improve precision, and hence the F1 score? Alternatively, please advise how to improve precision on an imbalanced classification problem.

Many thanks

Brian Spiering
LUSAQX

1 Answer


Introducing a strong feature would definitely help, precisely because it is "strong" :-). If you do not have such a sure-fire feature, you could start by changing the penalty for misclassification.

You may try synthetic (e.g. SMOTE) or non-synthetic (domain-based) approaches to bulk up the minority class.
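The SMOTE idea can be sketched in plain NumPy (this is an illustrative sketch, not the imbalanced-learn implementation; the function name and parameters are invented): each synthetic minority point is interpolated between a minority sample and one of its k nearest minority neighbours.

```python
import numpy as np

def smote_like(X_minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between each chosen sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    # Pairwise distances within the minority class; exclude self-matches.
    d = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                              # pick a minority sample
        j = neighbours[i, rng.integers(min(k, n - 1))]   # pick one of its neighbours
        lam = rng.random()                               # interpolation weight in [0, 1)
        new.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(new)

# Illustrative usage on random minority data.
X_min = np.random.default_rng(0).normal(size=(20, 4))
synth = smote_like(X_min, 30, rng=0)
print(synth.shape)
```

In practice you would use a tested library (e.g. imbalanced-learn's `SMOTE`) rather than this sketch, but the geometry is the same: synthetic points land on segments inside the minority region rather than duplicating existing rows.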

Also, if the positive class is very rare, repeated sampling techniques may work.
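Plain repeated sampling (random oversampling with replacement) can be sketched as follows; the toy data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy labels: 95 negatives, 5 positives.
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(100, 3))

# Repeat minority rows (sampling with replacement) until the classes balance.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=95 - 5, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y_bal))  # → [95 95]
```

Note that oversampling should be applied only to the training split, never before cross-validation splitting, or the duplicated rows leak into the validation folds.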

CARTman