
I have a dataset with 95% false and 5% true labels, about 200,000 samples overall, and I'm fitting a LightGBM model. I mainly need high precision and a low number of false positives; I don't care much about accuracy.

I have tried playing around with the decision boundary after fitting and with increasing the weight of the positive class. This helps, but I wonder if there is something else I could do.

Because the dataset is very unbalanced, I think the model is spending a lot of effort on the TN/FN boundary, which I really don't care about. My intuition is also that the standard cross-entropy loss is implicitly focused more on accuracy than on precision.

I wonder if I could pre-filter my dataset, perhaps throwing away 50% of the samples while increasing the initial true/false ratio. Or maybe this is what LightGBM already does and what I want is fundamentally impossible. Or perhaps there is an alternative to cross-entropy loss that I could use.

Fireant

1 Answer


During training you can place "more emphasis" on a given class (or on individual samples) in one of two ways:

  • Use class weights to weight the error associated with a given class. These are usually used to balance the error on imbalanced classes. From the doc (here), there is a class_weight argument you can pass (assuming you're using the scikit-learn API in Python); see the sketch after this list.
  • Otherwise, you can oversample the minority class (i.e., duplicate its samples) to balance the data; this is also shown in the sketch below.
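
A rough sketch of both options, assuming the scikit-learn API (`LGBMClassifier`) and purely synthetic data in place of your real `X`, `y` (for imbalanced binary problems, LightGBM's `scale_pos_weight` parameter is an alternative to `class_weight`):

```python
import numpy as np
from lightgbm import LGBMClassifier

# Hypothetical stand-in data: ~200k samples, ~5% positives (replace with your own X, y).
rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 20))
y = (rng.random(200_000) < 0.05).astype(int)

# Option 1: class weights -- penalize errors on the positive class more heavily.
# 'balanced' sets weights inversely proportional to class frequencies;
# an explicit dict such as {0: 1.0, 1: 19.0} gives finer control.
clf = LGBMClassifier(class_weight="balanced")
clf.fit(X, y)

# Option 2: oversample the minority class by duplicating its rows
# until the two classes are roughly balanced.
pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]
oversampled_pos = rng.choice(pos_idx, size=len(neg_idx), replace=True)
idx = np.concatenate([neg_idx, oversampled_pos])
clf_oversampled = LGBMClassifier()
clf_oversampled.fit(X[idx], y[idx])
```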

Another approach - this one applied after training - is to calibrate the model using validation data. Assuming the model predicts in a continuous $[0, 1]$ range, you can tune a threshold $t$ such that a prediction $p$ is considered positive if $p > t$, thereby achieving a given precision or recall rate (e.g. $90\%$ precision) depending on the chosen value of $t$. In this way you can also account for the imbalance by trading off the two kinds of errors.
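
A minimal sketch of that threshold tuning, using scikit-learn's precision_recall_curve; the $90\%$ target, the held-out `X_val`/`y_val`, and the fitted `clf` are assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_val, probs, target_precision=0.90):
    """Smallest threshold whose validation precision reaches the target,
    which keeps recall as high as possible at that precision level."""
    precision, recall, thresholds = precision_recall_curve(y_val, probs)
    ok = precision[:-1] >= target_precision  # precision has one extra trailing entry
    return thresholds[ok].min() if ok.any() else thresholds.max()


# Usage (X_val, y_val are held-out validation data; clf is the fitted model):
# probs = clf.predict_proba(X_val)[:, 1]
# t = pick_threshold(y_val, probs)
# y_pred = (probs > t).astype(int)
```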

Luca Anzalone