
I am working on an insurance claims data set to classify newly acquired customers as either claim or non-claim.

The basic problem with the training set is the extreme imbalance between claim and non-claim profiles: claims make up only about 0.26% of the training set. Most claims are also concentrated in the final few years (the data is sorted by date).

When I trained Logistic Regression, and even Random Forests, on 70% of the data, the test results were well below satisfactory.

I've been looking at alternative models and came across this blog post. One line in particular got my attention:

GBM is better than rf_t. In the paper, the best classifier for two-class data sets was avNNet_t, with 83.0% accuracy

However, no real explanation was given for why that is. Can someone help me open this "black box"? Which model really works in the case described above, and why?

neural-nut

1 Answer


I believe that in your case predicting a claim is more important than predicting a non-claim. You say you trained on 70% of the data; because there are comparatively so few claim records, most of the wrong predictions are probably in the claim class. I would suggest balancing the data set, or selecting a random balanced subset (for example, 20% each of claim and non-claim), training a model with the techniques you have already applied, and testing it on the remaining data. If possible, use error measures that reflect your business case, such as giving different weights to the outcomes.

If the accuracy does not improve, you can try GBM techniques on this data. GBM often makes better predictions because each new tree is fit to the residuals of the trees before it, so the structure the residuals share shrinks at every stage and what is left looks more like white noise.

You can apply many different models to this data and check whether the accuracy improves, but eventually you should understand the model well enough to explain to someone why they should use it. Moreover, if you use feature-engineered data with different models, there is a good chance you will do better than with the previous models. How much this helps depends on your understanding of the business.
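The balancing and weighted-error suggestions can be sketched in plain Python. This is only an illustration under assumptions: the helper names `undersample` and `report`, the toy data, and the cost weights (a missed claim costing 50 times a false alarm) are all made up and would need to come from your own business case. The toy labels reuse the 0.26% claim rate from the question to show why accuracy alone is misleading here.

```python
import random


def undersample(rows, labels, seed=0):
    """Keep every claim (label 1) and an equally sized random sample of non-claims."""
    rng = random.Random(seed)
    claims = [i for i, y in enumerate(labels) if y == 1]
    non_claims = [i for i, y in enumerate(labels) if y == 0]
    keep = claims + rng.sample(non_claims, min(len(claims), len(non_claims)))
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


def report(y_true, y_pred, cost_missed_claim=50.0, cost_false_alarm=1.0):
    """Accuracy, precision, recall, and a weighted cost (the weights are assumptions)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "cost": cost_missed_claim * fn + cost_false_alarm * fp,
    }


# Toy data: 26 claims in 10,000 rows, i.e. the 0.26% rate from the question.
labels = [1] * 26 + [0] * 9974
rows = [[i] for i in range(len(labels))]  # placeholder features

# A "model" that always predicts non-claim: accuracy 0.9974, recall 0.0.
always_no_claim = [0] * len(labels)
baseline = report(labels, always_no_claim)
print(baseline)

# Balanced training subset: all 26 claims plus 26 randomly chosen non-claims.
balanced_rows, balanced_labels = undersample(rows, labels)
print(len(balanced_labels), sum(balanced_labels))
```

Only the training portion should be balanced this way. The data you test on should keep its original claim rate, otherwise precision and the weighted cost will look better than they would on real customers.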