
I'm busy with a supervised machine learning problem where I am predicting contract cancellation. Although this is a lengthy question, I hope someone will take the time to read it, as I'm convinced it will help others out there (I've been unable to find ANY solutions that have helped me).

I have the following two datasets:

1) "Modelling Dataset"

Contains about 400k contracts (rows) with 300 features and a single label (0 = "Not Cancelled", 1 = "Cancelled").

Each row represents a single contract, and each contract is only represented once in the data. There are 350k "Not Cancelled" and 50k "Cancelled" cases.

Features are all extracted as at a specific date for each contract, referred to as the "Effective Date". For "Cancelled" contracts, the "Effective Date" is the date of cancellation. For "Not Cancelled" contracts, the "Effective Date" is a date, say, 6 months ago. This will be explained in a moment.

2) "Live Dataset"

Contains 300k contracts (rows) with the same list of 300 features. All these contracts are "Not Cancelled", of course, as we want to predict which of them will cancel. These contracts were followed for a period of 2 months, and I then added a label to indicate whether each actually ended up cancelling in those two months: 0 = "Not Cancelled", 1 = "Cancelled".

The problem:

I get amazing results on the "Modelling Dataset" with a random train/test split (e.g. Precision 95%, AUC 0.98), but as soon as that model is applied to the "Live Dataset", it performs poorly and cannot predict well which contracts end up cancelling (e.g. Precision 50%, AUC 0.7).

On the Modelling Dataset, the results are great almost irrespective of model or data preparation. I've tested a number of models (e.g. SkLearn random forest, Keras neural network, Microsoft LightGBM, SkLearn recursive feature elimination). Even with default settings, the models generally perform well. I've standardized features. I've binned features to try to improve how well the model generalizes. Nothing has helped it generalize to the "Live Dataset".
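To make the setup concrete, here is a minimal sketch of the evaluation I'm describing, assuming two pandas DataFrames modelling_df and live_df that share the same feature columns plus a label column (these names are placeholders for my actual data):

    # Minimal sketch: random split within the modelling dataset vs. scoring on live data.
    # modelling_df / live_df and the "label" column are placeholder names.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_score, roc_auc_score

    feature_cols = [c for c in modelling_df.columns if c != "label"]

    # Random train/test split inside the modelling dataset: metrics here look great.
    X_train, X_test, y_train, y_test = train_test_split(
        modelling_df[feature_cols], modelling_df["label"],
        test_size=0.25, stratify=modelling_df["label"], random_state=0)

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    test_probs = model.predict_proba(X_test)[:, 1]
    print("test AUC:", roc_auc_score(y_test, test_probs))
    print("test precision:", precision_score(y_test, test_probs > 0.5))

    # The same model applied to the live dataset: metrics drop sharply.
    live_probs = model.predict_proba(live_df[feature_cols])[:, 1]
    print("live AUC:", roc_auc_score(live_df["label"], live_probs))
    print("live precision:", precision_score(live_df["label"], live_probs > 0.5))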

My suspicion:

In my mind, this is not an overfitting issue, because I have a test set within the "Modelling Dataset" and the results on that test set are great. It is not a modelling or even a hyper-parameter optimization issue either, as the results are already great.

I've also investigated whether there are significant differences in the profile of the features between the two datasets by looking at histograms, feature-by-feature. Nothing is worryingly different.
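For what it's worth, the same comparison can also be done numerically for the numeric features, e.g. by ranking them by a two-sample Kolmogorov-Smirnov statistic between the two datasets (a rough sketch using the same placeholder DataFrames as above):

    # Rough sketch: rank features by how differently they are distributed in the
    # modelling vs. live datasets (two-sample KS statistic; names are placeholders).
    from scipy.stats import ks_2samp

    feature_cols = [c for c in modelling_df.columns if c != "label"]
    drift = {col: ks_2samp(modelling_df[col], live_df[col]).statistic
             for col in feature_cols}
    for col, stat in sorted(drift.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{col}: KS statistic = {stat:.3f}")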

I suspect the issue is that the contracts marked as "Not Cancelled" in the "Modelling Dataset", which the model of course learns to recognize as "Not Cancelled", are essentially the exact same contracts that appear in the "Live Dataset", except that 6 months have now passed.

I suspect that the features for the "Not Cancelled" cases have not changed enough for the model to now recognize some of them as about to be "Cancelled". In other words, the contracts have not moved enough in the feature space.

My questions:

Firstly, does my suspicion sound correct?

Secondly, if I've stated the problem to be solved incorrectly, how would I set up the problem statement when the purpose is to predict cancellation of something like contracts (where the data on which you train will almost certainly contain the data on which you want to predict)?

For the record, the problem statement I've used here is similar to the way others have done this, and they reported great results. But I'm not sure those models were ever tested in real life. In other cases, the problem to be solved was slightly different, e.g. hotel booking cancellations, which is different because there is a stream of new incoming bookings and booking duration is relatively short, so no bookings are shared between the modelling and live datasets. Contracts, on the other hand, have a long duration and can cancel at any time, or sometimes never.

3 Answers


If your model makes a prediction 6 months into the future, then it doesn't make sense to judge its performance before 6 months have passed. If only 2 months have passed, then possibly 2/3 of the true positives have yet to reveal their true nature and you are arriving at a premature conclusion.

To test this theory, I would train a new model to predict 2 months out and use that to get an approximation of live accuracy while you wait 4 more months for the first model. Of course, there could be other problems, but this is what I would try first.

Ryan Zotti

It's hard to answer without a good look at the data, but if I had to guess, your point seems valid (assuming there's no problem with the cross-validation methodology or data leaks).

If you are "measuring" the contract features at different points in time, there may be a strong bias: the features of the cancelled contracts, measured at the point in time at which they were cancelled, might be very different from the "initial" features of those same contracts.

Hence, your model would be learning to predict that a contract is being cancelled at the date on which it is cancelled, and not prior to it; that's why it wouldn't work properly on your "real world data".

If you can, try using the data from the moment the contract was set up (initialized) to build your model.

epattaro

This is some time after the question, but I thought it worthwhile to include. The solution that gave consistent results between the datasets was to include different data in the "modelling data" (of which the training data is a subset).

Instead of including each contract only once in the data, I had to include every contract multiple times, e.g. at every effective date from 2016/01/01 up to the cancellation date (if cancelled) or today (if not cancelled). So each contract is included at many effective dates.

In each case, the label is now whether a cancellation occurred within a fixed period of interest (e.g. 1 month) from the effective date of that record: "1" for those that did cancel within 1 month and "0" for those that did not.
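As a rough illustration of that expansion (the column names contract_id, start_date, cancel_date and the helper extract_features() are placeholders for however the 300 features are pulled as at a given date):

    # Sketch: one row per contract per monthly effective date, labelled 1 if the
    # contract cancelled within the next month. Names are placeholders.
    import pandas as pd

    snapshot_dates = pd.date_range("2016-01-01", pd.Timestamp.today(), freq="MS")
    rows = []
    for _, c in contracts.iterrows():  # contracts: one row per contract
        end = c["cancel_date"] if pd.notna(c["cancel_date"]) else pd.Timestamp.today()
        for eff_date in snapshot_dates:
            if eff_date < c["start_date"] or eff_date > end:
                continue  # contract not active at this effective date
            feats = extract_features(c["contract_id"], eff_date)  # dict of the 300 features
            cancelled_within_month = (
                pd.notna(c["cancel_date"])
                and eff_date <= c["cancel_date"] < eff_date + pd.DateOffset(months=1))
            rows.append({**feats,
                         "contract_id": c["contract_id"],
                         "effective_date": eff_date,
                         "label": int(cancelled_within_month)})

    modelling_df = pd.DataFrame(rows)

With snapshots like this, it is worth splitting train/test by contract rather than by row, so the same contract does not appear in both sets at different effective dates.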

Now the model learns to recognise whether a contract will likely cancel within a month.

The results were not amazing, but at least consistent between the modelling and live sets. This was actually expected, as cancellation of long-term contracts over the short term is simply difficult to predict in many cases.