I have customer demographic data that includes columns such as age, the first half of the postcode, occupation (from a defined list of possible occupations), and more. Each month I receive a new, unlabelled batch of 1000 rows of this data, and I need to feed it into my trained model to predict which item (out of 5 items) each person in the new batch is most likely to buy (a multiclass classification problem).
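For context, the monthly scoring step is roughly the following (a simplified sketch; the file and column names are illustrative, and the model is a fitted scikit-learn-style pipeline):

```python
import joblib
import pandas as pd

# Illustrative artefacts: a fitted multiclass classifier saved earlier, and
# this month's unlabelled batch of ~1000 rows.
model = joblib.load("item_classifier.joblib")
new_batch = pd.read_csv("batch_2024_06.csv")  # age, postcode_area, occupation, ...

# Predict which of the 5 items each person is most likely to buy.
new_batch["predicted_item"] = model.predict(new_batch)
```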
Each time I receive this data, I compare summary statistics between the old and new data and investigate any changes in the distribution of the categorical variables using hypothesis testing. Suppose my tests show that the new batch has vastly different summary statistics or distributions from my training set, for example:
- The new batch targeted people under 25 only, whereas my training set contains all age groups.
- The new batch targeted people from one specific area of the UK, whereas my training set covers locations across the whole of the UK.
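For reference, my monthly comparison looks roughly like this (a simplified sketch, assuming pandas DataFrames train_df and new_df; column names such as postcode_area are illustrative):

```python
import pandas as pd
from scipy import stats

def drift_checks(train_df: pd.DataFrame, new_df: pd.DataFrame, alpha: float = 0.05) -> None:
    """Compare the new batch against the training data column by column."""
    # Side-by-side summary statistics for the numeric column.
    print(pd.concat([train_df["age"].describe(), new_df["age"].describe()],
                    axis=1, keys=["train", "new_batch"]))

    # Numeric column: two-sample Kolmogorov-Smirnov test on the age distribution.
    ks = stats.ks_2samp(train_df["age"], new_df["age"])
    print(f"age: KS p-value = {ks.pvalue:.3g} (flag drift: {ks.pvalue < alpha})")

    # Categorical columns: chi-squared test on the train-vs-batch contingency table.
    for col in ["postcode_area", "occupation"]:
        train_counts = train_df[col].value_counts()
        new_counts = new_df[col].value_counts()
        cats = train_counts.index.union(new_counts.index)
        observed = pd.DataFrame({
            "train": train_counts.reindex(cats, fill_value=0),
            "new_batch": new_counts.reindex(cats, fill_value=0),
        })
        chi2, p, dof, _ = stats.chi2_contingency(observed)
        print(f"{col}: chi-squared p-value = {p:.3g} (flag drift: {p < alpha})")
```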
Would I need to:

1. Make any changes to my training set, or my overall workflow, to adjust for this?
2. As far as I know, this is data drift. Am I correct in saying that?
3. If the incoming batch were labelled, so we knew which items these people bought, and there was a sizable difference in the proportion of each product sold, what could I do to quantify this instead of naively adding the new data to the training set and retraining my model? (I've sketched the kind of check I mean below.)
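To make question 3 concrete, this is the kind of check I have in mind for the product proportions (a sketch, assuming a pandas Series of product labels for the training set and for the labelled batch):

```python
import pandas as pd
from scipy import stats

def product_mix_shift(train_labels: pd.Series, batch_labels: pd.Series):
    """Chi-squared goodness-of-fit test: does the batch's product mix
    match the proportions seen in the training data?"""
    train_props = train_labels.value_counts(normalize=True)  # all 5 items assumed present
    batch_counts = batch_labels.value_counts().reindex(train_props.index, fill_value=0)
    expected = train_props * batch_counts.sum()  # expected counts if there were no shift
    return stats.chisquare(f_obs=batch_counts, f_exp=expected)
```

A small p-value here would tell me the product mix in the new batch genuinely differs from the training set (label/prior shift), rather than me just eyeballing the proportions.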
Thanks