I have customer demographic data that includes columns such as age, the first half of the postcode, occupation (from a defined list of possible occupations), and more. Each month I receive a new, unlabelled batch of 1000 rows of this data, and I need to feed it into my trained model to predict which item (out of 5 items) each person in the new batch is most likely to buy (a multiclass classification problem).
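For context, the monthly scoring step is roughly the following (a simplified sketch; the file and column names are illustrative, and the model is a fitted scikit-learn-style pipeline):

```python
import joblib
import pandas as pd

# Illustrative artefacts: a fitted multiclass classifier saved earlier, and
# this month's unlabelled batch of ~1000 rows.
model = joblib.load("item_classifier.joblib")
new_batch = pd.read_csv("batch_2024_06.csv")  # age, postcode_area, occupation, ...

# Predict which of the 5 items each person is most likely to buy.
new_batch["predicted_item"] = model.predict(new_batch)
```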
Each time I receive this data, I compare summary statistics between the old and new data and investigate any changes in the distribution of the categorical variables using hypothesis testing. Suppose my tests show that the new batch has vastly different summary statistics or distributions from my training set, for example:
- The new batch targeted people under 25 only, whereas my training set contains all age groups.
- The new batch targeted people from one specific area of the UK, whereas my training set covers locations across the whole of the UK.
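For reference, my monthly comparison looks roughly like this (a simplified sketch, assuming pandas DataFrames train_df and new_df; column names such as postcode_area are illustrative):

```python
import pandas as pd
from scipy import stats

def drift_checks(train_df: pd.DataFrame, new_df: pd.DataFrame, alpha: float = 0.05) -> None:
    """Compare the new batch against the training data column by column."""
    # Side-by-side summary statistics for the numeric column.
    print(pd.concat([train_df["age"].describe(), new_df["age"].describe()],
                    axis=1, keys=["train", "new_batch"]))

    # Numeric column: two-sample Kolmogorov-Smirnov test on the age distribution.
    ks = stats.ks_2samp(train_df["age"], new_df["age"])
    print(f"age: KS p-value = {ks.pvalue:.3g} (flag drift: {ks.pvalue < alpha})")

    # Categorical columns: chi-squared test on the train-vs-batch contingency table.
    for col in ["postcode_area", "occupation"]:
        train_counts = train_df[col].value_counts()
        new_counts = new_df[col].value_counts()
        cats = train_counts.index.union(new_counts.index)
        observed = pd.DataFrame({
            "train": train_counts.reindex(cats, fill_value=0),
            "new_batch": new_counts.reindex(cats, fill_value=0),
        })
        chi2, p, dof, _ = stats.chi2_contingency(observed)
        print(f"{col}: chi-squared p-value = {p:.3g} (flag drift: {p < alpha})")
```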
Would I need to:

1. Make any changes to my training set, or my overall workflow, to adjust for this?
2. As far as I know, this is data drift. Am I correct in saying that?
3. If the incoming batch were labelled, so we knew which items these people bought, and there was a sizable difference in the proportion of each product sold, what could I do to quantify this instead of naively adding the new data to the training set and retraining my model? (I've sketched the kind of check I mean below.)
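To make question 3 concrete, this is the kind of check I have in mind for the product proportions (a sketch, assuming a pandas Series of product labels for the training set and for the labelled batch):

```python
import pandas as pd
from scipy import stats

def product_mix_shift(train_labels: pd.Series, batch_labels: pd.Series):
    """Chi-squared goodness-of-fit test: does the batch's product mix
    match the proportions seen in the training data?"""
    train_props = train_labels.value_counts(normalize=True)  # all 5 items assumed present
    batch_counts = batch_labels.value_counts().reindex(train_props.index, fill_value=0)
    expected = train_props * batch_counts.sum()  # expected counts if there were no shift
    return stats.chisquare(f_obs=batch_counts, f_exp=expected)
```

A small p-value here would tell me the product mix in the new batch genuinely differs from the training set (label/prior shift), rather than me just eyeballing the proportions.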
Thanks