
What is the benefit of imputing numerical or categorical features when using decision-tree (DT) methods such as XGBoost that can handle missing values? This question mainly concerns values that are missing not at random.

Here is an example of a categorical feature whose values are missing not at random:

  • Class 1: User has a red car
  • Class 2: User has a blue car
  • Class 3: User has no car (missing value)

In this case, is it better to treat the feature as binary 0/1 with NaN for the missing values, or as a three-valued feature: 0, 1, and -999 for missing?

The same question applies to a numerical feature that indicates the age of the user's car; here, a missing value means the user has no car. Is it better to keep the missing values as NaN, or to impute them? If imputing is better, should I impute with the median and add an indicator feature for the rows where the value is missing?
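
For concreteness, here is a minimal sketch of the encodings I am comparing (the data and column names are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data: colour and age are missing exactly when the user has no car.
df = pd.DataFrame({"car_color": ["red", "blue", None, "red"],
                   "car_age":   [3.0,   7.0,  np.nan, 1.0]})

# Categorical, option A: binary with NaN kept for "no car".
df["car_is_red_nan"] = df["car_color"].map({"red": 1, "blue": 0})

# Categorical, option B: three-valued with a sentinel for "no car".
df["car_is_red_sentinel"] = df["car_color"].map({"red": 1, "blue": 0}).fillna(-999)

# Numerical: keep NaN, or impute the median and add a missingness indicator.
df["car_age_missing"] = df["car_age"].isna().astype(int)
df["car_age_imputed"] = df["car_age"].fillna(df["car_age"].median())
```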

thereandhere1

1 Answer

  1. It is boring and repetitive to say - sorry - but it depends on what you are trying to predict and on whether imputation makes sense at all. If you are trying to predict the car's price, its colour is probably important; but if you are trying to predict whether the person will default on a loan, the car colour is not relevant (unless, perhaps, people who have pink cars are so rich that there is absolutely no risk of default!).

  2. Is there any cost in adding a new categorical feature? For example, why not keep both features: the colour as well as whether the person owns a car at all. Of course, this depends on how frequent the missing values are. Later, the new feature (whether the person has a car) can be used to build other interaction features; see the first sketch after this list.

  3. Regarding the numerical feature: with a boosting method it does not matter much, because these methods inherently bin the numerical values when searching for splits. For example, CatBoost claims to have a better and more efficient binning algorithm.

  4. Similar to 3: it won't hurt to replace the missing age with some really odd number, e.g. -99, because there is no normalization step along the way that could get skewed. The best practice is to check what percentage of the values are missing and whether imputing improves your performance - in short, it is very much an experimental question. The second sketch after this list contrasts the two options.

  5. All your "meaningful" experiments will only tell you retrospectively what helped. Therefore, you are strongly advised to use pipelines, which help you reproduce every preprocessing step.

  6. There are also more sophisticated ways of imputing - for example, building a predictive model for "age" in your example and then using the predicted age as an input to your second model. Of course, all of this has to run inside a clean cross-validation loop to make sure there is no leakage; see the last sketch after this list.

  7. Go as inclusive as possible - add all sorts of features generously - e.g. for your numerical feature, keep the median-imputed, mean-imputed and other variants, and afterwards decide which makes better sense. If a statistician were here I would be knocked out, because there is a risk of overfitting your test data! So I hope you have an ocean of data.
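
To make point 2 concrete, here is a rough pandas sketch (the data and column names are made up) that keeps the colour feature and adds a separate ownership flag, which can later feed interaction features:

```python
import pandas as pd

# Hypothetical data: colour is missing exactly when the user has no car.
df = pd.DataFrame({"car_color": ["red", "blue", None, "blue"],
                   "income":    [40_000, 55_000, 30_000, 70_000]})

# Keep the colour feature as-is and add a separate "owns a car" flag.
df["has_car"] = df["car_color"].notna().astype(int)

# The flag can later be combined with other columns into interaction features.
df["income_if_has_car"] = df["income"] * df["has_car"]
```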
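For points 3 and 4, a minimal sketch (with made-up data) of the two options for the numerical feature: leave the NaNs in place, since XGBoost routes missing values to a learned default branch at each split, or replace them with an out-of-range sentinel, which a tree can isolate with a single split, so nothing gets skewed:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random((200, 3)) < 0.2] = np.nan          # inject missing values
y = rng.integers(0, 2, size=200)

# Option A: keep the NaNs; XGBoost learns a default direction per split.
model_nan = xgb.XGBClassifier(n_estimators=50).fit(X, y)

# Option B: replace NaNs with an out-of-range sentinel value.
X_sentinel = np.where(np.isnan(X), -999.0, X)
model_sentinel = xgb.XGBClassifier(n_estimators=50).fit(X_sentinel, y)
```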
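And for points 5 and 6, one way (a sketch, not the only way) to combine them: scikit-learn's IterativeImputer predicts each feature from the others, and putting it inside a pipeline means every cross-validation split refits the imputer on the training fold only, so there is no leakage:

```python
import numpy as np
import xgboost as xgb
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random((200, 3)) < 0.2] = np.nan          # made-up data with missing values
y = rng.integers(0, 2, size=200)

# Model-based imputation followed by the final classifier, as one estimator.
pipe = make_pipeline(IterativeImputer(random_state=0),
                     xgb.XGBClassifier(n_estimators=50))

# The imputer is refit on each training fold, so it never sees validation data.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```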

user702846