
I am working on a multi-class classification problem with ~65 features and ~150K instances. About 30% of the features are categorical and the rest are numerical (continuous). I understand that standardization or normalization should be done after splitting the data into train and test subsets, but I am still not sure about the imputation step. For the classification task, I am planning to use Random Forest, Logistic Regression, and XGBoost (which are not distance-based).

Could someone please explain which should come first: split > imputation, or imputation > split? If split > imputation is correct, should I then do imputation > standardization or standardization > imputation?

Sarah

2 Answers


Always split before you do any data pre-processing. Performing pre-processing before splitting will mean that information from your test set will be present during training, causing a data leak.

Think of it like this: the test set is supposed to be a way of estimating performance on totally unseen data. If it affects the training, then it is only partially unseen.

I don't think the order of scaling/imputing is as strict. I would impute first if the imputation method might throw off the scaling/centering.

Your steps should be (see the sketch after the list):

  1. Splitting
  2. Imputing
  3. Scaling
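In scikit-learn, that order might look roughly like the sketch below. The DataFrame `df`, the target column name, and the median-imputation/standard-scaling choices are assumptions for illustration only; the categorical features mentioned in the question would need their own encoding/imputation step (e.g. via a `ColumnTransformer`).

```python
# Minimal sketch, assuming df is a DataFrame with a "target" column
# and the remaining columns are numeric.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = df.drop(columns="target"), df["target"]

# 1. Split first, so the test set never influences any fitted statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. + 3. Impute, then scale, both learned from the training data only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)           # all statistics are computed on X_train
print(pipe.score(X_test, y_test))    # the test set is only ever transformed
```

Using a `Pipeline` also keeps the same split-then-preprocess discipline inside cross-validation, since each fold is fitted independently.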

Here are some related questions to support this:

Imputation before or after splitting into train and test?

Imputation of missing data before or after centering and scaling?

Simon Larsson

If you impute/standardize before splitting and then split into train/test, you are leaking data from your test set (which is supposed to be completely withheld) into your training set. This will yield optimistically biased estimates of model performance.

The correct way is to split your data first and then apply imputation/standardization (the order depends on whether the imputation method requires standardized inputs).

The key here is that you are learning everything from the training set and then "predicting" on to the test set. For normalization/standardization, you learn the sample mean and sample standard deviation from the training set, treat them as constants, and use these learned values to transform the test set. You don't use the test set mean or the test set standard deviation in any of these calculations.
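As a minimal illustration of that fit-on-train, apply-to-test pattern (assuming `X_train` and `X_test` are numeric NumPy arrays from an earlier split):

```python
import numpy as np

mu = X_train.mean(axis=0)      # sample mean, learned from the training set only
sigma = X_train.std(axis=0)    # sample std, learned from the training set only
                               # (a real implementation should guard against zero std)

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma   # test set statistics are never used
```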

For imputation the idea is similar. You learn the required parameters from the training set only and then predict the required test set values.
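The same pattern, sketched with scikit-learn's `SimpleImputer` (the median strategy here is just an example choice):

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)   # fit: per-column medians computed on train
X_test_imp = imputer.transform(X_test)         # transform only, no refitting on test
```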

This way your performance metrics will not be biased optimistically by your methods inadvertently seeing the test set observations.

aranglol