
Recently I've been thinking about the proper use of encoding within a cross-validation scheme. The customarily advised way of encoding features is:

  1. Split the data into train and test (hold-out) set
  2. Fit the encoder (either LabelEncoder or OneHotEncoder) on the train set
  3. Transform both the train and test set using the fitted encoder (see the sketch below).
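
For concreteness, here is a minimal sketch of steps 1-3 with scikit-learn's OneHotEncoder; the column name and data are made up for illustration:

```python
# Minimal sketch of the fit-on-train / transform-both workflow (toy data).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red", "green"]})
y = [0, 1, 0, 1, 0, 1]

# 1. Split the data into train and test (hold-out) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# 2. Fit the encoder on the train set only
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# 3. Transform both sets with the encoder fitted on the train set
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
```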

This approach is claimed to prevent any data leakage. However, this seems to often be omitted during cross-validation. Let's suppose I am performing cross-validation on the aforementioned train set. If I encode the train set first and then perform cross-validation, it doesn't really mimic the steps above. Shouldn't the encoding be performed "within" cross-validation then? For example, assuming that we perform 5-fold cross-validation, shouldn't we fit the encoder on 4 folds and transform the 5th fold in each cross-validation step? I believe this is what's usually done in target encoding, but not really with label or one-hot encoding.
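
If I understand correctly, that would mean something like wrapping the encoder and the model in a scikit-learn Pipeline and cross-validating the pipeline, so that the encoder is re-fitted on the 4 training folds of every split and only applied to the 5th (validation) fold. A sketch with made-up data:

```python
# Encoding "within" cross-validation: the Pipeline is cloned and fitted on each
# training split, so the encoder never sees the validation fold before transforming it.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red",
                            "green", "red", "blue", "green", "red"]})
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

pipe = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("model", LogisticRegression()),
])

scores = cross_val_score(pipe, X, y, cv=5)  # 5-fold CV, encoder fitted per fold
```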

My questions therefore are:

  1. Am I right that the encoder needs to be fitted on the 4 training folds, and not on the 5th (validation) fold, in each cross-validation step if we really want to prevent overfitting?
  2. If not, why is it necessary to perform all 3 steps mentioned above when dealing with the train and test (hold-out) sets?
jakes

1 Answer

You're right: the encoding step itself can be a source of data leakage, and normally it should be done inside the CV loop using only the current training folds, as you describe.

The reason is indeed the one you mention in the comment: if there is a class label or a feature category which, by chance, doesn't appear in a particular training split during CV, the model is not supposed to know that this class/category even exists.
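
For example, with scikit-learn's OneHotEncoder an unseen category has to be treated explicitly as unknown; the handle_unknown="ignore" option encodes it as all zeros instead of raising an error (toy data below):

```python
# A category present only in the validation fold is unknown to an encoder
# fitted on the training folds; handle_unknown="ignore" maps it to all zeros.
from sklearn.preprocessing import OneHotEncoder

train_fold = [["red"], ["blue"], ["red"]]
valid_fold = [["green"]]  # never seen during fitting

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_fold)
print(enc.transform(valid_fold).toarray())  # [[0. 0.]] -> unknown category
```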

In general I would think that this issue can only decrease the performance on the test set, so it's probably not as serious as other kinds of data leakage. Still, it's definitely a cleaner experimental design to encode using only the training set.

A closely related issue in NLP is when the system is not designed to deal with out-of-vocabulary (OOV) words: if all the words in both the training and test set are encoded (same mistake), then it wrongly looks as if any text can be fully encoded, potentially leading to bad surprises later.
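
As an illustration, scikit-learn's CountVectorizer shows the OOV situation directly: words that were not in the training corpus are simply dropped at transform time (toy example):

```python
# OOV words are silently ignored when transforming unseen text.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["the cat sat", "the dog ran"])
print(vec.transform(["the zebra sat"]).toarray())  # "zebra" is OOV and dropped
```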

That being said, it's usually a good idea to discard rare features or label values, and if this is done then the result should be the same using either the proper method or the sloppy one.
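
In recent versions of scikit-learn (1.1+, if I'm not mistaken) this can be done directly in the encoder by grouping infrequent categories, which makes fold-to-fold differences in rare categories mostly irrelevant (the threshold below is arbitrary):

```python
# Rare categories are collapsed into a single "infrequent" column.
from sklearn.preprocessing import OneHotEncoder

X_train = [["red"]] * 8 + [["blue"]] * 8 + [["purple"]]  # "purple" is rare
enc = OneHotEncoder(handle_unknown="infrequent_if_exist", min_frequency=2)
enc.fit(X_train)
print(enc.get_feature_names_out())  # rare values mapped to an "infrequent" column
```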

Erwan