
I would like to ask a question about dataset splitting. If I have a dataset, can I perform preprocessing (imputation, scaling, etc.) on the entire dataset and then split it into train and validation sets? If I don't have a test set, I often create new inputs myself to test the model.

Or should I split the dataset first and then perform preprocessing on the training set, applying the fitted scaler and imputer to the validation set?

Flavio Brienza

2 Answers


In principle, you can do many preprocessing activities (e.g., converting data types, removing rows with NaN values, etc.) on the entire dataset, since for these steps it makes no difference whether they are applied before or after splitting into training and test sets.

However, when using, for instance, a StandardScaler, you should fit the scaler on the training data (usually including the validation set) and transform both the training and test data with this fitted scaler. This prevents information from the unseen test set from spilling over into the training process. For some further discussion of the fit and transform of a StandardScaler, you can look here: StandardScaler before or after splitting data - which is better?.
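A minimal sketch of that fit/transform split, using scikit-learn and made-up toy numbers (the values are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for a real feature matrix (hypothetical values).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std

print(scaler.mean_)  # [2.5] -- computed from the training set alone
```

Note that the test value 10.0 never influences `scaler.mean_` or `scaler.scale_`; it is only transformed with them.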

The same is true for removing outliers or imputing missing values (e.g., with the mean of the respective column). In that case, you should compute the imputation statistic on the training data only, which means splitting the dataset before imputing.
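The same pattern for mean imputation, sketched with scikit-learn's SimpleImputer on made-up values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one feature column with missing entries.
X_train = np.array([[1.0], [np.nan], [3.0]])
X_valid = np.array([[np.nan]])

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # training mean = (1 + 3) / 2 = 2
X_valid_imp = imputer.transform(X_valid)      # NaN filled with the training mean

print(X_valid_imp)  # [[2.]]
```

The validation NaN is filled with the mean learned from the training rows, not with a statistic that saw the validation data.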

Usually, the validation set is treated as part of the training set (with K-fold cross-validation, there might not even be a fixed validation set), while the test set is separated as early as possible.
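With K-fold cross-validation, one common way to keep preprocessing leak-free is to put it inside a scikit-learn Pipeline, so the scaler is re-fitted on each fold's training portion. A small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy dataset, purely for illustration.
X, y = make_classification(n_samples=100, random_state=0)

# The scaler lives inside the pipeline, so each CV fold
# fits it on that fold's training rows only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling outside the loop (on all of X) would let each fold's held-out rows influence the scaling statistics.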

Hope this gives you a bit of guidance.

justinlk

It's better to split the data into training and testing sets before doing things like scaling and imputation. This is because these steps are usually done using parameters learned from the training set, and then the same changes/parameters are applied to the testing set.
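The split-first workflow can be sketched as follows, assuming scikit-learn and an arbitrary toy feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature dataset.
X = np.arange(20, dtype=float).reshape(-1, 1)

# 1. Split first.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# 2. Learn the parameters from the training set only...
scaler = StandardScaler().fit(X_train)

# 3. ...and apply the same transformation to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The test set is scaled with the training set's mean and standard deviation, which mirrors how the model would transform genuinely new data in production.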

The test data should be representative of the data that the model is expected to encounter in real-life/production scenarios. This means that the test data should be in the same format and distribution as the data that the model will be used on in the future. This helps to ensure that the model will generalize well and make accurate predictions on new, unseen data.

If you do the split after these steps, you might accidentally include information from the test set in the training set (data leakage), which can make it look like the model works better than it actually does.
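A tiny numpy-only illustration of why this matters: the normalization statistic computed on the full dataset differs from the one computed on the training rows alone, so preprocessing before the split quietly injects test-set information (the data here is randomly generated for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
train, test = X[:80], X[80:]

# Leaky: the mean uses the test rows too.
leaky_mean = X.mean()

# Correct: the mean comes from the training rows only.
clean_mean = train.mean()

# The two normalizations generally differ, so scaling before the
# split gives the model a slightly different (leaked) view of the data.
print(leaky_mean, clean_mean)
```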