
I realize that this could be considered a duplicate of this question, Is using samples from the same person in both trainset and testset considers being a data leakage?, where it is stated that "The testing data should not be linked to the training data in any way" in order to prevent data leakage. However, how should I proceed if it is not possible to split the dataset into a training set and a test set that are not linked in any way?

My dataset consists of 839 time series samples which I want to classify into 14 classes. However, the dataset is imbalanced, as can be seen in Figure 1. It is composed of the recordings of 16 volunteers. Unfortunately, for a number of these classes, data is only available from a subset of the volunteers. Figure 2 shows that for some classes only data from one specific volunteer is available; the number in the figure indicates how many volunteers have samples available for that specific class.

I want to validate my machine learning models using a separate test set. My initial idea was to leave the data of one volunteer out for the test set. But as stated above, this is not possible since for two classes there is only data available from one volunteer. The second best approach that I could come up with is leaving out the data from one volunteer (who has samples available for 12 of the 14 classes), alongside a selected few samples of the two minority classes that are not present in the samples of this specific volunteer. This means that my test set would consist of samples which are completely separated from the training set for 12 of the 14 classes. The samples of the test set for the 2 remaining classes would come from the same volunteer on which the models are trained (since there is only data available from one volunteer for each of these classes). A sketch of what I mean is below.
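To make the split concrete, here is a minimal sketch of how it could be built. The metadata frame, the volunteer id "V07", and the class names are all made-up placeholders standing in for my real data:

```python
import numpy as np
import pandas as pd

# Toy metadata standing in for the real dataset: one row per sample,
# recording which volunteer it came from and its class label.
meta = pd.DataFrame({
    "volunteer": ["V01", "V01", "V02", "V02", "V07", "V07"],
    "label": ["walk", "class_13", "class_13", "class_14", "walk", "run"],
})

held_out_volunteer = "V07"                            # volunteer reserved for testing
single_volunteer_classes = ["class_13", "class_14"]   # classes with only one volunteer

is_held_out = meta["volunteer"] == held_out_volunteer
test_mask = is_held_out.copy()

# Move a few samples of each single-volunteer class into the test set;
# these unavoidably share a volunteer with the training set.
rng = np.random.default_rng(42)
for cls in single_volunteer_classes:
    candidates = meta.index[(meta["label"] == cls) & ~is_held_out]
    picked = rng.choice(candidates, size=min(2, len(candidates)), replace=False)
    test_mask.loc[picked] = True

train_idx = meta.index[~test_mask]
test_idx = meta.index[test_mask]
print("train:", list(train_idx), "test:", list(test_idx))
```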

As stated above, I know that this will introduce some form of data leakage. Is there a good alternative that won't introduce data leakage (besides adding more data)? If not, would the StratifiedKFold method from sklearn be a good alternative for validating my model? (With StratifiedKFold I would use a pipeline in order to perform preprocessing only on the training part of the folds.)
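A sketch of what I have in mind for that cross-validation setup follows; the synthetic data, scaler, and classifier are placeholder choices, not my actual preprocessing or model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for my 839 samples and 14 classes.
X, y = make_classification(n_samples=839, n_classes=14,
                           n_informative=10, random_state=0)

# Wrapping preprocessing in a Pipeline means the scaler is fit on the
# training part of each fold only, so no statistics leak into validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
print(f"macro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```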

I would really appreciate an answer and some more explanation of data leakage for this specific case.

Figure 1: Class imbalance. Figure 2: Data availability of volunteers per class.

JonasL

1 Answer


In multi-patient datasets a typical cross-validation split is leave-subject-out; check this out: "What is difference between leave one subject out and leave one out cross validation". Basically, each subject is used as the test subject in a different fold.
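A minimal sketch with scikit-learn's LeaveOneGroupOut, assuming a `groups` array that holds the volunteer id of each sample (all arrays and the classifier here are made-up stand-ins):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-ins: features, labels, and the volunteer id per sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = rng.integers(0, 3, size=60)
groups = rng.integers(0, 6, size=60)  # volunteer id of each sample

# Each fold holds out all samples of exactly one volunteer for testing.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=groups):
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    print(f"held-out volunteer {groups[test_idx][0]}: "
          f"accuracy {clf.score(X[test_idx], y[test_idx]):.2f}")
```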

Your dataset has a few issues. Firstly, depending on your use case there may or may not be data leakage. For example, if it is possible to first observe enough time series samples of a subject for training, say the first half, then it's fine to use the second half for testing. Secondly, two of your classes each appear for only a single subject. For these two, you can't design an algorithm based on other subjects' data. Thirdly, and more importantly, most classes don't have enough samples; for example, the eat_soup class has only 3 samples.
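For the first point, a rough sketch of such a within-subject temporal split; the recording timestamps and volunteer ids here are invented stand-ins for whatever ordering information your recordings carry:

```python
import numpy as np

def temporal_split(idx, timestamps, train_frac=0.5):
    """Order one subject's samples by time; the earlier part trains,
    the later part tests, so the model never peeks into the future."""
    ordered = idx[np.argsort(timestamps[idx])]
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

# Made-up stand-ins: volunteer ids and recording times of 10 samples.
volunteers = np.array(["V01"] * 6 + ["V02"] * 4)
timestamps = np.array([3.0, 1.0, 5.0, 2.0, 4.0, 6.0, 1.0, 2.0, 3.0, 4.0])

subj_idx = np.where(volunteers == "V01")[0]
train_idx, test_idx = temporal_split(subj_idx, timestamps)
print("train:", train_idx, "test:", test_idx)
```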

In summary, you should use leave-subject-out, but only on the classes that appear for a substantial number of subjects and have enough samples in total.

LuckyLuke