Suppose we are training a BERT model.
Pre-training is first performed on a large dataset A.
The model is then fine-tuned on a dataset B that is a subset of A, now annotated with class labels for a text classification problem. B is split into train, validation and test sets to compute the classifier metrics.
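To make the setup concrete, here is a minimal sketch in plain Python. The names `corpus_a`, `dataset_b` and `split_dataset`, the toy texts, and the 80/10/10 split are all hypothetical stand-ins of mine, not part of any real pipeline. It only illustrates that every text in B's test split also appears in the pre-training corpus A.

```python
import random


def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle labeled examples and split them into train/validation/test lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Toy stand-ins (assumption): A is unlabeled text, B reuses part of A with labels.
corpus_a = [f"document {i}" for i in range(100)]
dataset_b = [(text, i % 3) for i, text in enumerate(corpus_a[:50])]

train_b, val_b, test_b = split_dataset(dataset_b)

# Test texts that were also present in the pre-training corpus.
overlap = {text for text, _ in test_b} & set(corpus_a)
print(len(train_b), len(val_b), len(test_b), len(overlap))
```

In this toy case the overlap equals the size of the test split, which is exactly the situation I am asking about.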
Is this methodologically correct?
In a sense, the data used to validate the model has already been "seen" during pre-training, so in my opinion the model is not being evaluated on truly "new" data and is somehow "cheating".
Elsewhere I have seen the argument that using this data is acceptable because the training objectives of the pre-training and fine-tuning phases are different and, more importantly, no labels are used in the pre-training phase.
Could someone confirm whether it would be better to exclude the fine-tuning data (whether the train, validation or test split) from the initial pre-training?
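For reference, the stricter alternative I have in mind could look roughly like the sketch below: remove the held-out texts of B from A before pre-training. The helpers `normalize` and `exclude_held_out` are hypothetical names of mine, the example sentences are made up, and exact matching after simple normalization is only an assumption (near-duplicates would need something more robust).

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def exclude_held_out(pretrain_texts, held_out_texts):
    """Return the pre-training texts whose normalized form is not held out."""
    blocked = {normalize(t) for t in held_out_texts}
    return [t for t in pretrain_texts if normalize(t) not in blocked]


# Toy data (assumption): two of the four texts in A are held out for evaluating B.
corpus_a = ["The cat sat.", "Dogs bark loudly.", "Birds  fly south.", "Fish swim."]
held_out_b = ["birds fly south.", "Fish swim."]

pretrain_corpus = exclude_held_out(corpus_a, held_out_b)
print(pretrain_corpus)  # ['The cat sat.', 'Dogs bark loudly.']
```

Here only the validation and test texts would be passed as `held_out_texts`; whether the train split of B should be excluded as well is part of my question.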