
Suppose we are training a BERT model.

An initial pre-training is performed with a large dataset A.

Subsequently, fine-tuning is performed with a dataset B that is a subset of A, but now with class labels for a text classification problem. Dataset B is split into the corresponding train, validation, and test sets to obtain the classifier metrics (see the split sketch just below).
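For concreteness, a minimal sketch of such a split, assuming dataset B is available as parallel lists of texts and labels; the 60/20/20 proportions and the placeholder data are my own assumptions, not part of the question:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for dataset B: parallel lists of texts and class labels.
texts = [f"document {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First split off a held-out test set, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)
# Result: 60% train, 20% validation, 20% test for the classifier metrics.
```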

Is this methodologically correct?

In a way, the data used to validate the model has already been "seen" during the pre-training phase, so in my opinion the model is not being evaluated on "new" data and is somehow "cheating".

Elsewhere I have seen the argument that using this data is fine because the training objectives of the pre-training and fine-tuning phases are different and, more importantly, no labels are used during pre-training.

Could someone confirm whether it would be better not to use the fine-tuning data (whether train, validation, or test) in the initial pre-training?

Álvaro Loza

1 Answer


I'd say that it's correct.

BERT pre-training doesn't use your classification labels; it relies on two self-supervised objectives (the first is sketched in code after this list):

  • masked language modelling (mask words in a sentence and predict them from the surrounding context)
  • next sentence prediction (given two sentences, predict whether the second actually follows the first in the corpus)
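As a concrete illustration of the masked-language-model objective, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name and the example sentence are assumptions for illustration, not anything from the question:

```python
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Mask one token in the middle of a sentence; no class labels are involved.
text = "The quick brown fox [MASK] over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Recover the most likely token for the masked position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # e.g. "jumps"
```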

I don't think there is a "right" or "wrong" way to do this here. You want to have a pre-trained BERT model, and it could come from an unknown dataset or from a dataset that you have. The point of a model like BERT is that it can handle large amounts of data, so it's likely that some "similar" data will appear in both your pre-training and your fine-tuning sets anyway.

As long as you test your end-model properly and check that it's robust, you should be grand.
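For example, a hedged sketch of what "testing the end-model properly" might look like on the held-out test split; the checkpoint path, texts, and labels here are placeholders I've assumed, not anything from the answer:

```python
from transformers import pipeline
from sklearn.metrics import classification_report

# Load the fine-tuned classifier (hypothetical local checkpoint path).
classifier = pipeline("text-classification", model="path/to/finetuned-bert")

# Held-out test split from dataset B (placeholder values).
X_test = ["an example document", "another example document"]
y_test = ["LABEL_0", "LABEL_1"]

# Predict and report standard classification metrics on unseen labelled data.
y_pred = [classifier(text)[0]["label"] for text in X_test]
print(classification_report(y_test, y_pred))
```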

Valentin Calomme