As BERT is bidirectional (uses bi-directional transformer), is it possible to use it for the next-word-predict task? If yes, what needs to be tweaked?
1 Answer
BERT can't be used for next-word prediction, at least not given the current state of research on masked language modeling.
BERT is trained on a masked language modeling task and therefore you cannot "predict the next word". You can only mask a word and ask BERT to predict it given the rest of the sentence (both to the left and to the right of the masked word).
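To make the masked-prediction setup concrete, here is a minimal sketch using the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint; both choices are assumptions for illustration, since the answer itself does not prescribe any particular toolkit or model.

```python
# Minimal sketch: BERT filling in a masked word from its left AND right context.
# (Library and checkpoint are illustrative assumptions, not part of the answer.)
from transformers import pipeline

# Load a fill-mask pipeline backed by a pretrained BERT checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token using the words on both sides of it.
for prediction in fill_mask("The capital of France is [MASK] ."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Note that the mask can sit anywhere in the sentence; that is exactly what distinguishes this objective from left-to-right next-word prediction.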
As a consequence, you can't sample text from BERT as you would from a normal autoregressive language model. However, BERT can be viewed as a Markov Random Field Language Model and used for text generation as such. See the article BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model for details. The authors released source code and a Google Colab notebook.
Update: the authors of the MRF article later found that their analysis was flawed and that BERT is not actually an MRF; see this.
Update 2: despite not being designed for next-word prediction, there have been attempts to use BERT that way. Here you can find a project that does next-word prediction with BERT, XLNet, RoBERTa, etc.
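The linked project's code is not reproduced here, but the underlying trick is roughly this: append a [MASK] token after the existing text and let BERT fill it in. A rough sketch under the same library and checkpoint assumptions as above:

```python
# Rough sketch of the "next word" trick: put a [MASK] right after the prompt.
# This only approximates next-word prediction, since BERT was not trained for
# left-to-right generation. (Hypothetical illustration, not the linked
# project's actual code.)
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

prompt = "I went to the store to buy some"
for prediction in fill_mask(prompt + " [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Because the mask here has almost no right-hand context, the predictions tend to be noisier than in the genuine fill-in-the-blank setting BERT was trained for.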