
When BERT is used for masked language modeling, it masks a token and then tries to predict it.

What are the candidate tokens BERT can choose from? Does it just predict an integer (like a regression problem) and then use that token? Or does it do a softmax over all possible word tokens? For the latter, isn't there an enormous number of possible tokens? I have a hard time imagining that BERT treats it like a classification problem where # classes = # all possible word tokens.

From where does BERT get the token it predicts?

Nick Koprowicz

1 Answer


There is a token vocabulary, that is, the set of all possible tokens that can be handled by BERT. You can find the vocabulary used by one of the variants of BERT (BERT-base-uncased) here.

You can see that it contains one token per line, with a total of 30522 tokens. The softmax is computed over them.
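For concreteness, here is a minimal sketch (assuming the Hugging Face `transformers` library and PyTorch are installed; the example sentence is arbitrary) that loads bert-base-uncased, checks the vocabulary size, and takes a softmax over the whole vocabulary at a masked position:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

print(len(tokenizer))  # vocabulary size: 30522 for bert-base-uncased

text = "The capital of France is [MASK]."  # illustrative example
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Softmax over the full vocabulary at the masked position,
# then look at the five most probable tokens
probs = torch.softmax(logits[0, mask_pos], dim=-1)
top5 = torch.topk(probs, k=5, dim=-1)
print(tokenizer.convert_ids_to_tokens(top5.indices[0].tolist()))
```

So the prediction head outputs one score per vocabulary entry, and the predicted token is simply the highest-probability entry after the softmax; no regression over token indices is involved.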

The token granularity in the BERT vocabulary is subwords. This means that a token does not necessarily represent a complete word, but possibly just a piece of a word. Before feeding text to BERT, it must be segmented into subwords according to the subword vocabulary mentioned before. Having a subword vocabulary instead of a word-level vocabulary is what makes it possible for BERT (and any other subword-based model) to need only a "small" vocabulary to represent any string (within the character set seen in the training data).
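As a quick illustration of that segmentation (again assuming the Hugging Face `transformers` library; the example words are arbitrary and the exact splits depend on the learned WordPiece vocabulary), you can tokenize a few words directly:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A frequent word typically stays as a single token
print(tokenizer.tokenize("playing"))

# A rarer word is broken into subword pieces (continuation pieces start with "##")
print(tokenizer.tokenize("electrocardiogram"))
```

Because rare or unseen words can always be broken down into known pieces (down to single characters if necessary), the 30522-entry vocabulary is enough to cover essentially any input text.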

noe