For any NER task, we need a sequence of words and their corresponding labels. To extract features for these words from BERT, the words must first be tokenized into subwords.
For example, the word 'infrequent' (with label B-count) will be tokenized into ['in', '##fr', '##e', '##quent']. How will its label be represented?
According to the BERT paper:
> We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.
So, for the subwords ['in', '##fr', '##e', '##quent'], should the labels be ['B-count', 'B-count', 'B-count', 'B-count'], where we propagate the word's label to all of its subwords? Or should they be ['B-count', 'X', 'X', 'X'], where we keep the original label on the first subword and use a special label "X" for the remaining subwords of that word?
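To make the two options concrete, here is a minimal sketch of both alignment strategies. It assumes the subwords have already been produced by a WordPiece tokenizer; `align_labels` is a hypothetical helper written for illustration, not part of any library.

```python
def align_labels(word_labels, word_subwords, strategy="first"):
    """Align word-level NER labels to subword tokens.

    strategy="propagate": repeat the word's label on every subword.
    strategy="first": keep the label on the first subword and mark the
    remaining subwords with "X" (typically ignored by the loss later).
    """
    aligned = []
    for label, subwords in zip(word_labels, word_subwords):
        for i in range(len(subwords)):
            if i == 0 or strategy == "propagate":
                aligned.append(label)
            else:
                aligned.append("X")
    return aligned

# 'infrequent' (label B-count) -> WordPiece subwords from the example
word_subwords = [["in", "##fr", "##e", "##quent"]]
word_labels = ["B-count"]

print(align_labels(word_labels, word_subwords, strategy="propagate"))
# ['B-count', 'B-count', 'B-count', 'B-count']
print(align_labels(word_labels, word_subwords, strategy="first"))
# ['B-count', 'X', 'X', 'X']
```

With the "first" strategy, the "X" positions would be excluded from the loss (e.g. mapped to an ignore index) so only the first subword's prediction is trained and evaluated.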
Any help will be appreciated.