For any NER task, we need a sequence of words and their corresponding labels. To extract features for these words from BERT, the words must first be tokenized into subwords.
For example, the word 'infrequent' (with label B-count) will be tokenized into ['in', '##fr', '##e', '##quent']. How will its label be represented?
According to the BERT paper:
> We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.
So, for the subwords ['in', '##fr', '##e', '##quent'], should the labels be ['B-count', 'B-count', 'B-count', 'B-count'], where we propagate the word's label to all of its subwords? Or should they be ['B-count', 'X', 'X', 'X'], where we keep the original label on the first subword and use a special label "X" for the remaining subwords of that word?
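To make the two options concrete, here is a minimal sketch of both alignment strategies. It assumes the subwords have already been produced by a WordPiece tokenizer; `align_labels` is a hypothetical helper written for illustration, not part of any library.

```python
def align_labels(word_labels, word_subwords, strategy="first"):
    """Align word-level NER labels to subword tokens.

    strategy="propagate": repeat the word's label on every subword.
    strategy="first": keep the label on the first subword and mark the
    remaining subwords with "X" (typically ignored by the loss later).
    """
    aligned = []
    for label, subwords in zip(word_labels, word_subwords):
        for i in range(len(subwords)):
            if i == 0 or strategy == "propagate":
                aligned.append(label)
            else:
                aligned.append("X")
    return aligned

# 'infrequent' (label B-count) -> WordPiece subwords from the example
word_subwords = [["in", "##fr", "##e", "##quent"]]
word_labels = ["B-count"]

print(align_labels(word_labels, word_subwords, strategy="propagate"))
# ['B-count', 'B-count', 'B-count', 'B-count']
print(align_labels(word_labels, word_subwords, strategy="first"))
# ['B-count', 'X', 'X', 'X']
```

With the "first" strategy, the "X" positions would be excluded from the loss (e.g. mapped to an ignore index) so only the first subword's prediction is trained and evaluated.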
Any help will be appreciated.