
Basically, I am trying to understand how question answering works in the case of BERT. The code for both classes, QuestionAnswering and Classification, is pasted below for reference. My understanding is:

import torch.nn as nn
from torch.nn import CrossEntropyLoss
# BertModel / PreTrainedBertModel come from the pytorch-pretrained-bert package
# (import path may differ slightly depending on the library version)
from pytorch_pretrained_bert.modeling import BertModel, PreTrainedBertModel

class BertForSequenceClassification(PreTrainedBertModel):
    def __init__(self, config, num_labels=2):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
        _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            return loss
        else:
            return logits

In the above code, pooled_output is the output that gets used, as seen in the line: _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)

And in the QnA code below, the encoder layer output (i.e., sequence_output) is the output that gets used instead, as seen in the line: sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)

class BertForQuestionAnswering(PreTrainedBertModel):
    def __init__(self, config):
        super(BertForQuestionAnswering, self).__init__(config)
        self.bert = BertModel(config)
        # TODO check with Google if it's normal there is no dropout on the token classifier of SQuAD in the TF version
        # self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.qa_outputs = nn.Linear(config.hidden_size, 2)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, start_positions=None, end_positions=None):
        sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)  # split the 2 scores per token
        start_logits = start_logits.squeeze(-1)             # (batch, seq_len)
        end_logits = end_logits.squeeze(-1)                  # (batch, seq_len)
        # (training-time loss computation from the original source omitted here)
        return start_logits, end_logits
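
To make the two outputs concrete, here is a minimal shape check I put together (the checkpoint name and the dummy token ids are just illustrative assumptions):

import torch
from pytorch_pretrained_bert import BertModel

# Load a pretrained encoder (downloads weights on first use)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Dummy batch of 1 sequence with 6 token ids (values are placeholders)
input_ids = torch.tensor([[101, 2054, 2003, 2023, 1029, 102]])

with torch.no_grad():
    sequence_output, pooled_output = model(input_ids, output_all_encoded_layers=False)

print(sequence_output.shape)  # torch.Size([1, 6, 768]) -> one vector per token
print(pooled_output.shape)    # torch.Size([1, 768])    -> one vector for the whole sequence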

Now my questions are:

  1. Why are there 2 logits returned per token in the Question Answering case?

  2. What is the difference between the encoder layer output and the pooled output?

  3. Why is the encoder layer output (sequence_output) used in the QnA case, but the pooled output in the classification case?

Sandeep Bhutani

1 Answer

  1. For Question Answering, you need 2 logits per token: one for the start position and one for the end position. Based on these 2 sets of logits, you get an answer span (denoted by its start/end positions), as sketched below.
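
As a rough sketch (the helper name and the greedy argmax decoding are my own simplifications, and it ignores constraints such as the end coming after the start), turning those logits into an answer span could look like this:

import torch

def extract_span(start_logits, end_logits, input_ids, tokenizer):
    # Assumes a batch of size 1: pick the most likely start and end positions independently
    start = torch.argmax(start_logits, dim=-1).item()
    end = torch.argmax(end_logits, dim=-1).item()
    # Decode the tokens between start and end (inclusive) back to text
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0, start:end + 1].tolist())
    return " ".join(tokens)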

  2. In the source code, you have: pooled_output = self.pooler(sequence_output)
    If you take a look at the pooler, there is a comment:

# We "pool" the model by simply taking the hidden state corresponding
# to the first token.

So the sequence output contains the representation of every token, while the pooled_output is just a linear layer (followed by a tanh activation) applied to the hidden state of that first token ([CLS]).
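
For reference, the pooler in the library source is essentially a dense layer plus a tanh applied to the first token's hidden state, roughly:

import torch.nn as nn

class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # "Pool" by taking the hidden state of the first token ([CLS]),
        # then apply a dense layer and a tanh activation
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output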

  3. In the classification case, you just need a global representation of your input and you predict the class from that representation. There is no need to access every token of the sequence; you want a single representation of the whole sequence.
    But in the Question Answering case, in order to predict the answer span, you need to decide which token will be the start and which token will be the end of the span. For this, you need to consider every token independently.
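
A small shape-only sketch of that difference (the sizes are made up for illustration):

import torch
import torch.nn as nn

batch, seq_len, hidden_size = 2, 16, 768
sequence_output = torch.randn(batch, seq_len, hidden_size)  # one vector per token
pooled_output = torch.randn(batch, hidden_size)             # one vector per sequence

# Classification: one prediction per sequence, from the pooled vector
classifier = nn.Linear(hidden_size, 2)
print(classifier(pooled_output).shape)            # torch.Size([2, 2])

# QA: a start score and an end score for every token
qa_outputs = nn.Linear(hidden_size, 2)
start_logits, end_logits = qa_outputs(sequence_output).split(1, dim=-1)
print(start_logits.squeeze(-1).shape)             # torch.Size([2, 16])
print(end_logits.squeeze(-1).shape)               # torch.Size([2, 16])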
Astariul