I generated a dataset using ChatGPT. It has 9000 records for a 6-class sentiment analysis task (classes 0, 1, 2, 3, 4, 5). I used 3000 records for training and 1200 records each for validation and testing.
These are the class counts:
For training: 0: 593, 1: 596, 2: 604, 3: 614, 4: 571, 5: 622
For validation: 0: 193, 1: 221, 2: 199, 3: 182, 4: 212, 5: 193
For testing: 0: 214, 1: 183, 2: 197, 3: 204, 4: 217, 5: 185
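
For reference, the per-split counts above were tallied roughly like this (the label-list variable names are placeholders for the integer labels of each split, not names from my actual code):

```python
from collections import Counter

# train_labels, val_labels, test_labels: placeholder lists of labels 0-5
for name, labels in [("training", train_labels),
                     ("validation", val_labels),
                     ("testing", test_labels)]:
    print(name, dict(sorted(Counter(labels).items())))
```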
I fine-tuned BERT ('bert-base-uncased') for 25 epochs with a learning rate of 2e-5.
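
For context, here is a minimal sketch of the fine-tuning setup using the Hugging Face Trainer. The dataset variables are placeholders, and the batch size and max sequence length are assumptions I did not state above:

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6)  # 6 sentiment classes: 0-5

class SentimentDataset(torch.utils.data.Dataset):
    """Wraps raw texts and integer labels into tokenized examples."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=128)  # max length is an assumption
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(
    output_dir="bert-sentiment",
    num_train_epochs=25,              # as described above
    learning_rate=2e-5,               # as described above
    per_device_train_batch_size=16,   # assumption, not stated above
)

# train_texts/train_labels and val_texts/val_labels are placeholders
trainer = Trainer(model=model, args=args,
                  train_dataset=SentimentDataset(train_texts, train_labels),
                  eval_dataset=SentimentDataset(val_texts, val_labels))
trainer.train()
```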
These are the results:
Validation
Validation Accuracy: 0.5666666666666667
Validation Classification Report:
| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.53 | 0.69 | 0.60 | 193 |
| 1 | 0.54 | 0.50 | 0.52 | 221 |
| 2 | 0.52 | 0.49 | 0.51 | 199 |
| 3 | 0.54 | 0.50 | 0.52 | 182 |
| 4 | 0.55 | 0.60 | 0.57 | 212 |
| 5 | 0.77 | 0.63 | 0.69 | 193 |
| accuracy | | | 0.57 | 1200 |
| macro avg | 0.58 | 0.57 | 0.57 | 1200 |
| weighted avg | 0.57 | 0.57 | 0.57 | 1200 |
Testing
Test Accuracy: 0.5658333333333333
Test Classification Report:
| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.57 | 0.68 | 0.62 | 214 |
| 1 | 0.49 | 0.56 | 0.52 | 183 |
| 2 | 0.58 | 0.52 | 0.55 | 197 |
| 3 | 0.59 | 0.49 | 0.53 | 204 |
| 4 | 0.53 | 0.57 | 0.55 | 217 |
| 5 | 0.68 | 0.58 | 0.62 | 185 |
| accuracy | | | 0.57 | 1200 |
| macro avg | 0.57 | 0.56 | 0.57 | 1200 |
| weighted avg | 0.57 | 0.57 | 0.57 | 1200 |
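
The accuracy and report above come from a standard scikit-learn evaluation; a simplified sketch (variable names are placeholders, continuing from the training sketch above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# `trainer` and `SentimentDataset` are from the training sketch above;
# `test_texts` / `test_labels` are placeholders for the held-out split.
test_dataset = SentimentDataset(test_texts, test_labels)
pred_output = trainer.predict(test_dataset)           # logits + true labels
y_pred = np.argmax(pred_output.predictions, axis=1)   # predicted class ids
y_true = pred_output.label_ids

print("Test Accuracy:", accuracy_score(y_true, y_pred))
print("Test Classification Report:")
print(classification_report(y_true, y_pred))
```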
I also trained with different amounts of data, but the shape of the learning curve stays the same, only with different accuracies. My questions are:
- How can I increase the accuracy, and what could the underlying issue be?
- Could the issue be that some words appear in multiple classes? (I see the same words across many classes in this dataset; see the sketch below.)
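
To illustrate the second question, here is a rough sketch of how the word overlap between classes could be measured. `texts` and `labels` are placeholders for the raw sentences and their class ids:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Count word frequencies per class (simple whitespace tokenization).
per_class_tokens = defaultdict(Counter)
for text, label in zip(texts, labels):
    per_class_tokens[label].update(text.lower().split())

# Compare the top-k most frequent words of each pair of classes.
top_k = 100
top_words = {c: {w for w, _ in cnt.most_common(top_k)}
             for c, cnt in per_class_tokens.items()}

for a, b in combinations(sorted(top_words), 2):
    overlap = len(top_words[a] & top_words[b]) / top_k
    print(f"classes {a} vs {b}: {overlap:.0%} of top-{top_k} words shared")
```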
