
I had a question related to SMOTE. If you have an imbalanced data set, is it correct to use SMOTE when you are using BERT? I believe I read somewhere that you do not need to do this since BERT takes this into account, but I'm unable to find the article where I read that. Either from your own research or experience, would you say that oversampling using SMOTE (or some other algorithm) is useful when classifying with a BERT model? Or would it be redundant/unnecessary?

QMan5

2 Answers


I don't know of any specific recommendation related to BERT, but my general advice is this:

  • Do not systematically use oversampling when the data is imbalanced, at least not before specifically identifying performance issues caused by the imbalance (a quick way to check this is sketched after this list). I see many questions here on DataScienceSE about problems caused by using oversampling blindly (and often incorrectly).
  • In general, resampling doesn't work well with text data, because the diversity of language cannot be simulated this way; there is a high risk of obtaining an overfitted model.
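To make the first point concrete, here is a minimal sketch of checking per-class metrics before deciding that oversampling is needed. The labels and predictions are made up purely for illustration:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical held-out set: class 1 is the rare class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Per-class precision/recall/F1 show whether the minority class is
# actually suffering; overall accuracy alone can hide this.
print(classification_report(y_true, y_pred, digits=3))
print(confusion_matrix(y_true, y_pred))
```

If the minority-class recall and F1 are acceptable here, oversampling is unlikely to buy you anything; if they are poor, that is the point at which techniques for imbalance become worth trying.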
Erwan

Sorry for the delayed response, but it is possible that others may also benefit from it. I have had the same problem before. I would say an effective way to overcome it in most cases is to first try augmentation, using some techniques from nlpaug and backtranslation from several languages (see here), and then drop duplicates.
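A minimal sketch of that pipeline, assuming nlpaug is installed along with the torch/transformers models it downloads for backtranslation. The texts, labels, and translation model names below are illustrative, not prescriptive:

```python
import nlpaug.augmenter.word as naw
import pandas as pd

# Hypothetical minority-class examples to augment.
texts = ["the service was slow and the staff unhelpful"]
labels = [1]

# Synonym replacement via WordNet (needs the nltk wordnet corpus).
syn_aug = naw.SynonymAug(aug_src="wordnet")

# Backtranslation English -> German -> English; other language pairs
# can be swapped in by changing the model names.
bt_aug = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)

augmented_texts, augmented_labels = [], []
for text, label in zip(texts, labels):
    for aug in (syn_aug, bt_aug):
        out = aug.augment(text)
        # Recent nlpaug versions return a list; older ones a string.
        out = out if isinstance(out, list) else [out]
        augmented_texts.extend(out)
        augmented_labels.extend([label] * len(out))

# Combine originals and augmentations, then drop exact duplicates so
# near-identical generations don't inflate the training set.
df = pd.DataFrame({"text": texts + augmented_texts,
                   "label": labels + augmented_labels})
df = df.drop_duplicates(subset="text").reset_index(drop=True)
print(df)
```

Chaining augmenters this way keeps labels aligned with their generated texts, and the final deduplication step is what keeps backtranslations that round-trip to the original sentence from being counted twice.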