
I had a question related to SMOTE. If you have an imbalanced data set, is it correct to use SMOTE when you are using BERT? I believe I read somewhere that you do not need to do this since BERT takes this into account, but I'm unable to find the article where I read that. Either from your own research or experience, would you say that oversampling using SMOTE (or some other algorithm) is useful when classifying with a BERT model? Or would it be redundant/unnecessary?

QMan5

2 Answers


I don't know of any specific recommendation related to BERT, but my general advice is this:

  • Do not systematically use oversampling when the data is imbalanced, at least not before specifically identifying performance issues caused by the imbalance (a quick way to check this is sketched after this list). I see many questions here on DataScienceSE about problems caused by using oversampling blindly (and often incorrectly).
  • In general, resampling doesn't work well with text data, because the diversity of language cannot be simulated this way; there is a high risk of obtaining an overfitted model.
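To make the first point concrete, here is a minimal sketch of checking per-class metrics before deciding that oversampling is needed. The labels and predictions are made up purely for illustration:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical held-out set: class 1 is the rare class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Per-class precision/recall/F1 show whether the minority class is
# actually suffering; overall accuracy alone can hide this.
print(classification_report(y_true, y_pred, digits=3))
print(confusion_matrix(y_true, y_pred))
```

If the minority-class recall and F1 are acceptable here, oversampling is unlikely to buy you anything; if they are poor, that is the point at which techniques for imbalance become worth trying.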
Erwan

Sorry for the delayed response, but it is possible that others may also benefit from it. I have had the same problem before. I would say an effective way to overcome it in most cases is to first try augmentation, using some techniques from nlpaug and backtranslation from several languages (see here), and then drop duplicates.
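A minimal sketch of that pipeline, assuming nlpaug is installed along with the torch/transformers models it downloads for backtranslation. The texts, labels, and translation model names below are illustrative, not prescriptive:

```python
import nlpaug.augmenter.word as naw
import pandas as pd

# Hypothetical minority-class examples to augment.
texts = ["the service was slow and the staff unhelpful"]
labels = [1]

# Synonym replacement via WordNet (needs the nltk wordnet corpus).
syn_aug = naw.SynonymAug(aug_src="wordnet")

# Backtranslation English -> German -> English; other language pairs
# can be swapped in by changing the model names.
bt_aug = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)

augmented_texts, augmented_labels = [], []
for text, label in zip(texts, labels):
    for aug in (syn_aug, bt_aug):
        out = aug.augment(text)
        # Recent nlpaug versions return a list; older ones a string.
        out = out if isinstance(out, list) else [out]
        augmented_texts.extend(out)
        augmented_labels.extend([label] * len(out))

# Combine originals and augmentations, then drop exact duplicates so
# near-identical generations don't inflate the training set.
df = pd.DataFrame({"text": texts + augmented_texts,
                   "label": labels + augmented_labels})
df = df.drop_duplicates(subset="text").reset_index(drop=True)
print(df)
```

Chaining augmenters this way keeps labels aligned with their generated texts, and the final deduplication step is what keeps backtranslations that round-trip to the original sentence from being counted twice.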