SMOTE Oversampling for Text Classification with Multiple Input Features
I have a text classification problem where each input has two features, a text and a language:
The text is a string variable. The language is a string variable that takes values such as "EN", "FR", "DE", etc. The output is an imbalanced categorical variable.
As in a typical NLP pipeline, the text feature was tokenized with the Keras tokenizer and then padded, while the language variable was one-hot encoded.
I need to oversample with SMOTE, but the two input features (the padded sequences of the text and the encoded language) have different shapes. How can I combine them into a single input that I can pass to SMOTE?
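To make the shape question concrete, here is a minimal sketch in plain Python. It assumes both features are 2D with one row per sample, so that they differ only in the number of columns. The toy values and the helper names `combine_features` and `split_features` are hypothetical and only stand in for the real Keras and one-hot outputs.

```python
# Hypothetical toy data: stand-ins for the padded Keras sequences and the
# one-hot encoded language column (one row per sample in both blocks).
padded_text = [
    [0, 0, 4, 7],  # sample 0, padded to length 4
    [0, 3, 9, 2],  # sample 1
    [5, 1, 8, 6],  # sample 2
]
language_onehot = [
    [1, 0, 0],  # "EN"
    [0, 1, 0],  # "FR"
    [0, 0, 1],  # "DE"
]


def combine_features(text_rows, lang_rows):
    """Concatenate the two feature blocks column-wise, one row per sample."""
    if len(text_rows) != len(lang_rows):
        raise ValueError("both feature blocks must have one row per sample")
    return [list(t) + list(l) for t, l in zip(text_rows, lang_rows)]


def split_features(rows, text_len):
    """Undo combine_features: split each row into text and language parts."""
    return [r[:text_len] for r in rows], [r[text_len:] for r in rows]


# Combined matrix has shape (n_samples, text_len + n_languages).
X = combine_features(padded_text, language_onehot)
print(len(X), len(X[0]))
```

With NumPy arrays the same column-wise concatenation can be done with `np.hstack`, and the resulting 2D matrix has the `(n_samples, n_features)` layout that a SMOTE implementation such as the one in imbalanced-learn expects. One caveat under these assumptions: SMOTE interpolates between samples, so synthetic rows may contain fractional token ids and fractional one-hot values. imbalanced-learn's `SMOTENC`, which treats selected columns as categorical, may be worth considering for that reason.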