SMOTE Oversampling for Text Classification with Multiple Input Features

Question

I have a text classification problem where the input has 2 features: a text and a language:

The text is a string variable. The language is a string variable that takes values such as "EN", "FR", "DE", etc. The output is an imbalanced categorical variable.

As in a typical NLP problem, the text feature was tokenized using the Keras tokenizer and then padded, while the language variable was one-hot encoded.

SMOTE is needed to perform oversampling, but the 2 input features (the padded sequences of the text and the encoded language) have different shapes. So how can I combine them to pass to SMOTE?

score 1 · Answer 1 · answered Jun 06 '24 at 15:33

It sounds like you have inputs like:

# Two observations, each with a 2x3 feature (f1) and a 2x2 feature (f2)
f1_obs1 <- matrix(seq(1, 6), 2, 3)
f1_obs2 <- matrix(seq(2, 7), 2, 3)
f2_obs1 <- matrix(seq(2, 5), 2, 2)
f2_obs2 <- matrix(seq(1, 4), 2, 2)

Feature 1 is a 2x3 matrix, while feature 2 is a 2x2 matrix. They are incompatible.

But they aren't!

What you really have is that feature 1 is six values and feature 2 is four values. Therefore, you have a ten-dimensional feature space.

# c() flattens each matrix to a vector, so every observation has 6 + 4 = 10 values
obs1 <- c(c(f1_obs1), c(f2_obs1))
obs2 <- c(c(f1_obs2), c(f2_obs2))

Then you can do whatever you want with this fixed ten-dimensional feature space, whether that is SMOTE or direct modeling. Since class imbalance turns out to be a lot less of a problem than is often portrayed, you probably don't need to run SMOTE at all. Either way, you will have to combine the features like this, or in some similar way that recognizes the ten-dimensional nature of the features, in order to do much of anything.
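The same idea could be sketched in Python for the setup described in the question. This is only a minimal illustration: the arrays `padded_text` and `lang_onehot` are made-up stand-ins for the padded Keras sequences and the one-hot language encoding, and their sizes (4 samples, sequence length 5, 3 languages) are assumptions.

```python
import numpy as np

# Hypothetical stand-ins for the asker's data (values are invented):
# 4 samples, text padded to length 5, language one-hot over EN/FR/DE.
padded_text = np.array([
    [0, 0, 12, 7, 3],
    [0, 4, 9, 1, 8],
    [0, 0, 0, 5, 2],
    [6, 2, 11, 4, 1],
])
lang_onehot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])

# Each row becomes one flat feature vector with 5 + 3 = 8 columns.
X = np.hstack([padded_text, lang_onehot])

# The two blocks can be recovered afterwards by column position.
n_text = padded_text.shape[1]
text_part, lang_part = X[:, :n_text], X[:, n_text:]
```

If imbalanced-learn is installed, `X` and the label vector could then be passed to something like `SMOTE().fit_resample(X, y)`, and the result split back into its two blocks by column as shown. Note that SMOTE interpolates between rows, so synthetic samples will generally not contain integer token IDs or exact one-hot vectors.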
