I have an imbalanced sample (850 in group X vs 100 in group Y). I am trying to predict group membership using support vector classification (SVC). I am using 'Adaptive Synthetic' (ADASYN) sampling to oversample the minority class. Nevertheless, the best model just assigns the majority group label to all subjects. Questions:

  1. Besides other oversampling techniques such as SMOTE, setting 'class_weight' to 'balanced', or using a different classifier that is better suited to imbalanced datasets such as RFC, are there other ways to address the imbalance (preferably ones suited for use with SVC)?

  2. Is labeling all subjects as the majority class an indication that my features have very limited predictive value?

Vincent

1 Answer


To answer your first question: you could try other resampling techniques (random oversampling, SMOTE-ENN, etc.) and see whether they help. This may not be the answer you want to hear, but SVM might not be the best model for this dataset. Since the dataset is small, logistic regression is worth trying, as are models that often cope better with imbalanced data, such as LightGBM, XGBoost, or decision trees. Neural networks are another option, though they may not do well on a dataset this small. Regularization could also help, for example by changing the "C" hyperparameter of the SVM or by trying L1 or L2 regularization.
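As a rough illustration of the random oversampling idea, here is a minimal sketch using only scikit-learn. The data is synthetic (`make_classification` stands in for your real features, with a class ratio close to 850 vs 100), and the variable names are just placeholders I made up. The important part is that resampling happens after the train/test split and only on the training set, so the test set keeps the original class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data (an assumption): about 850 vs 100 samples.
X, y = make_classification(
    n_samples=950, n_features=10, n_informative=4,
    weights=[850 / 950], random_state=0,
)

# Split first, so the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Randomly oversample the minority class (label 1) with replacement,
# in the training set only, until it matches the majority class size.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0,
)
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])

# L2-regularized logistic regression; C controls the regularization strength.
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_bal, y_bal)
score = balanced_accuracy_score(y_test, clf.predict(X_test))
print(f"balanced accuracy: {score:.3f}")
```

Balanced accuracy is used here instead of plain accuracy because a model that always predicts the majority class would score about 89% accuracy on data like yours but only 0.5 balanced accuracy.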

If you want to stick with SVM, try hyperparameter tuning (RandomizedSearchCV from sklearn is a good choice if you are using Python). Tuning the C hyperparameter in particular could help the SVM classify the data better. You could also experiment with different kernels (linear, RBF, etc.).
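A tuning setup along these lines might look like the sketch below. It is only an example under some assumptions: the data is synthetic, the search ranges for `C` and `gamma` are arbitrary starting points rather than recommended values, and I added `class_weight="balanced"` and balanced accuracy as the scoring metric because of the imbalance. The scaler sits inside the pipeline so it is refit on each training fold.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (an assumption): about 850 vs 100 samples.
X, y = make_classification(
    n_samples=950, n_features=10, n_informative=4,
    weights=[850 / 950], random_state=0,
)

# Scaling inside the pipeline avoids leaking test-fold statistics into training.
pipe = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))

# Illustrative search space; gamma is ignored when the linear kernel is sampled.
param_distributions = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-4, 1e0),
    "svc__kernel": ["linear", "rbf"],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=15,
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV balanced accuracy: {search.best_score_:.3f}")
```

If the best cross-validated balanced accuracy stays close to 0.5 no matter which settings are sampled, that points more toward weak features than toward a badly tuned SVM.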

As for your second question: a model that predicts the majority class for the entire test set may not be training well, that is, it is not really "learning". This could mean that some of your features have little predictive value; looking at feature importance would help you check. Preprocessing might also help if you are not doing it already (scaling numerical features, encoding categorical ones, imputing missing data, etc.).
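An SVC with a non-linear kernel does not expose feature importances directly, so one way to look at them is permutation importance, which works with any fitted model. The sketch below is only an illustration on synthetic data (an assumption, not your dataset): it shuffles each feature on held-out data and measures how much the balanced accuracy drops.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (an assumption): about 850 vs 100 samples.
X, y = make_classification(
    n_samples=950, n_features=10, n_informative=4,
    weights=[850 / 950], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scaling plus a class-weighted SVC, fitted on the training split only.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the score drop.
result = permutation_importance(
    model, X_test, y_test,
    scoring="balanced_accuracy", n_repeats=10, random_state=0,
)

# Print features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.3f} +/- {std:.3f}")
```

Features whose importance is close to zero (or within one standard deviation of zero) contribute little to this particular model; if that holds for every feature, limited predictive value becomes the more likely explanation.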

Hope this helps!

user167433