
I am applying both class_weight and oversampling (SMOTE) techniques to a multiclass classification problem and getting better results with class_weight. Could someone please explain what might cause this difference?

Sarah

2 Answers


You should not expect the class_weight parameter and SMOTE to give exactly the same results, because they are different methods.

Class weights directly modify the loss function by penalizing errors on each class in proportion to its weight. In effect, you sacrifice some ability to predict the lower-weight class (the majority class in an imbalanced dataset) by deliberately biasing the model toward more accurate predictions of the higher-weight class (the minority class).
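To make that concrete, here is a minimal numpy sketch (not any library's actual loss) of how a class weight scales the per-sample penalty in a cross-entropy loss; the data and weights are made up for illustration:

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, class_weight):
    """Per-sample cross-entropy, scaled by the weight of each sample's class."""
    w = np.array([class_weight[c] for c in y_true])
    p = np.where(y_true == 1, p_pred, 1.0 - p_pred)  # prob. assigned to true class
    return np.mean(w * -np.log(p))

y = np.array([0, 0, 0, 0, 1])            # imbalanced: one minority sample
p = np.array([0.9, 0.8, 0.9, 0.7, 0.4])  # model is least confident on the minority sample

unweighted = weighted_log_loss(y, p, {0: 1.0, 1: 1.0})
upweighted = weighted_log_loss(y, p, {0: 1.0, 1: 4.0})
# Raising class 1's weight makes its single mistake dominate the loss,
# so the optimizer is pushed to fix minority-class errors first.
```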

Oversampling and undersampling methods effectively reweight classes as well: duplicating an observation duplicates its penalty in the loss, giving it more influence on the model fit. However, because of the data splitting that typically takes place during training, these methods will also yield slightly different results.
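The equivalence between duplicating an observation and weighting its loss term can be checked directly; a toy sketch with made-up per-sample losses:

```python
import numpy as np

losses = np.array([0.10, 0.22, 0.10, 0.36, 0.92])  # per-sample losses (illustrative)
y = np.array([0, 0, 0, 0, 1])                      # last sample is the minority class

# Weighting: multiply the minority sample's loss by 3
weighted_total = np.sum(np.where(y == 1, 3.0, 1.0) * losses)

# Oversampling: duplicate the minority sample twice more, then sum plain losses
dup = np.concatenate([losses, losses[y == 1], losses[y == 1]])
oversampled_total = np.sum(dup)

# Duplicating an observation k times contributes the same total loss
# as weighting it by k — the difference in practice comes from shuffling,
# batching, and train/validation splits seeing the copies separately.
```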

SMOTE creates new observations of the minority class by randomly sampling from a set of "similar" minority-class observations. Each synthetic observation is computed, coordinate by coordinate (i.e. column by column), by adding a random fraction of the difference between two randomly chosen "similar" observations. "Similar" observations are typically defined as the k nearest neighbours of a particular minority-class observation. This means that, depending on the chosen value of k, as well as numerous other factors such as how similar your observations are in general, the distance measure used, and so on, SMOTE may or may not be useful for your particular problem.
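The interpolation step described above can be sketched in a few lines of numpy (a simplified illustration of the idea, not the imbalanced-learn implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(X_min, i, k=3):
    """Synthesize one sample from minority point i and one of its k nearest neighbours."""
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbours = np.argsort(d)[1:k + 1]      # skip the point itself (distance 0)
    j = rng.choice(neighbours)               # pick one neighbour at random
    gap = rng.random()                       # random fraction in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])  # interpolate per coordinate

X_min = rng.normal(size=(10, 2))             # toy minority-class points
x_new = smote_sample(X_min, i=0)
# x_new lies on the segment between X_min[0] and one of its 3 nearest neighbours
```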

Methods that deal with class imbalance do not all work the same way, and this is a large area of study. You have noticed that the class-weight method is more effective here, which is exactly why any method chosen to address a potential class-imbalance problem needs to be wrapped within a model validation scheme, to see whether that particular method (or, really, using any method at all) yields any benefit.
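One way to wrap the comparison in a validation scheme, sketched in plain numpy with a deliberately trivial nearest-centroid classifier (all names and data here are made up for illustration); the important detail is that resampling is applied only to the training fold, never the validation fold:

```python
import numpy as np

rng = np.random.default_rng(0)

def kfold_indices(n, k=5):
    """Shuffled k-fold split: yields (train_idx, valid_idx) pairs."""
    idx = rng.permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold

def oversample(X, y):
    """Naive random oversampling of the minority class up to majority size."""
    counts = np.bincount(y)
    minority = np.argmin(counts)
    extra = rng.choice(np.where(y == minority)[0], counts.max() - counts.min())
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def centroid_accuracy(X_tr, y_tr, X_va, y_va):
    """Toy nearest-centroid classifier, scored on the validation fold."""
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    pred = np.argmin(np.linalg.norm(X_va[:, None] - centroids, axis=2), axis=1)
    return np.mean(pred == y_va)

# Toy imbalanced data: 90 majority points, 10 minority points
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

for name, strat in [("none", lambda X, y: (X, y)), ("oversample", oversample)]:
    scores = [centroid_accuracy(*strat(X[tr], y[tr]), X[va], y[va])
              for tr, va in kfold_indices(len(y))]
    print(name, round(float(np.mean(scores)), 3))
```

Whichever strategy (including doing nothing) scores best under cross-validation is the one worth keeping for that dataset.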

aranglol

Probably not the answer you're looking for, but don't go crazy! Different class weight strategies give different results.

The following almost drove me crazy! You might expect these two settings to give the same results, but they don't.

class_weight = "balanced"

class_weight={0:0.85, 1:0.15}

I learned to live with it ...
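For what it's worth, one likely source of the discrepancy: in scikit-learn, class_weight="balanced" computes weights inversely proportional to class frequencies, as n_samples / (n_classes * bincount(y)). A quick numpy check of that formula (the 85/15 split is a hypothetical example matching the manual weights above):

```python
import numpy as np

y = np.array([0] * 85 + [1] * 15)   # 85% majority, 15% minority

# scikit-learn's documented formula for class_weight="balanced":
#   weight_c = n_samples / (n_classes * count_c)
counts = np.bincount(y)
balanced = len(y) / (len(counts) * counts)   # class 0 -> ~0.588, class 1 -> ~3.333

# "balanced" upweights the MINORITY class, while the manual dict
# {0: 0.85, 1: 0.15} upweights the MAJORITY class, so the two settings
# pull the model in opposite directions and cannot give the same results.
```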

FrancoSwiss