
This question is similar to, but different from, my previous one. I have a binary classification task related to customer churn for a bank. The dataset contains 10,000 instances and 11 features. The target variable is imbalanced (80% remained customers (0), 20% churned (1)).

Initially, I followed this approach: I first split the dataset into training and test sets, while preserving the 80-20 ratio for the target variable in both sets. I keep 8,000 instances in the training set and 2,000 in the test set. After pre-processing, I address the class imbalance in the training set with SMOTEENN:

from imblearn.combine import SMOTEENN

smt = SMOTEENN(random_state=random_state)
# fit_resample is the current imbalanced-learn API (fit_sample has been removed)
X_train, y_train = smt.fit_resample(X_train, y_train)
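
Putting the split and the resampling together, a rough sketch of this stage could look as follows, assuming the features and target are held in variables X and y (those names, and the Counter check, are mine rather than from the original code):

from collections import Counter

from imblearn.combine import SMOTEENN
from sklearn.model_selection import train_test_split

# stratified split keeps the 80-20 class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=random_state)

# resample only the training set, then inspect the resulting class counts
X_train, y_train = SMOTEENN(random_state=random_state).fit_resample(X_train, y_train)
print(Counter(y_train))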

Now, my training set has 4774 1s and 4182 0s. I now proceed to building ML models. I use scikit-learn’s GridSearchCV with cv = KFold(n_splits=5, shuffle=True, random_state=random_state) and optimise based on the recall score. For instance, for a Random Forest Classifier:

cv = KFold(n_splits=5, shuffle=True, random_state=random_state)
scoring_metric = 'recall'

rf = RandomForestClassifier(random_state=random_state)
param_grid = {
    'n_estimators': [100],
    'criterion': ['entropy', 'gini'],
    'bootstrap': [True, False],
    'max_depth': [6],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [2, 3, 5],
    'min_samples_split': [2, 3, 5]
}

rf_clf = GridSearchCV(estimator=rf, param_grid=param_grid, scoring=scoring_metric, cv=cv, verbose=False, n_jobs=-1)

best_rf_clf = rf_clf.fit(X_train, y_train)

y_pred = cross_val_predict(best_rf_clf.best_estimator_, X_train, y_train, cv=cv)
print('Train: ', np.round(recall_score(y_train, y_pred), 3))

y_pred = best_rf_clf.best_estimator_.fit(X_train, y_train).predict(X_test)
print(' Test: ', np.round(recall_score(y_test, y_pred), 3))

My recall CV score on the training set is 0.902, while the score on the test set is 0.794.

However, when I apply SMOTEENN on the full dataset and then split it into training and test sets, I get a recall CV score of 0.913 on the training set and 0.898 on the test set.

How can we explain this difference between the two approaches? What causes this gap between the two sets in the first approach (split, then SMOTEENN) compared to the second one (SMOTEENN, then split)? My guess is that the second approach leads to a more balanced test set (1220 1s, 1036 0s), compared to the first one (393 1s, 1607 0s). Thanks!

KK_o7

2 Answers


Essentially, applying SMOTE makes the job easier for the model: SMOTE generates artificial instances which tend to share the same properties, so it's easier for the model to capture their patterns. However, these instances are rarely a representative sample of the minority class, so there's a higher risk that the model overfits.

Of course, if SMOTE is also applied to the test set, the model appears to perform better. This is the equivalent of swapping a difficult question for an easier one in order to answer it better.
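
For what it's worth, a small sketch of an honest evaluation, reusing the question's variable names and assuming X_test and y_test come from the original, never-resampled data:

from sklearn.metrics import classification_report, confusion_matrix

# evaluate only on the untouched test set, which keeps the real 80-20 distribution
y_pred = best_rf_clf.best_estimator_.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))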

Resampling methods are rarely a good solution to the imbalance problem. It's important to understand that imbalanced data is a problem only because the minority class in the training set is not representative enough and/or the features are not good enough indicators of the label. The ideal scenario is to solve these two problems; the model can then perform perfectly well despite the imbalance.

Erwan

You must apply SMOTE after splitting into training and test, not before. Doing SMOTE before is bogus and defeats the purpose of having a separate test set.

At a really crude level, SMOTE essentially duplicates some samples (this is a simplification, but it will give you a reasonable intuition). If you duplicate every sample ten times, and then split into train and test, then about 5 copies of each sample will appear in the training set and about 5 copies in the test set. Effectively, the test set becomes more or less identical to the training set. A classifier that just memorizes the training set (and overfits to it massively) will also do very well on the test set -- but it would perform terribly in practice.
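
To see this concretely, here is a tiny toy sketch (made-up data, plain duplication instead of SMOTE):

import numpy as np
from sklearn.model_selection import train_test_split

# 100 "original" samples, each duplicated 10 times before splitting
originals = np.arange(100)
copies = np.repeat(originals, 10)

copies_train, copies_test = train_test_split(copies, test_size=0.2, random_state=0)

# fraction of test copies whose original sample also sits in the training set
print(np.isin(copies_test, copies_train).mean())  # close to 1.0, i.e. near-total overlap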

So duplicating samples before splitting into training/test is effectively evaluating your classifier on the training set, which we know is a biased measure of its performance. Doing SMOTE before splitting will have similar bad properties.
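
One practical way to enforce this ordering, including inside cross-validation, is to put the sampler into an imbalanced-learn pipeline, so that resampling is applied only to the training folds. A rough sketch reusing the question's names (random_state, X_train, y_train); the step names and the reduced grid are just for illustration:

from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

pipe = Pipeline([
    ('resample', SMOTEENN(random_state=random_state)),
    ('rf', RandomForestClassifier(random_state=random_state)),
])

# grid parameters are prefixed with the pipeline step name
param_grid = {'rf__max_depth': [6], 'rf__min_samples_leaf': [2, 3, 5]}

cv = KFold(n_splits=5, shuffle=True, random_state=random_state)
clf = GridSearchCV(pipe, param_grid, scoring='recall', cv=cv, n_jobs=-1)
clf.fit(X_train, y_train)  # SMOTEENN is applied to each training fold only, never to validation or test data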

D.W.