6

I asked this question on the Statistics SE, but there were no answers, even when a modest bonus was available, so I am asking here to see if any examples can be given.

I have been looking into the imbalanced learning problem, where a classifier is often expected to be unduly biased in favour of the majority class. However, I am having difficulty identifying datasets where class imbalance is genuinely a problem and, where it actually is a problem, showing that it can be fixed by re-sampling (e.g. SMOTE) or re-weighting the data.

Can anyone give reproducible examples of real-world (preferably not synthetic) datasets where re-sampling or re-weighting can be used to improve the accuracy (or equivalently misclassification error rate) for some particular classifier system (when applied in accordance with best practice)? This must be an improvement in accuracy on the original data distribution, not the resampled one, as that reflects operational conditions where the classifier will be deployed.
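To be concrete, the kind of comparison I have in mind is sketched below (a minimal sketch only, assuming scikit-learn and imbalanced-learn are available, with make_classification standing in for whatever real-world dataset is proposed): train the same classifier with and without resampling, and compare accuracy on a held-out test set that keeps the original class frequencies.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data only; a real answer would substitute a real-world dataset here.
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)

# The test set keeps the original class frequencies (operational conditions).
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: fit on the original training data.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc_baseline = accuracy_score(y_test, baseline.predict(X_test))

# Resampled: fit the same classifier on SMOTE-resampled training data,
# but evaluate it on the untouched test set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
resampled = LogisticRegression(max_iter=1000).fit(X_res, y_res)
acc_resampled = accuracy_score(y_test, resampled.predict(X_test))

# What I am asking for is a real dataset where acc_resampled > acc_baseline.
print(acc_baseline, acc_resampled)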

I am only interested in accuracy as the performance measure. There are some tasks where accuracy is the quantity of interest in the application, so I would appreciate it if there were no digressions onto the topic of proper scoring rules, or other performance measures.

It is not an example of the class imbalance problem if the operational class frequencies are different to those in the training set or the misclassification costs are not equal. Cost-sensitive learning is a different issue.

UPDATE: While the answer that received the bounty was not ideal (as it didn't appear to apply the classifier in accordance with best practice), I may well give a new bounty to answers that more fully address the question.

Dikran Marsupial

3 Answers

3

In my experience with real-world data, I have never seen a domain in which resampling techniques consistently improve a model's performance. Remember that one of the main assumptions of learning is that the training data are generated by the same system as the test data, i.e. they share the same distribution, and this no longer holds once you resample.
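As a quick illustration of that shift (a sketch only, assuming integer class labels and the X_train, y_train split from the example below):

import numpy as np
from imblearn.over_sampling import SMOTE

# The class prior in the resampled training data no longer matches the prior
# the classifier will face at test time.
print(np.bincount(y_train) / len(y_train))    # original, imbalanced prior
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(np.bincount(y_res) / len(y_res))        # roughly 50/50 after SMOTE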

Instead, I would go for cost-sensitive learning, so that errors on the minority class are penalised more heavily.
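For instance, with the LightGBM classifier used below, a minimal cost-sensitive sketch is to re-weight the classes at fit time instead of altering the data (class_weight='balanced' weights each class inversely to its frequency; an explicit dict encodes specific misclassification costs):

from lightgbm import LGBMClassifier

# Cost-sensitive alternative: leave the data untouched and weight errors on
# the minority class more heavily. A dict such as {0: 1, 1: 10} would encode
# explicit costs instead of frequency-based weights.
clf = LGBMClassifier(class_weight='balanced', random_state=42)
# clf.fit(X_train, y_train)  # fit on the original, un-resampled training data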

I'm sharing an example of a dataset on which SMOTE showed a slight increase across different metrics.

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
import urllib.request

# Load the dataset from a URL
url = 'https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv'
filename = 'creditcard.csv'
urllib.request.urlretrieve(url, filename)

# Load the dataset into a Pandas dataframe
df = pd.read_csv(filename)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('Class', axis=1), df['Class'], test_size=0.3, random_state=42)

# Perform SMOTE oversampling on the training set
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train a classifier on the resampled training set
rfc = LGBMClassifier(random_state=42).fit(X_train_resampled, y_train_resampled)

# Evaluate the classifier on the original testing set
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred))

# Train a classifier on the original training set
rfc = LGBMClassifier(random_state=42).fit(X_train, y_train)

# Evaluate the classifier on the original testing set
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred))

Hope it helps!

Multivac
0

A well-known example is the Breast Cancer Wisconsin Data Set, with a target variable imbalance of 63%/37% (in the version published on Kaggle). There is a plethora of research out there which uses things such as SMOTE to improve accuracy.

There are also a lot of Kaggle notebooks which do the same, which you could easily run yourself. Just looking through a couple, this notebook is an example that shows how SMOTE improves the accuracy of XGBoost on this dataset. I have not verified the quality of this notebook.

Note that you stated you wanted a dataset "where it is actually a problem". This is highly subjective. For this dataset, with the notebook linked, an accuracy increase of 3% seems significant enough for a healthcare application to count as showing that class imbalance is a problem. However, this is completely arbitrary.
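For reference, the general pattern those notebooks follow is roughly the sketch below (untested on my side; it assumes xgboost and imbalanced-learn are installed, and uses sklearn's load_breast_cancer, which is the same Wisconsin Diagnostic data; whether the SMOTE run actually comes out ahead will depend on the split and settings):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# The Wisconsin Diagnostic Breast Cancer data (roughly 63%/37% class split).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Baseline XGBoost on the original training data.
base = XGBClassifier(random_state=42).fit(X_train, y_train)
print("accuracy without SMOTE:", accuracy_score(y_test, base.predict(X_test)))

# Same model trained on SMOTE-oversampled data, evaluated on the untouched test set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
smoted = XGBClassifier(random_state=42).fit(X_res, y_res)
print("accuracy with SMOTE:", accuracy_score(y_test, smoted.predict(X_test)))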

0

One example of real-world imbalanced data is credit card fraud.

Here is code showing empirically better performance for SMOTE:

import imblearn
from imblearn.pipeline import make_pipeline
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Load data and split
data = pd.read_csv("creditcard.csv", header=1).values
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Without SMOTE
lr = LogisticRegression(solver='liblinear', class_weight=None)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))

The balanced accuracy for non-SMOTE is ~ 0.774.

# With SMOTE
pipe = make_pipeline(imblearn.over_sampling.SMOTE(),
                     LogisticRegression(solver='liblinear', class_weight=None))
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))

The balanced accuracy for SMOTE is ~0.935.

Addendum:

It is often possible to maximize regular accuracy for an imbalanced dataset by always predicting the majority class (without fitting a machine learning model or resampling).

import numpy as np
from sklearn.metrics import accuracy_score

# Always predicting majority class
y_pred = np.zeros(len(y_test))
print(accuracy_score(y_test, y_pred))

The regular accuracy for always predicting the majority class is ~0.998.

Ben Reiniger
Brian Spiering