6

I asked this question on the Statistics SE, but there were no answers, even when a modest bonus was available, so I am asking here to see if any examples can be given.

I have been looking into the imbalanced learning problem, where a classifier is often expected to be unduly biased in favour of the majority class. However, I am having difficulty identifying datasets where class imbalance is genuinely a problem and, where it actually is a problem, showing that it can be fixed by re-sampling (e.g. SMOTE) or re-weighting the data.

Can anyone give reproducible examples of real-world (preferably not synthetic) datasets where re-sampling or re-weighting can be used to improve the accuracy (or equivalently misclassification error rate) for some particular classifier system (when applied in accordance with best practice)? This must be an improvement in accuracy on the original data distribution, not the resampled one, as that reflects operational conditions where the classifier will be deployed.
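To be concrete, the kind of comparison I have in mind is sketched below (a minimal sketch only, assuming scikit-learn and imbalanced-learn are available, with make_classification standing in for whatever real-world dataset is proposed): train the same classifier with and without resampling, and compare accuracy on a held-out test set that keeps the original class frequencies.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data only; a real answer would substitute a real-world dataset here.
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)

# The test set keeps the original class frequencies (operational conditions).
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: fit on the original training data.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc_baseline = accuracy_score(y_test, baseline.predict(X_test))

# Resampled: fit the same classifier on SMOTE-resampled training data,
# but evaluate it on the untouched test set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
resampled = LogisticRegression(max_iter=1000).fit(X_res, y_res)
acc_resampled = accuracy_score(y_test, resampled.predict(X_test))

# What I am asking for is a real dataset where acc_resampled > acc_baseline.
print(acc_baseline, acc_resampled)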

I am only interested in accuracy as the performance measure. There are some tasks where accuracy is the quantity of interest in the application, so I would appreciate it if there were no digressions onto the topic of proper scoring rules, or other performance measures.

It is not an example of the class imbalance problem if the operational class frequencies are different to those in the training set or the misclassification costs are not equal. Cost-sensitive learning is a different issue.

UPDATE: While the answer that received the bounty was not ideal (as it didn't appear to apply the classifier in accordance with best practice), I may well give a new bounty to answers that more fully address the question.

Dikran Marsupial

3 Answers

3

In my experience with real-world data, I have never seen a domain in which resampling techniques consistently improve a model's performance. Remember that one of the main assumptions of learning is that the training data are generated by the same system as the test data, i.e. they share the same distribution, and this no longer holds once you resample.
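As a quick illustration of that shift (a sketch only, assuming integer class labels and the X_train, y_train split from the example below):

import numpy as np
from imblearn.over_sampling import SMOTE

# The class prior in the resampled training data no longer matches the prior
# the classifier will face at test time.
print(np.bincount(y_train) / len(y_train))    # original, imbalanced prior
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(np.bincount(y_res) / len(y_res))        # roughly 50/50 after SMOTE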

Instead, I would go for cost-sensitive learning, so that errors on the minority class are penalised more heavily.
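For instance, with the LightGBM classifier used below, a minimal cost-sensitive sketch is to re-weight the classes at fit time instead of altering the data (class_weight='balanced' weights each class inversely to its frequency; an explicit dict encodes specific misclassification costs):

from lightgbm import LGBMClassifier

# Cost-sensitive alternative: leave the data untouched and weight errors on
# the minority class more heavily. A dict such as {0: 1, 1: 10} would encode
# explicit costs instead of frequency-based weights.
clf = LGBMClassifier(class_weight='balanced', random_state=42)
# clf.fit(X_train, y_train)  # fit on the original, un-resampled training data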

I'm sharing an example of a dataset on which SMOTE showed a slight increase across different metrics.

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
import urllib.request

# Load the dataset from a URL
url = 'https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv'
filename = 'creditcard.csv'
urllib.request.urlretrieve(url, filename)

# Load the dataset into a Pandas dataframe
df = pd.read_csv(filename)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('Class', axis=1), df['Class'], test_size=0.3, random_state=42)

# Perform SMOTE oversampling on the training set
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train a classifier on the resampled training set
rfc = LGBMClassifier(random_state=42).fit(X_train_resampled, y_train_resampled)

# Evaluate the classifier on the original testing set
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred))

# Train a classifier on the original training set
rfc = LGBMClassifier(random_state=42).fit(X_train, y_train)

# Evaluate the classifier on the original testing set
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred))

Hope it helps!

Multivac
0

A well-known example is the Breast Cancer Wisconsin Data Set, with a target variable imbalance of 63%/37% (in the version published on Kaggle). There is a plethora of research out there which uses things such as SMOTE to improve accuracy.

There are also a lot of Kaggle notebooks which do the same, which you could easily run yourself. Just looking through a couple, this notebook is an example that shows how SMOTE improves the accuracy of XGBoost on this dataset. I have not verified the quality of this notebook.

Note that you stated you wanted a dataset "where it is actually a problem". This is highly subjective. For this dataset, with the notebook linked, an accuracy increase of 3% seems significant enough for a healthcare application to count as showing that class imbalance is a problem. However, this is completely arbitrary.
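For reference, the general pattern those notebooks follow is roughly the sketch below (untested on my side; it assumes xgboost and imbalanced-learn are installed, and uses sklearn's load_breast_cancer, which is the same Wisconsin Diagnostic data; whether the SMOTE run actually comes out ahead will depend on the split and settings):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# The Wisconsin Diagnostic Breast Cancer data (roughly 63%/37% class split).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Baseline XGBoost on the original training data.
base = XGBClassifier(random_state=42).fit(X_train, y_train)
print("accuracy without SMOTE:", accuracy_score(y_test, base.predict(X_test)))

# Same model trained on SMOTE-oversampled data, evaluated on the untouched test set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
smoted = XGBClassifier(random_state=42).fit(X_res, y_res)
print("accuracy with SMOTE:", accuracy_score(y_test, smoted.predict(X_test)))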

0

One example of real-world imbalanced data is credit card fraud.

Here is code showing empirically better performance for SMOTE:

import imblearn
from imblearn.pipeline import make_pipeline
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Load data and split
data = pd.read_csv("creditcard.csv", header=1).values
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Without SMOTE
lr = LogisticRegression(solver='liblinear', class_weight=None)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))

The balanced accuracy for non-SMOTE is ~ 0.774.

# With SMOTE
pipe = make_pipeline(imblearn.over_sampling.SMOTE(),
                     LogisticRegression(solver='liblinear', class_weight=None))
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))

The balanced accuracy for SMOTE is ~0.935.

Addendum:

It is often possible to maximize regular accuracy for an imbalanced dataset by always predicting the majority class (without fitting a machine learning model or resampling).

import numpy as np
from sklearn.metrics import accuracy_score

# Always predicting majority class
y_pred = np.zeros(len(y_test))
print(accuracy_score(y_test, y_pred))

The regular accuracy for always predicting the majority class is ~0.998.

Ben Reiniger
Brian Spiering