Questions tagged [imbalanced-data]

36 questions
8
votes
1 answer

Categorization of approaches to deal with imbalanced classes

What is the best way to categorize the approaches that have been developed to deal with the class imbalance problem? This article categorizes them into: Preprocessing: includes oversampling, undersampling and hybrid methods, Cost-sensitive learning:…
7
votes
2 answers

Doesn't over(/under)sampling an imbalanced dataset cause issues?

I'm reading a lot about how to use different metrics specifically for imbalanced datasets (e.g. two classes present, but 80% of the data is one class) and how to tackle the issue of imbalanced datasets. One trick is to oversample, so to take more…
lte__
  • 1,379
  • 5
  • 19
  • 29
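For readers unfamiliar with the oversampling trick mentioned in the question above, here is a minimal sketch of random oversampling in plain Python. The function name `random_oversample`, the seed, and the toy data are illustrative assumptions, not part of the original question or of any library.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Rows that currently belong to this class.
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Append random duplicates until the class reaches the target size.
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data with the 80/20 split from the question (values are made up).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Because the added rows are exact duplicates, resampling is normally applied to the training split only; otherwise copies of the same row can end up in both train and test sets.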
4
votes
2 answers

Flipping the labels in a binary classification gives different model and results

I have an imbalanced dataset and I want to train a binary classifier to model the dataset. Here is my approach, which resulted in (relatively) acceptable performance: 1- I made a random split to get train/test sets. 2- In the training set, I…
4
votes
1 answer

How to tackle imbalanced regression?

I've recently encountered a problem where I want to fit a regression model on data whose target variable is about 75% zeroes, with the rest being continuous values. This makes it a regression problem; however, the non-zero values also have a very…
lte__
  • 1,379
  • 5
  • 19
  • 29
3
votes
1 answer

Is it sensible to use the ROC curve with a KNN model? And if so, why?

I am a beginner doing my first ML project. I am doing a binary supervised classification on an unbalanced dataset and want to use the ROC curve as a performance metric of my models. I am using Logistic Regression, Support Vector Machine and K…
Ludger
  • 33
  • 1
  • 3
3
votes
1 answer

Why is Data with an Overrepresented Class called Imbalanced not Unbalanced?

I've seen the term Imbalanced used to describe data that has an over-representation of one class. What's the reasoning behind naming this type of data Imbalanced as opposed to Unbalanced, which seems to fit the intended meaning perfectly already?
Connor
  • 701
  • 6
  • 24
2
votes
1 answer

How to increase the accuracy of an imbalanced dataset (not precision)?

There's an imbalanced dataset in a Kaggle competition I'm trying. The target variable of the dataset is binary and it is biased towards 0 (0: 70%, 1: 30%). I tried several machine learning algorithms like Logistic Regression, Random Forest, Decision…
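As a reference point for the 70/30 split described above, the sketch below computes the accuracy of a trivial model that always predicts the most frequent class. The helper name `majority_baseline_accuracy` and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent label
    (hypothetical helper)."""
    counts = Counter(y)
    return max(counts.values()) / len(y)


# Made-up labels matching the 70% / 30% class split from the question.
y = [0] * 70 + [1] * 30
baseline = majority_baseline_accuracy(y)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

Any model scoring near this baseline has learned little beyond the class prior, which is why accuracy alone can be misleading on imbalanced data.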
2
votes
1 answer

At what threshold should we resample the data?

When working with churn datasets, we usually find imbalanced datasets. My question is how to decide on what basis we should resample the data. For example: while splitting the data before training we split in train and test on the threshold (70-30…
2
votes
0 answers

Is balancing class data for imbalanced problems helpful or just folklore when considering thresholds?

Caveat: I'm aware that imbalanced data questions are a dead horse, but I haven't found an answer to this flavor of it directly. When working with highly imbalanced data (e.g. binary class cases), the common wisdom is to try training on an…
Josh
  • 141
  • 3
2
votes
1 answer

Can the attention mask hold values between 0 and 1?

I am new to attention-based models and wanted to understand more about the attention mask in NLP models. attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if…
neel g
  • 227
  • 1
  • 5
  • 11
2
votes
3 answers

Measuring performance of customer purchase predictions

My goal is to develop a model that predicts next customer purchases in USD (Update: During the time period of the dataset, if no purchase was made by the customer, the next purchase label is set to zero). I am trying to determine what would be the…
2
votes
2 answers

Why does class_weight usually outperform SMOTE?

I'm trying to figure out what exactly class_weight from sklearn does. When working with imbalanced datasets, I'm always using class_weight because the results are usually better than using SMOTE. However, I'm not sure why. I've tried to find an…
dsbr__0
  • 191
  • 1
  • 5
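To make the class_weight side of the question above concrete: scikit-learn documents its `class_weight="balanced"` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that heuristic in plain Python; the function name `balanced_class_weights` and the 90/10 toy labels are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(y):
    """Per-class weights following the 'balanced' heuristic:
    n_samples / (n_classes * count_of_class). Hypothetical helper."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Made-up labels with a 90/10 imbalance.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)
```

Unlike SMOTE, which adds synthetic rows to the dataset, this approach leaves the data unchanged and only rescales how much each sample's error contributes to the loss.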
2
votes
2 answers

Influence of imbalanced feature on prediction

I want to use XGB regression. The dataframe is conceptually similar to this table: index feature 1 feature 2 feature 3 encoded_1 encoded_2 encoded_3 y 0 0.213 0.542 0.125 0 0 1 0.432 1 …
Reut
  • 299
  • 3
  • 15
2
votes
2 answers

Determining if a dataset is balanced

I'm learning about training sets and I have been provided with a set of labelled customer data that segments customers into one of two classes: A or B. The dataset also contains gender, age and profession attributes for each customer. The…
user166673
  • 23
  • 3
1
vote
1 answer

Dealing with high frequency tokens during masked Language modelling?

Suppose I am working with a Masked Language Model to pre-train on a specific dataset. In that dataset, most sequences have a particular token with a high frequency. Sample sequence: , , , , , ---> here tok4 is very…
neel g
  • 227
  • 1
  • 5
  • 11