Highest Voted 'imbalanced-data' Questions - Data Science Stack Exchange

8

votes

1 answer

Categorization of approaches to deal with imbalanced classes

What is the best way to categorize the approaches that have been developed to deal with the class imbalance problem? This article categorizes them into: Preprocessing: includes oversampling, undersampling and hybrid methods; Cost-sensitive learning:…

asked Jun 08 '18 at 05:10

ebrahimi

7

votes

2 answers

Doesn't over(/under)sampling an imbalanced dataset cause issues?

I'm reading a lot about how to use different metrics specifically for imbalanced datasets (e.g. two classes present, but 80% of the data is one class) and how to tackle the issue of imbalanced datasets. One trick is to oversample, so to take more…
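A minimal sketch of the random oversampling the question alludes to, in plain Python. The helper name `random_oversample` and the 80/20 toy data are assumptions made for illustration; they do not come from the question or from any library.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Append random duplicates until the class reaches the target size.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (assumed): 80 rows of class 0 and 20 rows of class 1.
rows = list(range(100))
labels = [0] * 80 + [1] * 20
balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)
```

Because the added rows are exact duplicates, a sketch like this would normally be applied to the training split only, so that copies of the same row cannot end up in both the training and the test data.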

classification class-imbalance imbalanced-data

asked Apr 29 '21 at 13:59

lte__

4

votes

2 answers

Flipping the labels in a binary classification gives different model and results

I have an imbalanced dataset and I want to train a binary classifier to model the dataset. Here is my approach, which resulted in (relatively) acceptable performance: 1. I made a random split to get train/test sets. 2. In the training set, I…

python classification scikit-learn class-imbalance imbalanced-data

asked Nov 03 '22 at 14:38

Farzad

4

votes

1 answer

How to tackle imbalanced regression?

I've recently encountered a problem where I want to fit a regression model on data whose target variable is about 75% zeroes, and the rest is a continuous variable. This makes it a regression problem, however, the non-zero values also have a very…

regression imbalanced-data

asked Apr 22 '22 at 13:14

lte__

3

votes

1 answer

Is it sensible to use the ROC curve with a KNN model? And if so, why?

I am a beginner doing my first ML project. I am doing a binary supervised classification on an unbalanced dataset and want to use the ROC curve as a performance metric of my models. I am using Logistic Regression, Support Vector Machine and K…

svm k-nn roc imbalanced-data

asked Dec 15 '22 at 10:15

Ludger

3

votes

1 answer

Why is Data with an Overrepresented Class called Imbalanced, not Unbalanced?

I've seen the term Imbalanced used to describe data that has an over-representation of one class. What's the reasoning behind naming this type of data Imbalanced as opposed to Unbalanced, which seems to fit the intended meaning perfectly already?

class-imbalance terminology imbalanced-data

asked Nov 09 '22 at 21:13

Connor

2

votes

1 answer

How to increase the accuracy of an imbalanced dataset (not precision)?

There's an imbalanced dataset in a Kaggle competition I'm trying. The target variable of the dataset is binary and it is biased towards 0 (0: 70%, 1: 30%). I tried several machine learning algorithms like Logistic Regression, Random Forest, Decision…
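As a rough illustration of why raw accuracy needs a reference point on a 70/30 target, the sketch below computes the accuracy of always predicting the most frequent class. The function name `majority_baseline_accuracy` and the 100-row label list are assumptions for illustration, not part of the question.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class
    (hypothetical helper)."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Assumed toy labels matching the 70% / 30% split described in the question.
labels = [0] * 70 + [1] * 30
baseline = majority_baseline_accuracy(labels)
```

Under these assumptions, any model would need to exceed this baseline before its accuracy says anything about what it has learned.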

dataset visualization preprocessing imbalanced-data

asked Jul 23 '21 at 05:11

section117

2

votes

1 answer

At what threshold should we resample the data?

When working with churn datasets, we usually find imbalanced datasets. My question is how to decide on what basis we should resample the data. For example: while splitting the data before training we split into train and test on the threshold (70-30…

class-imbalance imbalanced-data

asked May 28 '21 at 19:14

Jaya Raghavendra

2

votes

0 answers

Is balancing class data for imbalanced problems helpful or just folklore when considering thresholds?

Caveat: I'm aware that imbalanced data questions are a dead horse, but I haven't found an answer to this flavor of it directly. When working with highly imbalanced data (e.g. binary class cases), the common wisdom is to try training on an…

class-imbalance imbalanced-data

asked May 19 '21 at 02:11

Josh

2

votes

1 answer

Can the attention mask hold values between 0 and 1?

I am new to attention-based models and wanted to understand more about the attention mask in NLP models. attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if…

machine-learning nlp attention-mechanism imbalanced-data

asked May 16 '21 at 12:47

neel g

2

votes

3 answers

Measuring performance of customer purchase predictions

My goal is to develop a model that predicts next customer purchases in USD (Update: During the time period of the dataset, if no purchase was made by the customer, the next purchase label is set to zero). I am trying to determine what would be the…

predictive-modeling metric imbalanced-data rmse

asked Mar 20 '22 at 15:06

Shlomi Schwartz

2

votes

2 answers

Why does class_weight usually outperform SMOTE?

I'm trying to figure out what exactly class_weight from sklearn does. When working with imbalanced datasets, I'm always using class_weight because the results are usually better than using SMOTE. However, I'm not sure why. I've tried to find an…
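For context, scikit-learn documents the `class_weight="balanced"` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` and the 90/10 toy labels are assumptions for illustration, and the code does not call scikit-learn itself.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights following the documented 'balanced' heuristic:
    n_samples / (n_classes * count_of_class). Hypothetical helper."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Assumed toy labels: 90 rows of class 0 and 10 rows of class 1.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# Total weight carried by each class after reweighting.
total_weight = {cls: weights[cls] * labels.count(cls) for cls in weights}
```

With these weights each class contributes the same total weight to the loss, and no synthetic rows are created, which is one difference from SMOTE-style resampling.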

classification class-imbalance smote imbalanced-data

asked Jan 20 '22 at 22:10

dsbr__0

2

votes

2 answers

Influence of imbalanced feature on prediction

I want to use XGB regression. The dataframe is conceptually similar to this table: index feature 1 feature 2 feature 3 encoded_1 encoded_2 encoded_3 y 0 0.213 0.542 0.125 0 0 1 0.432 1 …

python regression xgboost imbalanced-data

asked Nov 09 '21 at 09:22

Reut

2

votes

2 answers

Determining if a dataset is balanced

I'm learning about training sets and I have been provided with a set of labelled customer data that segments customers into one of two classes: A or B. The dataset also contains gender, age and profession attributes for each customer. The…

imbalanced-data

asked Oct 04 '21 at 03:02

user166673

1

vote

1 answer

Dealing with high frequency tokens during masked Language modelling?

Suppose I am working with a Masked Language Model to pre-train on a specific dataset. In that dataset, most sequences have a particular token with a high frequency. Sample sequence: , , , , , ---> here tok4 is very…

machine-learning language-model imbalanced-data masking

asked May 14 '21 at 17:19

neel g

