
The accuracy before oversampling:

On training: 98.54%, on testing: 98.21%

The accuracy after oversampling:

On training: 77.92%, on testing: 90.44%

What does this mean, and how can I increase the accuracy?

Edit:

Classes before SMOTE:

    dataset['Label'].value_counts()

BENIGN           168051
Brute Force        1507
XSS                 652
Sql Injection        21

Classes after SMOTE:

BENIGN           117679 
Brute Force      117679 
XSS              117679 
Sql Injection    117679 

I used the following models:

- Random Forest: train score 0.49, test score 0.85
- Logistic Regression: train score 0.72, test score 0.93
- LSTM: train score 0.79, test score 0.98


Mimi

2 Answers


Accuracy is not a very good metric in general, and especially not in the presence of serious class imbalance. In your case, always predicting BENIGN would achieve an accuracy of 98.72% but would be useless, while your models might be useful despite having lower accuracy.
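For instance, a minimal sketch of that baseline, using the class counts given in the question:

    import pandas as pd

    # Class counts before SMOTE, as given in the question
    counts = pd.Series(
        {"BENIGN": 168051, "Brute Force": 1507, "XSS": 652, "Sql Injection": 21}
    )

    # A classifier that always predicts the majority class gets this accuracy
    baseline = counts.max() / counts.sum()
    print(f"Always-BENIGN accuracy: {baseline:.2%}")  # ~98.72%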

That oversampling hurts your training accuracy is natural. The largest effect of oversampling is that the predicted probabilities are shifted, and if you are predicting the class as the one with largest predicted probability, this will result in many more predictions of the minority classes, which is wrong from an accuracy point of view. (One thing wasn't made clear in the post: are you measuring performance on the resampled data, or the original? The resampled data won't suffer from the effect above, but might well suffer from poor predictions of the tiny Sql injection class, which might not have enough signal to properly identify.)
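To make that last point concrete, here is a minimal sketch that resamples only the training split, so the test score reflects the original distribution. A synthetic imbalanced dataset stands in for the question's data (an assumption, since the real features are not shown):

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced stand-in for the real data
    X, y = make_classification(
        n_samples=20000, n_classes=4, n_informative=6,
        weights=[0.96, 0.025, 0.012, 0.003], random_state=42
    )

    # Split first, so the test set keeps the original class distribution
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Oversample the training data only, never the test data
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)

    # Accuracy on the resampled train data vs. on the untouched test data
    print("Train (resampled):", accuracy_score(y_res, clf.predict(X_res)))
    print("Test (original):  ", accuracy_score(y_test, clf.predict(X_test)))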

As Dave says in a comment, it's best to start without oversampling and to create a metric that properly captures the cost/benefit tradeoff of true/false positives/negatives for each class. After that, oversampling might be beneficial, but if you are using the predicted probabilities, it is unlikely to give a huge lift.
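Continuing the sketch above, a per-class report is one simple way to see that tradeoff, since it exposes what a single accuracy number hides:

    from sklearn.metrics import classification_report

    # Per-class precision/recall/F1 on the held-out (original) test set
    print(classification_report(y_test, clf.predict(X_test), digits=3))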

Ben Reiniger

It is odd that the test accuracy is greater than the training accuracy. However, after looking at the distribution of classes, some plausible observations/explanations are:

  1. The classes are highly imbalanced for a multi-class classification setting.
  2. You are using SMOTE for oversampling. You could also try the Adaptive Synthetic Sampling approach (ADASYN) and check whether the result improves; a sketch follows this list.
  3. Most importantly, you should optimise for recall or F1 score, given that your classes are highly imbalanced; accuracy is not a preferred metric in a highly imbalanced classification problem. I would recommend optimising for recall.
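
A minimal sketch of points 2 and 3 together, again with a synthetic imbalanced dataset standing in for the real one (an assumption), using ADASYN in place of SMOTE and reporting macro recall/F1:

    from imblearn.over_sampling import ADASYN
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score, recall_score
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced stand-in for the real data
    X, y = make_classification(
        n_samples=20000, n_classes=4, n_informative=6,
        weights=[0.96, 0.025, 0.012, 0.003], random_state=0
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    # ADASYN focuses synthetic samples on harder-to-learn minority regions
    X_res, y_res = ADASYN(random_state=0).fit_resample(X_train, y_train)

    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    y_pred = clf.predict(X_test)

    # Macro-averaged recall/F1 weight every class equally, unlike accuracy
    print("Macro recall:", recall_score(y_test, y_pred, average="macro"))
    print("Macro F1:    ", f1_score(y_test, y_pred, average="macro"))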

Possible recommendations:

  1. Hyperparameter tuning (see the sketch after this list)
  2. Better regularisation
  3. K-fold cross-validation
  4. Make sure that the train, validation, and test sets are distinct
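
For points 1, 3, and 4 together, a sketch using a cross-validated grid search scored on macro recall rather than accuracy (the grid values and the synthetic data are illustrative assumptions):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

    # Synthetic imbalanced stand-in for the real data
    X, y = make_classification(
        n_samples=5000, n_classes=4, n_informative=6,
        weights=[0.9, 0.05, 0.03, 0.02], random_state=0
    )

    # The held-out test set stays untouched during tuning (point 4)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    # Stratified K-fold keeps the class ratios in every fold (point 3)
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
        scoring="recall_macro",
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    )
    search.fit(X_train, y_train)

    print("Best params:", search.best_params_)
    print("Test macro recall:", search.score(X_test, y_test))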
Akash Dubey