
I am struggling with confusion matrices and their outputs. I thought I had followed all the steps correctly, but unfortunately it seems that something is going wrong.

I built and labelled a dataset on my own. It shows a class imbalance, so I decided to apply undersampling and oversampling and to look at the F1-score and recall, as suggested in many papers and online resources. The steps were:

  • split the data into train and test sets (80/20)
  • apply resampling only on the train set
  • apply pre-processing algorithm (BoW, TF-IDF, ...)
  • use different classifiers to get results
  • look at performance using confusion matrices (or alternatively ROC)
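
In code, the pipeline looks roughly like the sketch below (simplified; the library choices of scikit-learn and imbalanced-learn, the file path, the RandomOverSampler and the LogisticRegression baseline are placeholders, not exactly what I ran):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, f1_score
    from imblearn.over_sampling import RandomOverSampler

    # dataset with a 'text' column and a binary 'label' column (1 = spam)
    df = pd.read_csv("emails.csv")  # placeholder path

    # 1) split the data into train and test sets (80/20)
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
    )

    # 2) apply resampling only on the train set
    ros = RandomOverSampler(random_state=42)
    X_train_res, y_train_res = ros.fit_resample(X_train.to_frame(), y_train)

    # 3) pre-processing: TF-IDF fitted on the training texts only
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train_res["text"])
    X_test_vec = vectorizer.transform(X_test)

    # 4) train a classifier
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_vec, y_train_res)

    # 5) look at performance on the untouched test set
    y_pred = clf.predict(X_test_vec)
    print(confusion_matrix(y_test, y_pred))
    print("F1:", f1_score(y_test, y_pred))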

I tried different feature sets. On one dataset with less feature engineering, i.e., using only features extracted from the text, I got a maximum F1-score of 68%. With more features, which I thought would be significant for improving the model, I am getting at most 64%, which is strange considering the problem (email classification for spam detection). In other words, extracting features only from the text gives a better score than also extracting features from the email addresses (number of dots, suffix, registration date, ...). I would be glad for any tips and suggestions, as I don't think this should happen: I would expect higher results in the second case, when information from the email address is also included.

I am thinking of an overfitting problem or some other issue with model building. I would appreciate it if you could tell me your thoughts on this.

Thank you for all your help.

V_sqrt

1 Answer


I tried different feature sets. On one dataset with less feature engineering, i.e., using only features extracted from the text, I got a maximum F1-score of 68%. With more features, which I thought would be significant for improving the model, I am getting at most 64%, which is strange considering the problem (email classification for spam detection).

Typically this happens when the model is overfit: not enough data and/or too many features make the model pick up patterns which occur by chance in the training data.

Usually with text one has to remove the least frequent words in order to avoid overfitting. You might also want to check the additional features and remove anything which occurs too rarely.
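
For example, with scikit-learn's TfidfVectorizer rare words can be dropped via min_df (the toy corpus and the threshold below are only an illustration, to be tuned on your data):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # toy corpus (placeholder for your training texts)
    train_texts = ["free money now", "meeting at noon", "free pills", "see you at noon"]

    # keep only words appearing in at least 2 training documents;
    # the rarest words are the ones a model can memorize by chance
    vectorizer = TfidfVectorizer(min_df=2)
    X_train = vectorizer.fit_transform(train_texts)
    print(vectorizer.get_feature_names_out())  # rare words such as 'money' or 'pills' are dropped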

Also, the confusion matrix has given me weird outputs

       0      1
0    [[2036  161]
1    [   1 2196]]
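
For reference, assuming this is the output of scikit-learn's confusion_matrix (the numpy-style brackets suggest it), rows correspond to true classes and columns to predicted classes:

    from sklearn.metrics import confusion_matrix

    # Toy example: rows are true classes, columns are predicted classes.
    y_true = [0, 0, 1, 1]
    y_pred = [0, 1, 1, 1]
    print(confusion_matrix(y_true, y_pred))
    # [[1 1]   row 0: one true 0 predicted as 0 (TN), one predicted as 1 (FP)
    #  [0 2]]  row 1: no true 1 predicted as 0 (FN), two predicted as 1 (TP)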

Observations:

  • True class 0 has 2036+161 = 2197 instances and true class 1 has 1+2196 = 2197 instances, so the classes are perfectly balanced. This indicates the results were obtained on the resampled data.
  • Assuming class 1 is positive: 2196 True Positives (TP), 2036 True Negatives (TN), 161 False Positives (FP, true negatives predicted as positive) and 1 False Negative (FN, a true positive predicted as negative).
    • recall = 0.999, precision = 0.932, which gives an F1-score of about 0.96 (see the quick check after this list), probably due to the resampled data.
  • The second confusion matrix is also clearly obtained with the resampled data, and it shows perfect performance (F1-score is 1).
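
As a quick sanity check, these scores can be recomputed directly from the matrix entries:

    # counts read off the matrix above (class 1 taken as positive)
    tp, tn, fp, fn = 2196, 2036, 161, 1

    recall = tp / (tp + fn)                              # 2196 / 2197 ≈ 0.9995
    precision = tp / (tp + fp)                           # 2196 / 2357 ≈ 0.932
    f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.964
    print(recall, precision, f1)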

These matrices show the performance obtained on the resampled data, so it is essentially the performance on the training data. Since the performance on the real test set is much lower (64–68% F1), this confirms strong overfitting.

Erwan