
After developing my predictive model using Random Forest, I get the following metrics:

        Train Accuracy ::  0.9764634601043997
        Test Accuracy  ::  0.7933284397683713
         Confusion matrix  [[28292  1474]
                            [ 6128   889]]

These are the results from this code:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split

    # Hold out 30% of the data for testing; 'bad_loans' is the target column
    training_features, test_features, training_target, test_target = train_test_split(
        df.drop(['bad_loans'], axis=1),
        df['bad_loans'],
        test_size=0.3,
        random_state=12)

    clf = RandomForestClassifier()
    trained_model = clf.fit(training_features, training_target)
    predictions = trained_model.predict(test_features)

    print("Train Accuracy :: ", accuracy_score(training_target, trained_model.predict(training_features)))
    print("Test Accuracy  :: ", accuracy_score(test_target, predictions))
    print("Confusion matrix ", confusion_matrix(test_target, predictions))

However, I'm a little confused about how to interpret and explain these values.

What exactly do these 3 measures tell me about my model?

Thanks!

Pedro Alves

1 Answer


Definitions

  • Accuracy: The number of correct classifications divided by the total number of classifications.
  • Train accuracy: The accuracy of the model on the examples it was trained on.
  • Test accuracy: The accuracy of the model on examples it hasn't seen.
  • Confusion matrix: A tabulation of the actual class against the predicted class. Note that scikit-learn's confusion_matrix puts the actual class in the rows and the predicted class in the columns, as the sketch below illustrates.
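
A minimal sketch of how these are computed with scikit-learn; the labels here are made up purely for illustration:

    from sklearn.metrics import accuracy_score, confusion_matrix

    # Hypothetical labels, only to illustrate the conventions
    y_true = [0, 0, 0, 1, 1, 0]   # actual classes
    y_pred = [0, 0, 1, 1, 0, 0]   # predicted classes

    # accuracy = correct classifications / total classifications = 4/6
    print(accuracy_score(y_true, y_pred))       # 0.666...

    # Rows are the actual class, columns the predicted class:
    # [[3 1]    3 actual-0 predicted 0, 1 actual-0 predicted 1
    #  [1 1]]   1 actual-1 predicted 0, 1 actual-1 predicted 1
    print(confusion_matrix(y_true, y_pred))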

Overfitting

What I make of your results is that your model is overfitting. You can tell from the large gap between the train accuracy (0.98) and the test accuracy (0.79). Overfitting means the model learned rules that are specific to the train set, and those rules do not generalize beyond it.

Your confusion matrix puts that test accuracy in perspective. The largest class makes up about 81% of the test population (see the row totals in the percentage table below). Assuming your train and test sets have a similar distribution, any useful model has to beat that number: a simple 0R model, which always predicts the majority class and ignores all features, would already score about 81% accuracy, while your model scores just under 80% on the test set.
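
To make that 0R baseline concrete, scikit-learn ships it as DummyClassifier; a sketch reusing the variable names from your question:

    from sklearn.dummy import DummyClassifier

    # 0R baseline: always predict the most frequent class in the training data
    baseline = DummyClassifier(strategy='most_frequent')
    baseline.fit(training_features, training_target)

    # With ~81% of the test set in the majority class this scores ~0.81,
    # which any useful model should at least beat
    print(baseline.score(test_features, test_target))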

In-depth look at the confusion matrix

Expressed as percentages of the whole test set (remember: actual classes in the rows, predicted classes in the columns), your confusion matrix looks like this:

                 Predicted 1   Predicted 2   TOT
    Actual 1   |     77%     |      4%     |  81%
    Actual 2   |     17%     |      2%     |  19%
    TOT        |     94%     |      6%     |

You can infer from the column totals that your model predicts Class 1 94% of the time, while Class 1 actually occurs only 81% of the time (first row total). Your model is therefore over-predicting the majority class and underestimating Class 2. It could be that it learned specific (complex) rules on the train set that work against you in the test set.
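
If you are on scikit-learn 0.22 or newer, this percentage view can be produced directly; a sketch using the question's variables:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # normalize='all' divides every cell by the total number of test samples,
    # giving the fractions shown in the table above
    cm = confusion_matrix(test_target, predictions, normalize='all')
    print(np.round(cm, 2))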

It is also worth noting that the misclassified Class 2 examples (17 %-points, row 2, column 1) are what hurt your overall accuracy most, and relative to class size the imbalance is stark: measured against the sizes of the respective classes (81% and 19% of the population), the model misses only 4/81 ≈ 5% of Class 1 but 17/19 ≈ 89% of Class 2. In other words, the per-class accuracy (recall) is 77/81 ≈ 0.95 for Class 1 but only 2/19 ≈ 0.11 for Class 2: your model is decent at recognizing Class 1 but almost useless at recognizing Class 2.
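
Those per-class numbers are each class's recall, which scikit-learn can compute directly; a sketch reusing the question's variables:

    from sklearn.metrics import classification_report, recall_score

    # Recall per class: the fraction of each actual class predicted correctly,
    # roughly [0.95, 0.13] for the two classes called Class 1 and Class 2 above
    print(recall_score(test_target, predictions, average=None))

    # classification_report also includes precision and F1 per class
    print(classification_report(test_target, predictions))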

S van Balen