46

Let's say we have trained two models, and we are judging them on accuracy. The first has an accuracy of 100% on the training set and 84% on the test set. Clearly over-fitted. The second has an accuracy of 83% on the training set and 83% on the test set.

On the one hand, model #1 is over-fitted, but on the other hand it still yields better performance on the unseen test set than the well-generalized model #2.

Which model would you choose to use in production, the first or the second, and why?

EitanT
  • 569
  • 4
  • 3

8 Answers

23

There are a couple of nuances here.

  1. The complexity question is very important: Occam's razor.
  2. CV: is the 84%/83% result truly the case? Test it with cross-validation on train+test (a quick sketch follows below).

Given this, personal opinion: Second one.

It is better to catch general patterns. You already know that the first model failed at that, because of the difference between its train and test scores. The 1% gap in test accuracy says nothing.
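
As a rough illustration of point 2 (my sketch, not part of the original answer): with scikit-learn-style estimators, the check might look like the code below, where model_1, model_2, X, and y are hypothetical placeholders for the two classifiers and the pooled data.

    # Sketch: does the 100%/84% vs 83%/83% pattern survive cross-validation?
    # model_1, model_2, X, y are placeholders, not names from the question.
    from sklearn.model_selection import cross_validate

    for name, model in [("model_1", model_1), ("model_2", model_2)]:
        scores = cross_validate(model, X, y, cv=5, scoring="accuracy",
                                return_train_score=True)
        print(name,
              "train: %.3f +/- %.3f" % (scores["train_score"].mean(),
                                        scores["train_score"].std()),
              "test: %.3f +/- %.3f" % (scores["test_score"].mean(),
                                       scores["test_score"].std()))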

Noah Weber
  • 5,829
  • 1
  • 13
  • 26
17

It depends mostly on the problem context. If predictive performance is all you care about, and you believe the test set to be representative of future unseen data, then the first model is better. (This might be the case for, say, health predictions.)

There are a number of things that would change this decision.

  1. Interpretability / explainability. This is indirect, but parametric models tend to be less overfit, and are also generally easier to interpret or explain. If your problem lies in a regulated industry, it might be substantially easier to answer requests with a simpler model. Related, there may be some ethical concerns with high-variance models or non-intuitive non-monotonicity.

  2. Concept drift. If your test set is not expected to be representative of production data (most business uses), then it may be the case that more-overfit models suffer more quickly from model decay. If instead the test data is just bad, the test scores may not mean much in the first place.

  3. Ease of deployment. While ML model deployment options are now becoming much easier and more sophisticated, a linear model is still generally easier to deploy and monitor.

See also
Can we use a model that overfits?
What to choose: an overfit model with higher evaluation score or a non-overfit model with lower one?
https://stats.stackexchange.com/q/379589/232706
https://stats.stackexchange.com/q/220807/232706
https://stats.stackexchange.com/q/494496/232706
https://innovation.enova.com/from-traditional-to-advanced-machine-learning-algorithms/

(One last note: the first model may well be amenable to some sort of regularization, which will trade away training accuracy for a simpler model and, hopefully, a better testing accuracy.)
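
As a rough illustration of that last note (my sketch, not the answer's): with a scikit-learn logistic regression, the regularization strength C trades training fit for simplicity; X_train, y_train, X_test, and y_test are hypothetical placeholders.

    # Sketch: sweep regularization strength and watch training accuracy fall
    # while (hopefully) test accuracy recovers. Data names are placeholders.
    from sklearn.linear_model import LogisticRegression

    for C in [100.0, 1.0, 0.01]:  # larger C = weaker regularization
        clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        print(f"C={C}: train={clf.score(X_train, y_train):.3f}, "
              f"test={clf.score(X_test, y_test):.3f}")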

Ben Reiniger
  • 12,855
  • 3
  • 20
  • 63
10

The first has an accuracy of 100% on training set and 84% on test set. Clearly over-fitted.

Maybe not. It's true that 100% training accuracy is usually a strong indicator of overfitting, but it's also true that an overfit model should perform worse on the test set than a model that isn't overfit. So if you're seeing these numbers, something unusual is going on.

If both model #1 and model #2 used the same method for the same amount of time, then I would be rather reluctant to trust model #1. (And if the difference in test error is only 1%, it wouldn't be worth the risk in any case; 1% is noise.) But different methods have different characteristics with regard to overfitting. When using AdaBoost, for example, test error has often been observed not only not to increase, but actually to continue decreasing even after the training error has gone to 0 (an explanation can be found in Schapire et al. 1997). So if model #1 used boosting, I would be much less worried about overfitting, whereas if it used linear regression, I'd be extremely worried.

The solution in practice would be to not make the decision based only on those numbers. Instead, retrain on a different training/test split and see if you get similar results (time permitting). If you see approximately 100%/83% training/test accuracy consistently across several different training/test splits, you can probably trust that model. If you get 100%/83% one time, 100%/52% the next, and 100%/90% a third time, you obviously shouldn't trust the model's ability to generalize. You might also keep training for a few more epochs and see what happens to the test error. If it is overfitting, the test error will probably (but not necessarily) continue increasing.
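
A sketch of that repeated-split check, assuming a scikit-learn-compatible estimator; model, X, and y are placeholders:

    # Sketch: retrain on several different train/test splits and see whether
    # the ~100%/83% pattern is stable or swings wildly between splits.
    from sklearn.base import clone
    from sklearn.model_selection import train_test_split

    for seed in range(5):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=seed)
        m = clone(model).fit(X_tr, y_tr)
        print(f"split {seed}: train={m.score(X_tr, y_tr):.3f}, "
              f"test={m.score(X_te, y_te):.3f}")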

Ray
  • 201
  • 1
  • 3
5

These numbers suggest that the first model is not, in fact, overfit. Rather, they suggest that your data (both training and test) have few points near the decision boundary. Suppose you're trying to classify everyone as older or younger than 13 y.o. If your test set contains only infants and sumo wrestlers, then "older if weight > 100 kg, otherwise younger" is going to work really well on the test set, but not so well on the general population.

The bad part of overfitting isn't that the model is doing really well on the training set; it's that it's doing poorly in the real world. Doing really well on the training set is an indicator of this possibility, not a bad thing in and of itself.

If I absolutely had to choose one, I would take the first, but with trepidation. I'd really want to do more investigation. What are the differences between train and test set, that are resulting in such discrepancies? The two models are both wrong on about 16% of the cases. Are they the same 16% of cases, or are they different? If different, are there any patterns about where the models disagree? Is there a meta-model that can predict better than chance which one is right when they disagree?
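
A quick way to start that investigation (my sketch; model_1, model_2, X_test, and y_test are placeholders):

    # Sketch: do the two models fail on the same ~16% of test cases or on
    # different ones? model_1/model_2 and the test arrays are placeholders.
    import numpy as np

    wrong_1 = model_1.predict(X_test) != y_test
    wrong_2 = model_2.predict(X_test) != y_test

    print("both wrong:    %.3f" % np.mean(wrong_1 & wrong_2))
    print("only #1 wrong: %.3f" % np.mean(wrong_1 & ~wrong_2))
    print("only #2 wrong: %.3f" % np.mean(~wrong_1 & wrong_2))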

Acccumulation
  • 311
  • 1
  • 3
4

Obviously the answer is highly subjective; in my case, clearly the second. Why? There's nothing worse than seeing a customer running a model in production that isn't performing as expected. I once literally had a technical CEO who wanted a report of how many customers had actually left in a given month versus what the customer churn prediction model had predicted. It was not fun :-(. Since then, I strongly favor high-bias/low-variance models.

FrancoSwiss
  • 1,087
  • 6
  • 10
3

It seems a lot of people misunderstand overfitting here. Overfitting is not the gap between train and test performance. Overfitting is when you add complexity to a model and there is no return on investment or, most of the time, an outright loss.

See page 38 of Elements of Statistical Learning for the graph below. An overfit model is one that is to the right of the minimum of test error. Notice that for the best-fitting model, the gap between train and test is still relatively high. The correct answer is to choose the model with 84% accuracy, assuming you know that the 1% difference is statistically significant. I'm of course also assuming interpretability is not of concern here.

[Figure: training and test error versus model complexity, from The Elements of Statistical Learning, page 38]
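
On checking whether the 1% difference is statistically significant: since both models are scored on the same test cases, a McNemar-style test on the paired errors is one option. A sketch, with model_1, model_2, X_test, and y_test as placeholders:

    # Sketch: McNemar's test on the paired test-set errors of the two models.
    # All variable names below are placeholders.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    wrong_1 = model_1.predict(X_test) != y_test
    wrong_2 = model_2.predict(X_test) != y_test

    # 2x2 table: rows = model_1 correct/wrong, columns = model_2 correct/wrong
    table = [[np.sum(~wrong_1 & ~wrong_2), np.sum(~wrong_1 & wrong_2)],
             [np.sum(wrong_1 & ~wrong_2), np.sum(wrong_1 & wrong_2)]]
    print(mcnemar(table, exact=True))  # small p-value -> difference unlikely to be noise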

Nick Corona
  • 133
  • 1
  • 7
0

If your options are indeed "100% on train / 84% on validation" vs. "83% on train / 83% on validation", I'd feel safer with the second one. But really, I'd take a third option: try to tweak the first model to reduce overfitting (with the usual methods), hopefully squeezing a bit more accuracy out of it.
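
One way that tweaking could look in practice (my sketch, not the answer's; it assumes the first model is something like a decision tree and that X_train and y_train exist):

    # Sketch: search over complexity-limiting hyperparameters to rein in an
    # overfit model. The estimator choice and parameter grid are assumptions.
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": [3, 5, 10, None],
                    "min_samples_leaf": [1, 5, 20]},
        cv=5, scoring="accuracy",
    )
    search.fit(X_train, y_train)  # X_train, y_train are placeholders
    print(search.best_params_, search.best_score_)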

Itamar Mushkin
  • 1,129
  • 5
  • 17
0

Above all, run cross-validation on the training and test data. If you still get the same kind of result, then choose the second model.

The first model has a very large difference in accuracy between the training and test sets; it is a very specific (high-variance) model. There is also a chance that its high accuracy on the test set is due to data leakage.

The second model is a more general purpose model with acceptable accuracy results on both sets.

Ethan
  • 1,657
  • 9
  • 25
  • 39