I am referring to this question:

Nested cross-validation and selecting the best regression model - is this the right SKLearn process?

The answers there show that nested CV can estimate the generalization error of hyperparameter optimization for different algorithms. In my opinion, however, the choice between different algorithms is also an optimization process, and so it contributes its own error to the estimate. Therefore, either the algorithm choice should be part of the inner CV, or a third CV level would have to be introduced to evaluate the error of the algorithm choice. Is this assumption correct?

Kasra Manshaei
Felix Z.

2 Answers

In general you are right, and as far as I can see that is what the linked answer does: the models are compared with each other while the best tuning for each is found, both inside the loop. It looks fine.
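A minimal sketch of that idea, treating the algorithm choice as just another hyperparameter of the inner search. The synthetic data, the two candidate regressors, the grids, and the fold counts are illustrative assumptions, not taken from the linked answer.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import Pipeline

# Synthetic data, purely for illustration (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# One-step pipeline whose "model" step is itself searched over.
pipe = Pipeline([("model", Lasso())])

# Each dict pairs one algorithm with its own hyperparameter grid,
# so the inner search picks algorithm and tuning together.
param_grid = [
    {"model": [Lasso()], "model__alpha": [0.01, 0.1, 1.0]},
    {"model": [RandomForestRegressor(n_estimators=50, random_state=0)],
     "model__max_depth": [3, None]},
]

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: select algorithm + hyperparameters on the training part only.
search = GridSearchCV(pipe, param_grid, cv=inner_cv)

# Outer loop: score the whole selection procedure on held-out folds.
result = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)
outer_scores = result["test_score"]

# Which algorithm won the inner search in each outer fold.
chosen = [type(est.best_params_["model"]).__name__ for est in result["estimator"]]
print(chosen)
print(outer_scores.mean())
```

Here the outer scores estimate the performance of the full procedure "search over both algorithms, then refit the winner", so no third CV level is needed for the algorithm choice in this setup.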

About your point: yes. But in machine learning we have to stop or limit our attempts at some point, because the number of algorithms that can do the task is very large. We usually evaluate different families of algorithms and then narrow the search from there, but in the end we can never claim that the best answer we found is the best possible answer. From another point of view, this is the main idea behind many research papers in ML: they creatively find or modify an algorithm and show on a benchmark dataset that it works better than previously applied algorithms.

Kasra Manshaei

I have often had this question in my mind. If I have a classification problem and I plan to use Lasso regression and/or Random Forest:

  1. should I consider model selection (Lasso or RandomForest) as part of the parameter tuning stage and report the generalization error for the combined procedure, including the choice of model, or
  2. should I calculate generalization errors separately for the Lasso and RandomForest algorithms using nested cross-validation, as if I had decided on one method but wanted to check how the other would have performed in comparison?

I would like to favor (2) because it enables a comparison of the two methods. When I then classify new data, I would report the predictions of both models together with their estimated accuracies. For example, if the predictions are very different, it would be good to know whether both methods had similar accuracy (generalization error) or not.
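Option (2) could be sketched roughly as follows. The synthetic data, the grids, and the fold counts are assumptions for illustration, and an L1-penalized logistic regression stands in for "Lasso" because the task is classification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data, purely for illustration (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Each candidate gets its own hyperparameter grid (values are assumptions).
candidates = {
    "l1_logistic": (
        LogisticRegression(penalty="l1", solver="liblinear"),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [3, None]},
    ),
}

# Run a separate nested CV per algorithm, reusing the same outer folds
# so the two accuracy estimates are directly comparable.
nested_scores = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=inner_cv, scoring="accuracy")
    nested_scores[name] = cross_val_score(
        search, X, y, cv=outer_cv, scoring="accuracy"
    )

for name, scores in nested_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Each estimate is valid for its own method when reported side by side. If the better of the two numbers is then used to pick a single model, that number has become part of a selection step, which is the concern raised in the question.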

Mark Nh