
The project I am working on allows users to create Stock Screeners based on both technical and fundamental criteria. Stock Screeners are then "backtested" by simulating the results of applying them over the last 10 years using Point-in-Time data. I get back the list of trades and an overall graph of performance. (If that is unclear, I have an overview here and there with more details.)

Now, a common problem is that users create overfitted stock screeners. I would love to give them a warning when a screener is likely to be overfitted.

Fields I have to work with

  • All trades made by the Stock Screener
    • Stock, Start Date, Start Price, End Date, End Price
  • S&P 500 performance for the same time frame
  • Market Cap, Sector, and Industry of each Stock
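
For concreteness, here is a minimal sketch of how those fields might be arranged for analysis; the column names and sample values are just illustrative, not from my system:

    import pandas as pd

    # Illustrative trade records; columns mirror the fields listed above,
    # values are made up.
    trades = pd.DataFrame(
        {
            "stock":       ["AAPL", "XOM"],
            "sector":      ["Technology", "Energy"],
            "industry":    ["Consumer Electronics", "Oil & Gas"],
            "market_cap":  [2.5e12, 4.0e11],
            "start_date":  pd.to_datetime(["2015-01-05", "2016-03-01"]),
            "start_price": [110.0, 80.0],
            "end_date":    pd.to_datetime(["2015-07-06", "2016-09-01"]),
            "end_price":   [125.0, 85.0],
        }
    )

    # Per-trade return; the S&P 500 return over the same window can be computed
    # the same way from the benchmark series and used as a baseline.
    trades["trade_return"] = trades["end_price"] / trades["start_price"] - 1.0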

1 Answer


Learning curves (bias-variance decomposition) are the gold standard for detecting high variance, i.e. overfitting. Separate your data (in your case the "back data") into 60% training data and 40% testing data. Fit the model on the training data as you usually would and see how well it performs on the test data.
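
As a rough sketch of that split, assuming the backtest trades sit in a pandas DataFrame with the fields from the question (the file path and the chronological ordering are my own assumptions, not requirements):

    import pandas as pd

    # Backtest trades, one row per trade; the file path and column names are
    # assumptions for the sake of the example.
    trades = pd.read_csv("backtest_trades.csv", parse_dates=["start_date", "end_date"])

    # 60/40 split. Sorting by start date so the test set is strictly later than
    # the training set is my own suggestion for time-ordered data.
    trades = trades.sort_values("start_date").reset_index(drop=True)
    cutoff = int(len(trades) * 0.6)
    train, test = trades.iloc[:cutoff], trades.iloc[cutoff:]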

Finally, when you think you have the model that you want, split each of the training and test sets into 10-100 subsets, then retrain and test with incrementally larger sets. Apply your favorite performance metric and plot performance against the number of cases used for training and testing.

If the model is overfit (high variance), the curves will never come together. If the model is underfit (high bias), the curves will come together, but at lower performance than desired. For a well-performing model that is not overfit, the curves converge at an acceptable level of performance.
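
Here is a minimal sketch of that procedure, using a generic scikit-learn regressor as a stand-in for whatever predictive rule the screener encodes, synthetic placeholder data, and root mean square error as the metric; all of those choices are illustrative assumptions rather than part of your setup:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error

    # Placeholder data standing in for per-trade features (e.g. market cap,
    # sector dummies) and a target such as excess return over the S&P 500.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = X @ rng.normal(size=4) + rng.normal(scale=0.5, size=500)
    X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

    def learning_curve_rmse(model, X_tr, y_tr, X_te, y_te, n_steps=20):
        """Refit on incrementally larger slices of the training data and
        record train/test RMSE at each step."""
        counts, train_rmse, test_rmse = [], [], []
        for frac in np.linspace(0.1, 1.0, n_steps):
            n = max(2, int(len(X_tr) * frac))
            model.fit(X_tr[:n], y_tr[:n])
            counts.append(n)
            train_rmse.append(np.sqrt(mean_squared_error(y_tr[:n], model.predict(X_tr[:n]))))
            test_rmse.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
        return counts, train_rmse, test_rmse

    counts, tr, te = learning_curve_rmse(
        GradientBoostingRegressor(), X_train, y_train, X_test, y_test
    )

    # A persistent gap between the curves indicates high variance (overfitting);
    # curves that converge at a poor RMSE indicate high bias (underfitting).
    plt.plot(counts, tr, label="train RMSE")
    plt.plot(counts, te, label="test RMSE")
    plt.xlabel("number of training cases")
    plt.ylabel("RMSE")
    plt.legend()
    plt.show()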

Here is an example of overfitting and underfitting with root mean square error as the performance metric: [figure: bias-variance decomposition via learning curves]

Here is a pretty good link on the process and here is another one. Hope this helps!

AN6U5