
I've been working on machine learning and bioinformatics for a while, and today I had a conversation with a colleague about the main general issues of data mining.

My colleague (who is a machine learning expert) said that, in his opinion, the arguably most important practical aspect of machine learning is how to understand whether you have collected enough data to train your machine learning model.

This statement surprised me, because I had never given that much importance to this aspect...

I then looked for more information on the internet, and I found this post on FastML.com reporting as rule of thumb that you need roughly 10 times as many data instances as there are features.

Two questions:

1 - Is this issue really particularly relevant in machine learning?

2 - Is the 10 times rule working? Are there any other relevant sources for this theme?

Sean Owen
DavideChicco.it

3 Answers


The ten times rule seems like a rule of thumb to me, but it is true that the performance of your machine learning algorithm may decrease if you do not feed it with enough training data.

A practical and data-driven way of determining whether you have enough training data is by plotting a learning curve, like the one in the example below:

[Figure: learning curve — training and test error plotted against training set size]

The learning curve represents the evolution of the training and test errors as you increase the size of your training set.

  • The training error increases as you increase the size of your dataset, because it becomes harder to fit a model that accounts for the increasing complexity/variability of your training set.
  • The test error decreases as you increase the size of your dataset, because the model is able to generalise better from a higher amount of information.

As you can see on the rightmost part of the plot, the two lines tend to reach an asymptote. Therefore, you will eventually reach a point at which increasing the size of your dataset has no impact on your trained model.

The distance between the asymptotes of the test and training errors is a representation of your model's overfitting. But more importantly, this plot tells you whether you need more data: if you plot the test and training errors for increasingly larger subsets of your training data and the lines do not seem to be reaching an asymptote, you should keep collecting more data.
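As a rough illustration (a minimal sketch in plain Python with a made-up noisy linear dataset — in practice you would use your own model and data), you can trace out such a curve by fitting on increasingly large training subsets and measuring both errors:

```python
import random

random.seed(0)

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mse(xs, ys, a, b):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: y = 2x + 1 plus Gaussian noise (variance 1).
data = [(x, 2 * x + 1 + random.gauss(0, 1))
        for x in (random.uniform(0, 10) for _ in range(1200))]
train, test = data[:1000], data[1000:]
test_x, test_y = zip(*test)

# Learning curve: (training size, training MSE, test MSE).
curve = []
for size in (10, 50, 100, 500, 1000):
    xs, ys = zip(*train[:size])
    a, b = fit_line(xs, ys)
    curve.append((size, mse(xs, ys, a, b), mse(test_x, test_y, a, b)))

for size, tr_err, te_err in curve:
    print(f"n={size:5d}  train MSE={tr_err:.3f}  test MSE={te_err:.3f}")
```

Once both errors flatten out near the noise level, more data of the same kind will not help this model.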

Pablo Suau
  1. Yes, the issue is certainly relevant, since your ability to fit the model will depend on the amount of data you have, but more importantly, it depends on the quality of the predictors.
  2. A 10-times rule might be a rule of thumb (and there are many others), but it really depends on the predictive utility of your features. E.g., the iris dataset is fairly small but easily solved, because the features yield good separation of the targets. Conversely, you could have 10 million examples and fail to fit if the features are weak.
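To illustrate the point about feature quality (a hypothetical sketch in plain Python with synthetic one-feature data, not any real dataset): a simple nearest-centroid classifier is nearly perfect with a handful of examples when the feature separates the classes well, and stuck at chance with vastly more examples when it does not:

```python
import random

random.seed(1)

def centroid_accuracy(data):
    """Train a nearest-centroid classifier on the first half, score on the second."""
    half = len(data) // 2
    train, test = data[:half], data[half:]
    cent = {}
    for label in (0, 1):
        vals = [x for x, y in train if y == label]
        cent[label] = sum(vals) / len(vals)
    correct = sum(1 for x, y in test
                  if min(cent, key=lambda c: abs(x - cent[c])) == y)
    return correct / len(test)

def make_data(n, sep):
    """One Gaussian feature per example; the two class means are `sep` apart."""
    return [(random.gauss(y * sep, 1.0), y)
            for y in (random.randrange(2) for _ in range(n))]

strong = centroid_accuracy(make_data(200, sep=5.0))     # well-separated classes
weak = centroid_accuracy(make_data(100_000, sep=0.0))   # uninformative feature
print(f"strong feature,    200 examples: accuracy {strong:.2f}")
print(f"weak feature,  100,000 examples: accuracy {weak:.2f}")
```

No amount of extra data fixes the second case: the feature simply carries no signal about the target.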
HEITZ

The "Rule of Ten" might work okay for some problems (e.g. linear models, or a known functional form), but it is NOT a general rule: a single counter-example is enough to show that it does not hold in general. For a proof (and a nice discussion), please see the paper by A. Siegel: https://www.sciencedirect.com/science/article/pii/S2772415821000110

In general, you need to know about your problem. You are guaranteed to be undersampled if you have fewer than 2^n samples (where n is the number of dimensions), and even with that many you can still miss features. Be careful with general rules that have no mathematical backing!
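To make the 2^n point concrete (a small arithmetic sketch): just placing one sample in every high/low combination of n dimensions already needs 2^n samples, which dwarfs the ten-times rule even at modest n:

```python
# Samples needed just to cover every high/low combination of n binary-split
# dimensions (2^n), compared with the "ten times the features" rule (10*n).
for n in (2, 10, 20, 30):
    print(f"{n:2d} dimensions: ten-times rule -> {10 * n:4d} samples, "
          f"one per orthant -> {2 ** n:,} samples")
```

At n = 30 the gap is already 300 versus more than a billion samples.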