I think in the original paper they suggest using $\log_2 N + 1$, but either way the idea is the following:
The number of randomly selected features can influence the generalization error in two ways: selecting many features increases the strength of the individual trees, whereas reducing the number of features lowers the correlation among the trees, which reduces the generalization error of the forest as a whole.
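To see this trade-off empirically, here is a minimal sketch (assuming scikit-learn and a synthetic dataset, so the specific numbers are purely illustrative) that sweeps the number of features tried at each split and tracks the out-of-bag error:

```python
# Sketch only: synthetic data, arbitrary grid of max_features values.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, random_state=0)

for m in [2, 5, 7, 15, 30, 50]:   # number of features considered at each split
    rf = RandomForestClassifier(n_estimators=500, max_features=m,
                                oob_score=True, n_jobs=-1, random_state=0)
    rf.fit(X, y)
    # The OOB error reflects the combined effect of tree strength and tree correlation.
    print(f"max_features={m:3d}  OOB error={1 - rf.oob_score_:.3f}")
```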
What's interesting is that the authors of Random Forests (pdf) find an empirical difference between classification and regression:
> An interesting difference between regression and classification is that the correlation increases quite slowly as the number of features used increases.
Therefore, for regression, $N/3$ is often recommended, which gives larger values than $\sqrt N$.
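Just to make the difference in magnitude concrete, here is a quick comparison of the heuristics for a hypothetical $N = 50$ (the value of $N$ is arbitrary):

```python
import math

N = 50  # hypothetical number of features
print("sqrt(N)     :", round(math.sqrt(N)))       # common classification default
print("N / 3       :", max(1, N // 3))            # common regression recommendation
print("log2(N) + 1 :", int(math.log2(N)) + 1)     # Breiman's original suggestion
```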
In general, there is no clear justification for $\sqrt N$ or $\log N$ for classification problems, other than that lower correlation among the trees has been shown to decrease the generalization error by enough to more than offset the decrease in the strength of the individual trees. In particular, the authors note that the range over which this trade-off can decrease the generalization error is quite large:
> The in-between range is usually large. In this range, as the number of features goes up, the correlation increases, but PE*(tree) compensates by decreasing.
(PE* being the generalization error)
As the authors of *The Elements of Statistical Learning* put it:
> In practice the best values for these parameters will depend on the problem, and they should be treated as tuning parameters.
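In scikit-learn this parameter is called `max_features`; a rough sketch of tuning it by cross-validation (the dataset and the candidate grid are placeholders for your own problem) could look like this:

```python
# Sketch only: treat max_features as a tuning parameter and pick it by CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=40,
                           n_informative=8, random_state=0)

n_features = X.shape[1]
# Candidate values based on the heuristics discussed above, plus "all features".
grid = {"max_features": sorted({int(np.sqrt(n_features)),
                                int(np.log2(n_features)) + 1,
                                n_features // 3,
                                n_features})}

search = GridSearchCV(RandomForestClassifier(n_estimators=300, random_state=0),
                      param_grid=grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)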
One thing this depends on is the number of categorical variables in your problem. If you have many categorical variables that are encoded as dummy variables, it usually makes sense to increase the parameter. Again, from the Random Forests paper:
> When many of the variables are categorical, using a low [number of features] results in low correlation, but also low strength. [The number of features] must be increased to about two-three times $\mathrm{int}(\log_2 M + 1)$ to get enough strength to provide good test set accuracy.
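As a rough illustration of that rule, for a hypothetical $M = 120$ columns after dummy encoding (the number is made up, as is the factor of two to three), the suggested range would be:

```python
# Sketch only: M and the 2-3x multiplier are placeholders, not fixed constants.
import math

M = 120                              # e.g. total columns after dummy encoding
base = int(math.log2(M)) + 1         # int(log2 M + 1) from the paper
print("default:", base)              # 7 for M = 120
print("suggested range:", 2 * base, "to", 3 * base)  # roughly 14 to 21 features per split
```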