From the XGBoost tutorial, my understanding is that as each tree grows, all variables are scanned as candidate splits at each node, and the split with the maximum gain is chosen. My question is: if I add some noise variables to the data set, would these noise variables influence the variable selection (for each tree grown)? My logic is that because these noise variables do NOT give the maximum-gain split at all, they would never be selected, and thus they would not influence tree growth. A sketch of the split search I am describing is below.
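For reference, here is a minimal sketch (my own, not from any official source) of the greedy split search described above, using the gain formula from the XGBoost paper with squared-error gradients (g = pred − y, h = 1). The data and parameter values are made up for illustration; the real library uses faster approximate/histogram methods, but the selection logic is the same.

```python
# Exhaustive greedy split search: for every feature and threshold, compute
#   Gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
# and keep the best. Illustrative only; not the library's actual implementation.
import numpy as np

def best_split(X, g, h, lam=1.0, gamma=0.0):
    """Scan all features and thresholds, return (feature, threshold, gain)."""
    n, d = X.shape
    G, H = g.sum(), h.sum()
    best = (None, None, 0.0)
    for j in range(d):
        order = np.argsort(X[:, j])
        g_sorted, h_sorted = g[order], h[order]
        G_L = np.cumsum(g_sorted)[:-1]          # left-side gradient prefix sums
        H_L = np.cumsum(h_sorted)[:-1]
        gains = 0.5 * (G_L**2 / (H_L + lam)
                       + (G - G_L)**2 / (H - H_L + lam)
                       - G**2 / (H + lam)) - gamma
        k = int(np.argmax(gains))
        if gains[k] > best[2]:
            best = (j, float(X[order[k], j]), float(gains[k]))
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)   # only feature 0 is informative
g = -y                            # g = pred - y with initial pred = 0
h = np.ones(200)                  # h = 1 for squared-error loss
print(best_split(X, g, h))        # expect feature 0 to win
```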

If the answer is yes, then is it true that "the more variables, the better for XGBoost"? Let's not consider training time.

Also, if the answer is yes, is it true that "we do not need to filter out non-important variables from the model"?

Thank you!

WCMC

1 Answer


My logic is that because these noise variables do NOT give the maximum-gain split at all, they would never be selected, and thus they would not influence tree growth.

This is only strictly correct for very large, near-infinite data sets, where the number of samples in your training set gives good coverage of all variations of the variables. In practice, with enough dimensions you end up with a lot of sampling noise, because your coverage of the space of possible examples gets weaker the more dimensions your data has.

Noise on weak variables that happens to correlate by chance with the target variable can limit the effectiveness of boosting algorithms. This is more likely to happen at deeper splits in the decision tree, where the data being assessed has already been narrowed down to a small subset.

The more variables you add, the more likely it is that some weakly correlated variable just happens to look good to the split-selection algorithm for some specific combination of rows. The resulting trees then learn this noise instead of the intended signal, and ultimately generalise badly.
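A rough way to see this for yourself (a sketch of my own, assuming the `xgboost` and `scikit-learn` packages are installed; the feature counts and seeds are arbitrary, and exact numbers will vary): append pure-noise columns to a simple data set, and you will typically see both some lost test accuracy and some split importance diverted onto the noise.

```python
# Compare XGBoost on 5 informative features vs. the same 5 plus 200
# pure-noise columns. Illustrative only; results depend on seed and version.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
n = 500
X_signal = rng.normal(size=(n, 5))                          # 5 real features
y = (X_signal[:, 0] + 0.5 * X_signal[:, 1] > 0).astype(int)
X_noisy = np.hstack([X_signal, rng.normal(size=(n, 200))])  # + 200 noise cols

for name, X in [("signal only", X_signal), ("plus 200 noise columns", X_noisy)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = XGBClassifier(n_estimators=200, max_depth=4)
    model.fit(X_tr, y_tr)
    acc = model.score(X_te, y_te)
    noise_share = model.feature_importances_[5:].sum()  # importance on noise
    print(f"{name}: test accuracy={acc:.3f}, importance on noise={noise_share:.3f}")
```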

In practice, I have found XGBoost quite robust to noise on a small scale. However, I have also found that it will sometimes select poor-quality engineered variables in preference to better-correlated data, for similar reasons. So it is not an algorithm where "the more variables, the better", and you do need to care about possible low-quality features.
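If you do decide to filter, one common approach (a sketch, not a prescription; continuing from the noisy split in the example above) is to use the model's own importances to drop weak features and refit. The "median" threshold here is a judgment call, not a recommended value.

```python
# Fit once, keep only features whose importance exceeds the median,
# then refit on the reduced feature matrix.
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

selector = SelectFromModel(XGBClassifier(n_estimators=200, max_depth=4),
                           threshold="median")
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)
model = XGBClassifier(n_estimators=200, max_depth=4).fit(X_tr_sel, y_tr)
print("kept features:", selector.get_support().sum(),
      "test accuracy:", model.score(X_te_sel, y_te))
```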

Neil Slater