
I want to train a regression model with a tree-based algorithm such as XGBoost. Suppose there are 5 features x1, x2, x3, x4, x5 and a target y, and some experts say that x2 minus x3 is highly correlated with y. Should I put x2 - x3 into the model as a sixth feature, or will XGBoost learn it automatically if I just put x1~x5 into the model?

As far as I know, a linear model can learn such a formula from the features; can tree-based methods do the same? And if they can, does the size of the data matter?

PoCheng.Lin

1 Answer


XGBoost will not learn arithmetic combinations such as $x_2 - x_3$ on its own: each split uses a single feature, so the trees can only approximate such a relationship with many axis-aligned splits rather than recover the formula itself. Feature generation is therefore often used to enhance the explanatory power of $X$; candidates like $x_n - x_k$ or $x_n / x_k$ are commonly checked and added. There are also tools for automated feature generation, e.g. "Featuretools" for Python.
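For illustration, here is a minimal sketch of adding the difference as an explicit sixth column before fitting XGBoost (the synthetic data, feature indices, and model settings are made up for this example):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # columns x1..x5
# Synthetic target driven by x2 - x3 (0-based columns 1 and 2), purely illustrative
y = 3 * (X[:, 1] - X[:, 2]) + rng.normal(scale=0.1, size=1000)

# Append the expert-suggested feature so the trees can split on it directly,
# instead of approximating the difference with many axis-aligned splits
X_aug = np.column_stack([X, X[:, 1] - X[:, 2]])

model = xgb.XGBRegressor(n_estimators=200, max_depth=3)
model.fit(X_aug, y)
```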

To find out which interactions have the most explanatory power, you can fit shallow trees (only a few splits, say three) on each candidate interaction in turn, one shallow model per interaction, and compare the prediction error (e.g. MSE, MAE) for each case, such as:

$$y(x_1-x_2),\ y(x_1/x_2),\ \ldots,\ y(x_1-x_n),\ y(x_1/x_n),$$ $$y(x_2-x_1),\ y(x_2/x_1),\ \ldots,\ y(x_2-x_n),\ y(x_2/x_n),$$ $$\vdots$$

You can then keep only the interactions with "high" explanatory power, to avoid ending up with a massive number of features in the model.
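As a sketch of that screening loop, assuming scikit-learn is available (the helper name `screen_interactions` and the small epsilon guarding the ratios are illustrative choices, not part of any library):

```python
from itertools import permutations

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def screen_interactions(X, y, max_depth=3, cv=5):
    """Fit one shallow tree per candidate interaction and rank by CV MSE."""
    scores = {}
    for i, j in permutations(range(X.shape[1]), 2):
        candidates = {
            f"x{i+1}-x{j+1}": X[:, i] - X[:, j],
            # Crude epsilon guard against division by zero (an assumption here)
            f"x{i+1}/x{j+1}": X[:, i] / (X[:, j] + 1e-9),
        }
        for name, feat in candidates.items():
            tree = DecisionTreeRegressor(max_depth=max_depth)
            mse = -cross_val_score(tree, feat.reshape(-1, 1), y,
                                   scoring="neg_mean_squared_error", cv=cv).mean()
            scores[name] = mse
    # Lowest MSE first: these are the interactions worth keeping
    return sorted(scores.items(), key=lambda kv: kv[1])
```

Note that for a tree, $x_i - x_j$ and $x_j - x_i$ give mirror-image splits with identical fit quality, so in practice only one direction of each difference needs to be checked.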

Peter