
For a dataset I want to use xgboost for the optimal ensembling of $n$ forecasts instead of simply combining them with their arithmetic mean. I found that xgboost generates combined forecasts that are worse than many of the $n$ individual forecasts the model could choose from.

I do not understand why this happens. To illustrate my observation I created the toy dataset below. The artificial target variable is generated by $$y = \frac{x_1+x_2}{2} \quad \mbox{with } x_1, x_2 \sim N(0,1)$$ Given the deterministic relationship between $y$ and the two explanatory variables $x_1$ and $x_2$, xgboost could make perfect forecasts, but it does not; the linear model easily does. Since this is the simplest multivariate linear regression model I can think of, and xgboost fails on it, I am wondering about the implications.

  • Why is this the case? What are the limitations of tree models for regression?
  • Why, then, is xgboost used for stacking and ensembling of forecasts if it cannot reproduce the MSE-minimizing arithmetic mean as the optimal combination mechanism?

Note that xgboost's parameters do not change this. I tried many parameter settings and the results are never perfect (a small parameter sweep is sketched after the xgboost block below).

Data Generation

library(tidyverse)
library(xgboost)
n <- 1000
param0 <- list("objective"  = "reg:linear", "eval_metric" = "rmse")  #"reg:linear" is the older alias of "reg:squarederror"
set.seed(1)
df <- tibble(x1 = rnorm(n), x2 = rnorm(n), y = (x1+x2)/2)

xgboost

xgtrain <- xgb.DMatrix(as.matrix(df[1:900,c("x1","x2")]), label = df$y[1:900], missing = NA)
xgtest <- xgb.DMatrix(as.matrix(df[901:1000,c("x1","x2")]), missing = NA)
#Cross-validation, just to illustrate that the algorithm
#learns something that is not correct, since the test data
#cannot be forecasted with zero error.
#xgb.cv(nrounds = 100, nfold = 10, params = param0, data = xgtrain)
#nrounds and other parameters do not get you to the perfect forecast
model <- xgb.train(nrounds = 100, params = param0, data = xgtrain)  
preds_xgb <- predict(model, xgtest)
#no perfect forecasts
sqrt(mean((preds_xgb-df$y[901:1000])^2))
0.04654448
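
To illustrate that tuning does not fix this, here is a minimal sketch (illustrative parameter values only, reusing param0, xgtrain, xgtest and df from above) that loops over a few settings; every tree-booster configuration still leaves a clearly nonzero test RMSE.

#Sketch: a few illustrative parameter settings -- none of them drives
#the test RMSE to zero, because the tree booster can only produce
#piecewise-constant predictions
settings <- list(
  list(max_depth = 2,  eta = 0.3,  nrounds = 100),
  list(max_depth = 6,  eta = 0.1,  nrounds = 500),
  list(max_depth = 10, eta = 0.05, nrounds = 1000)
)
for (s in settings) {
  m <- xgb.train(nrounds = s$nrounds, data = xgtrain,
                 params = c(param0, list(max_depth = s$max_depth, eta = s$eta)))
  p <- predict(m, xgtest)
  cat("max_depth =", s$max_depth, "eta =", s$eta, "nrounds =", s$nrounds,
      "test RMSE =", sqrt(mean((p - df$y[901:1000])^2)), "\n")
}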

Linear regression

model <- lm(y ~ x1+x2, data = df[1:900,])
#0.5 and 0.5 for x1 and x2 as expected
model$coefficients 
preds_lm <- predict(model, df[901:1000,c("x1","x2")])
#perfect forecasts
sqrt(mean((preds_lm-df$y[901:1000])^2))
1.389314e-15
HOSS_JFL

2 Answers


I think the reason this happens is that tree-based methods have problems with linear relationships. Tree-based methods partition on individual variables, not on linear combinations of the variables. To fit a linear regression surface, a tree-based method has to make a lot of splits to obtain a low error. In principle, though, with enough sufficiently deep trees you should be able to overfit your training data, although it might take many trees.
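
To make the partition argument concrete, here is a quick sketch (reusing the xgtrain, xgtest and df objects from the question): a single tree of depth 3 can emit at most $2^3 = 8$ distinct prediction values, so it approximates the plane $y = (x_1+x_2)/2$ with a staircase.

#Sketch: one shallow tree -> piecewise-constant (staircase) predictions
stump <- xgb.train(nrounds = 1, data = xgtrain,
                   params = list(objective = "reg:linear", max_depth = 3))
preds_stump <- predict(stump, xgtest)
length(unique(preds_stump))                   #at most 8 plateaus instead of a continuous surface
sqrt(mean((preds_stump - df$y[901:1000])^2))  #correspondingly large test RMSE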

If your concern is making a perfect forecast, no tree-based method is able to do that, and this holds for most kinds of data. Because your data follow an exactly linear, noise-free relationship, you happen to be able to forecast perfectly with linear regression, but this won't happen with real-world data.
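
As a side note, the limitation sits in the tree booster rather than in xgboost as a package: a rough sketch with the linear booster (booster = "gblinear"), which boosts the coefficients of a linear model, should get the test error much closer to zero on this data (though not to machine precision like lm).

#Sketch: same data, linear booster instead of trees
lin_params <- list(booster = "gblinear", objective = "reg:linear", eval_metric = "rmse")
lin_model <- xgb.train(nrounds = 200, params = lin_params, data = xgtrain)
preds_lin <- predict(lin_model, xgtest)
sqrt(mean((preds_lin - df$y[901:1000])^2))   #much smaller than with the tree booster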

David Masip

(Adding to what @David said above.)

The short answer is:

  • You can't expect tree-based models to extrapolate (see the quick sketch below)...
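
A quick toy sketch of that point (illustrative values only): train a tree booster on $y = x$ with $x \in [-2, 2]$ and then ask for predictions far outside that range.

#Sketch: tree models cannot extrapolate beyond the training range
set.seed(2)
x_small <- runif(500, -2, 2)
dtrain1d <- xgb.DMatrix(matrix(x_small, ncol = 1), label = x_small)   #target y = x
m1d <- xgb.train(nrounds = 100, data = dtrain1d,
                 params = list(objective = "reg:linear"))
predict(m1d, xgb.DMatrix(matrix(c(5, 10), ncol = 1)))
#both predictions stay near the top of the training range (about 2),
#whereas a linear model would return 5 and 10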

I had asked on Slack, and this was the reply (quoting miguel_perez): realize that in your example you are approximating a line with a staircase. Even discarding other errors, suspect number one would be not enough data points. Trees are just not the proper tool to approximate lines, especially with not enough data...

Or you don't have enough features to do the same...(cpmp)

Also, if you want to do regression only, there are other regressors: Vowpal Wabbit, KNN, etc.

Aditya