When I use XGBRegressor to construct a boosted tree model from 8194 or fewer data points (i.e., n_train $\leq$ 8194, where n_train is defined in the code below) and randomly shuffle the data points before training, the fit method is order independent: it generates the same predictive model on every call, regardless of how the rows are shuffled. However, when I do the same for 8195 data points, fit is order dependent -- it generates a different predictive model on each call. Why is this?
I have read this paper on XGBoost and nearly all of the XGBoost documentation, and the non-subsampling algorithms described in both appear to be order independent for all n_train. So the source of the order dependence for large-n_train datasets is the mysterious part.
Below is a minimal Python script that illustrates the issue.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

M = 2                   # number of models to compare
tree_method = 'approx'  # tree_method of XGBRegressor. Also try 'hist' and 'exact'.
n_disp = 5              # number of elements of y_test_pred[m] to display

np.set_printoptions(precision=5, linewidth=1000, suppress=True)

# ------------------------------------------------------------------------------------------
def main_func():
    for n_samples in [10243, 10244]:
        # Construct X and y
        X, y = make_regression(n_samples=n_samples)

        # Split X and y for training and testing
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        n_train = y_train.shape[0]

        # Train the models and use them to predict y_test
        model = M * [None]
        y_test_pred = M * [None]
        for m in range(M):
            model[m] = train_model(n_train, X_train, y_train, X_test, y_test, m)
            y_test_pred[m] = model[m].predict(X_test)
            print('---')
            print(f'n_train = {n_train}')
            print(f'y_test_pred[m][:{n_disp}] for m = {m}:')
            print(y_test_pred[m][:n_disp])

# ------------------------------------------------------------------------------------------
def train_model(n_train, X_train, y_train, X_test, y_test, m):
    # Permute X_train and y_train
    p = np.random.permutation(n_train)
    X_train = X_train[p]
    y_train = y_train[p]

    # Construct and train the model
    model = XGBRegressor(tree_method=tree_method, random_state=42)
    model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)
    return model

# ------------------------------------------------------------------------------------------
main_func()
One run of this code yields:
---
n_train = 8194
y_test_pred[m][:5] for m = 0:
[ 138.66483 -20.09365 62.82829 -136.29303 -120.78113]
---
n_train = 8194
y_test_pred[m][:5] for m = 1:
[ 138.66483 -20.09365 62.82829 -136.29303 -120.78113]
---
n_train = 8195
y_test_pred[m][:5] for m = 0:
[ 20.70109 -125.59986 -140.2009 84.15887 -39.79109]
---
n_train = 8195
y_test_pred[m][:5] for m = 1:
[ -26.50723 -159.95743 -79.36356 108.11007 -38.723 ]
Note that for n_train = 8194, y_test_pred[m][:n_disp] is the same for all m, but for n_train = 8195 it is not.
Within the script, observe that I permute the rows of X_train and y_train inside train_model, before each call to fit. I would expect this to have no effect on the model produced by the fitting algorithm, given that, to my understanding, the feature values are sorted and binned near the start of the algorithm. However, if I comment out this permutation, the high-n_train order dependence disappears. Also note that within the XGBRegressor call, tree_method can be set to 'approx', 'hist', or 'auto', and random_state can be set to a fixed value, without eliminating the order dependence at large n_train.
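To confirm that the fitted models themselves differ, rather than something on the prediction side, one could compare the trained boosters directly. Below is a sketch of such a check; the helper name boosters_identical is mine, and I am assuming that identical text dumps from get_dump imply identical trees (up to the precision of the printed values).

def boosters_identical(model_a, model_b):
    # get_dump() returns one text string per tree; with_stats=True also
    # includes gain/cover statistics, so equal dumps mean the two models
    # grew the same trees with the same leaf values (to printed precision).
    dump_a = model_a.get_booster().get_dump(with_stats=True)
    dump_b = model_b.get_booster().get_dump(with_stats=True)
    return dump_a == dump_b

Calling something like boosters_identical(model[0], model[1]) inside main_func would show whether the trees themselves diverge, not just the predictions.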
Finally, there are several comments in the XGBoost documentation that might initially seem relevant:
- The online FAQ for XGBoost has an entry on a "Slightly different result between runs", which it says "could happen, due to non-determinism in floating point summation order and multi-threading. Also, data partitioning changes by distributed framework can be an issue as well. Though the general accuracy will usually remain the same."
- And the Python API Reference states that "Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm."
For various reasons, however, I suspect that these notes are either unrelated to the behavior above or inadequate to explain the abrupt transition to order dependence that I have just described.
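If the floating-point-summation/multi-threading note from the FAQ were the whole story, I would expect forcing single-threaded training to make the results reproducible again. A sketch of that check is below; I am assuming n_jobs=1 is the right knob to pin XGBoost to one thread through the sklearn wrapper, and I have not verified that it changes the behavior above.

# Variant of the model construction in train_model, restricted to a single
# thread to probe the multi-threading explanation from the FAQ.
model = XGBRegressor(tree_method=tree_method, random_state=42, n_jobs=1)
model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)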