Pandas Dataframe to DMatrix

Question

I am trying to run xgboost with scikit-learn, and I am only using pandas to load the data into a DataFrame. How am I supposed to use a pandas DataFrame with xgboost? I am confused by the DMatrix step required to run the xgboost algorithm.

score 28 · Answer 1 · answered Jul 15 '16 at 14:12

You can use the DataFrame's .values attribute to access the raw data as a NumPy array once you have arranged the columns as you need them.

E.g.

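import pandas as pd
import xgboost as xgb

# Load the training data and separate the label column from the features.
train = pd.read_csv("train.csv")
target = train['target']
train = train.drop(['ID', 'target'], axis=1)

# The test set has no target column, so only the ID is dropped.
test = pd.read_csv("test.csv")
test = test.drop(['ID'], axis=1)

# Build DMatrix objects from the underlying NumPy arrays.
xgtrain = xgb.DMatrix(train.values, label=target.values)
xgtest = xgb.DMatrix(test.values)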

Obviously you may need to change which columns you drop or use as the training target. The above was for a Kaggle competition, so there was no target data for xgtest (it is held back by the organisers).

score 12 · Answer 2 · answered Jan 11 '19 at 01:02

You can now use pandas DataFrames directly with XGBoost. This definitely works with xgboost 0.81.

For example, where X_train, X_val, y_train, and y_val are DataFrames:

from math import sqrt

import xgboost as xgb
from sklearn.metrics import mean_squared_error

mod = xgb.XGBRegressor(
    gamma=1,
    learning_rate=0.01,
    max_depth=3,
    n_estimators=10000,
    subsample=0.8,
    random_state=34
)

# DataFrames are passed straight to fit and predict; no DMatrix is needed.
mod.fit(X_train, y_train)
predictions = mod.predict(X_val)
rmse = sqrt(mean_squared_error(y_val, predictions))
print("score: {0:,.0f}".format(rmse))

score 8 · Answer 3 · edited Mar 11 '21 at 19:14

Some good news: there is a library, pandas_ml, which supports XGBoost. It will probably streamline the workflow.

answered Apr 15 '17 at 01:56 by user4959 · edited by Ethan
