I am trying to run xgboost with scikit-learn, and I am using only pandas to load the data into a DataFrame. How am I supposed to use a pandas DataFrame with xgboost? I am confused by the DMatrix step required to run the xgboost algorithm.
3 Answers
You can use the DataFrame's .values attribute to access the raw data as a NumPy array once you have manipulated the columns as you need them.
For example:
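import pandas as pd
import xgboost as xgb
# Load the training data and separate the target column from the features
train = pd.read_csv("train.csv")
target = train['target']
train = train.drop(['ID', 'target'], axis=1)
# The test set has no target column, so only the ID is dropped
test = pd.read_csv("test.csv")
test = test.drop(['ID'], axis=1)
# Wrap the raw NumPy arrays in DMatrix objects
xgtrain = xgb.DMatrix(train.values, label=target.values)
xgtest = xgb.DMatrix(test.values)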
Obviously you may need to change which columns you drop or use as the training target. The above was for a Kaggle competition, so there was no target data for xgtest (it is held back by the organisers).
Neil Slater
You can now pass pandas DataFrames directly to XGBoost. This definitely works with xgboost 0.81.
For example, where X_train, X_val, y_train, and y_val are DataFrames:
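from math import sqrt
import xgboost as xgb
from sklearn.metrics import mean_squared_error
# Configure the regressor through the scikit-learn style wrapper
mod = xgb.XGBRegressor(
    gamma=1,
    learning_rate=0.01,
    max_depth=3,
    n_estimators=10000,
    subsample=0.8,
    random_state=34
)
# DataFrames are passed directly; no explicit DMatrix conversion is needed
mod.fit(X_train, y_train)
predictions = mod.predict(X_val)
# Root mean squared error on the validation set
rmse = sqrt(mean_squared_error(y_val, predictions))
print("score: {0:,.0f}".format(rmse))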
jeffhale