
XGBoost decision tree selection

I have a question regarding which decision tree I should choose from an XGBoost model.

I will use the following code as an example.

#import packages
import xgboost as xgb
import matplotlib.pyplot as plt

# create DMatrix (X is the feature matrix and y the label vector, defined elsewhere)
df_dmatrix = xgb.DMatrix(data = X, label = y)

# set up parameter dictionary ("reg:linear" was renamed "reg:squarederror" in newer XGBoost versions)
params = {"objective":"reg:linear", "max_depth":2}

#train the model
xg_reg = xgb.train(params = params, dtrain = df_dmatrix, num_boost_round = 10)

#plot the tree
xgb.plot_tree(xg_reg, num_trees = n) # my question related to here

I create 10 trees in the xg_reg model, and I can plot any one of them by setting n in the last line to the index of the tree I want.
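For reference, num_trees is a 0-based index, so with num_boost_round = 10 the valid values of n are 0 through 9. A small sketch (assuming the xg_reg model trained above):

# count the boosted trees and plot the first and the last one
n_trees = len(xg_reg.get_dump())               # one text dump per tree, here 10
xgb.plot_tree(xg_reg, num_trees = 0)           # first tree
xgb.plot_tree(xg_reg, num_trees = n_trees - 1) # last tree
plt.show()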

My question is: how can I know which tree best explains the dataset? Is it always the last one? Or should I first decide which features I want in the tree and then choose the tree that contains those features?

My question is: how can I know which tree best explains the dataset?

XGBoost is an implementation of Gradient Boosted Decision Trees (GBDT). Roughly speaking, GBDT builds a sequence of trees in which each new tree is fit to the residuals of the trees before it, so every boosting round improves the previous prediction. In that sense the tree that explains the data best is the last one, i.e. index n - 1 (plot_tree uses 0-based indexing, so with num_boost_round = 10 that is num_trees = 9).
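To see the "each tree improves on the previous ones" point concretely, here is a small sketch (assuming the xg_reg, df_dmatrix and y from the question, and a recent XGBoost where Booster.predict accepts iteration_range; older releases use ntree_limit instead). Scoring the model with only its first k trees should show the training error shrinking as k grows:

# training error when only the first k trees are used
import numpy as np

for k in range(1, 11):
    pred_k = xg_reg.predict(df_dmatrix, iteration_range = (0, k))   # use trees 0 .. k-1
    rmse_k = np.sqrt(np.mean((np.asarray(y) - pred_k) ** 2))
    print(f"trees used: {k:2d}  train RMSE: {rmse_k:.4f}")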

You can read more about GBDT here.

Or should I first decide which features I want in the tree and then choose the tree that contains those features?

All the trees are trained on the same base features; what changes at every boosting iteration is the target, which becomes the residuals of the ensemble built so far. So you cannot pick the best tree by looking at which features it contains. This video gives an intuitive explanation of residuals.
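To illustrate the mechanism, here is a minimal hand-rolled sketch of residual boosting (a simplified squared-error version, not XGBoost's actual implementation, which also adds regularization and a base score). Every tree is fit on the same feature matrix X; only the target changes, from the original y to the residuals left by the trees built so far:

# simplified residual boosting: same features every round, new target = residuals
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees = 10, learning_rate = 0.3, max_depth = 2):
    pred = np.zeros(len(y))                      # start from a zero prediction
    trees = []
    for _ in range(n_trees):
        residual = np.asarray(y) - pred          # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth = max_depth).fit(X, residual)
        pred += learning_rate * tree.predict(X)  # add this tree's correction
        trees.append(tree)
    return trees, pred                           # final prediction is the sum of all corrections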
