How to understand nfold and nrounds in R's package xgboost
I am trying to use R's package xgboost, but there is something I find confusing. In the xgboost manual, under the xgb.cv function, it says:
The original sample is randomly partitioned into nfold equal size subsamples.
Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data.
The cross-validation process is then repeated nrounds times, with each of the nfold subsamples used exactly once as the validation data.
And this is the code in the manual:
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5,
             metrics = list("rmse", "auc"),
             max_depth = 3, eta = 1, objective = "binary:logistic")
print(cv)
print(cv, verbose=TRUE)
And the result is:
##### xgb.cv 5-folds
call:
xgb.cv(data = dtrain, nrounds = 3, nfold = 5, metrics = list("rmse",
"auc"), nthread = 2, max_depth = 3, eta = 1, objective = "binary:logistic")
params (as set within xgb.cv):
nthread = "2", max_depth = "3", eta = "1", objective = "binary:logistic",
eval_metric = "rmse", eval_metric = "auc", silent = "1"
callbacks:
cb.print.evaluation(period = print_every_n, showsd = showsd)
cb.evaluation.log()
niter: 3
evaluation_log:
iter train_rmse_mean train_rmse_std train_auc_mean train_auc_std test_rmse_mean test_rmse_std test_auc_mean test_auc_std
1 0.1623756 0.002693092 0.9871108 1.123550e-03 0.1625222 0.009134128 0.9870954 0.0045008818
2 0.0784902 0.002413883 0.9998370 1.317346e-04 0.0791366 0.004566554 0.9997756 0.0003538184
3 0.0464588 0.005172930 0.9998942 7.315846e-05 0.0478028 0.007763252 0.9998902 0.0001347032
Let's say nfold=5 and nrounds=2. That means the data is split into 5 parts of equal size, and the algorithm will generate 2 trees.

My understanding is: each subsample has to serve as the validation set once. While one subsample is the validation set, 2 trees are generated. So we would end up with 5 sets of trees (each set has 2 trees, because nrounds=2), and then we check whether the evaluation metric varies a lot across the folds.
But the result does not read that way. Each nrounds value gets exactly one line of the evaluation metric, which looks like it already includes the 'cross validation' part. So, if 'the cross-validation process is then repeated nrounds times', then how come 'each of the nfold subsamples is used exactly once as the validation data'?
Those are the means and standard deviations of the scores of the nfold fit-test procedures run at every round in nrounds.1 The XGBoost cross-validation process proceeds like this:

1. The data is split into nfold subsamples, and one booster is set up per subsample, each trained on the remaining nfold - 1 subsamples.
2. At every boosting round, each of those nfold boosters adds one more tree and is then scored on its training folds and on its held-out subsample.
3. The evaluation log reports, for every round, the mean and standard deviation of those nfold scores.

So each of the nfold subsamples is indeed used exactly once as the validation data, but that happens within every round; nrounds only controls how many boosting iterations each fold's model gets, which is why there is exactly one aggregated line per round.

1 Note that what I would call the 'validation' set is identified by XGBoost as the 'test' set in the evaluation log.
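For concreteness, here is a minimal hand-rolled sketch of the same idea, written with xgb.train instead of xgb.cv's internal code. The plain random fold assignment, the variable names, and the rmse-only log are my own illustration (assuming the same agaricus data as above), so the numbers will not match xgb.cv exactly:

library(xgboost)
data(agaricus.train, package = 'xgboost')
X <- agaricus.train$data
y <- agaricus.train$label

set.seed(1)
nfold   <- 5
nrounds <- 3
fold_id <- sample(rep(1:nfold, length.out = nrow(X)))  # illustrative random fold split

# One booster per fold: each is trained on the other nfold - 1 folds and
# evaluated on its held-out fold after every boosting round.
per_fold_rmse <- sapply(1:nfold, function(k) {
  dtr <- xgb.DMatrix(X[fold_id != k, ], label = y[fold_id != k])
  dva <- xgb.DMatrix(X[fold_id == k, ], label = y[fold_id == k])
  bst <- xgb.train(params = list(max_depth = 3, eta = 1,
                                 objective = "binary:logistic",
                                 eval_metric = "rmse"),
                   data = dtr, nrounds = nrounds,
                   watchlist = list(test = dva), verbose = 0)
  bst$evaluation_log$test_rmse          # one held-out rmse per round
})

# Each line xgb.cv prints for round i is the mean (and sd) of the nfold
# held-out scores at round i, hence one line per round.
data.frame(iter           = 1:nrounds,
           test_rmse_mean = rowMeans(per_fold_rmse),
           test_rmse_std  = apply(per_fold_rmse, 1, sd))

The shape of the result is the same as the evaluation_log above: one aggregated row per boosting round, with every fold used exactly once as the held-out set within each round.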