
How to understand nfold and nrounds in R's package xgboost

I am trying to use R's package xgboost, but there is something I am confused about. In the xgboost manual, under the xgb.cv function, it says:

The original sample is randomly partitioned into nfold equal size subsamples.

Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data.

The cross-validation process is then repeated nrounds times, with each of the nfold subsamples used exactly once as the validation data.

And this is the code in the manual:

library(xgboost)

data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5,
             metrics = list("rmse", "auc"),
             max_depth = 3, eta = 1, objective = "binary:logistic")
print(cv)
print(cv, verbose = TRUE)

And the result is:

##### xgb.cv 5-folds
call:
  xgb.cv(data = dtrain, nrounds = 3, nfold = 5, metrics = list("rmse", 
    "auc"), nthread = 2, max_depth = 3, eta = 1, objective = "binary:logistic")
params (as set within xgb.cv):
  nthread = "2", max_depth = "3", eta = "1", objective = "binary:logistic", 
eval_metric = "rmse", eval_metric = "auc", silent = "1"
callbacks:
  cb.print.evaluation(period = print_every_n, showsd = showsd)
  cb.evaluation.log()
niter: 3
evaluation_log:
 iter train_rmse_mean train_rmse_std train_auc_mean train_auc_std test_rmse_mean test_rmse_std test_auc_mean test_auc_std
1       0.1623756    0.002693092      0.9871108  1.123550e-03      0.1625222   0.009134128     0.9870954 0.0045008818
2       0.0784902    0.002413883      0.9998370  1.317346e-04      0.0791366   0.004566554     0.9997756 0.0003538184
3       0.0464588    0.005172930      0.9998942  7.315846e-05      0.0478028   0.007763252     0.9998902 0.0001347032

Let's say nfold=5 and nrounds=2. That means the data is split into 5 parts of equal size, and the algorithm will generate 2 trees.

My understanding is: each subsample has to be the validation set once. When one subsample is the validation set, 2 trees will be generated. So we will have 5 sets of trees (each set has 2 trees, because nrounds=2). Then we check whether the evaluation metric varies a lot or not.

But the result does not read that way: each nrounds value gets one line of evaluation metrics, which looks like it already includes the 'cross-validation' part. So, if 'the cross-validation process is then repeated nrounds times', how can 'each of the nfold subsamples [be] used exactly once as the validation data'?

Those are the means and standard deviations of the scores of the nfold fit-test procedures run at every round in nrounds. The XGBoost cross-validation process proceeds like this (a code sketch after the list makes the bookkeeping concrete):

  1. The dataset X is split into nfold subsamples, X_1, X_2, ..., X_nfold.
  2. The XGBoost algorithm fits a boosted tree to a training dataset comprising X_1, X_2, ..., X_(nfold-1), while the last subsample (fold) X_nfold is held back as a validation^1 (out-of-sample) dataset. The chosen evaluation metrics (RMSE, AUC, etc.) are calculated for both the training and validation datasets and retained.
  3. One subsample (fold) in the training dataset is now swapped with the validation subsample (fold), so the training dataset now comprises X_1, X_2, ..., X_(nfold-2), X_nfold and the validation (out-of-sample) dataset is X_(nfold-1). Once again, the algorithm fits a boosted tree to the training data, calculates the evaluation scores (for each chosen metric), and so on.
  4. This process repeats nfold times, until every subsample (fold) has served both as part of a training set and as a validation set.
  5. Now another boosted tree is added, and the process outlined in steps 2-4 is repeated. This continues until the total number of boosted trees fitted to the training data equals nrounds.
  6. There are now nfold calculated evaluation scores (times the number of distinct metrics chosen) for each round in nrounds, for both the training sets and the validation sets (scores on the validation sets naturally tend to be worse). The means and standard deviations of the nfold scores are calculated, for both the training and validation sets (times the number of distinct metrics chosen), for each round in nrounds, and returned in a dataframe with nrounds rows.
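To make that bookkeeping concrete, here is a minimal sketch (not xgboost's internal code) that reproduces the round-by-round averaging by hand, assuming the classic R interface where xgb.train accepts a watchlist; the seed and fold assignment are arbitrary choices for illustration:

for (k in 1:nfold) {
  dtr <- xgb.DMatrix(X[fold_id != k, ], label = y[fold_id != k])
  dte <- xgb.DMatrix(X[fold_id == k, ], label = y[fold_id == k])
  bst <- xgb.train(params = list(max_depth = 3, eta = 1,
                                 objective = "binary:logistic",
                                 eval_metric = "rmse"),
                   data = dtr, nrounds = nrounds,
                   watchlist = list(test = dte), verbose = 0)
  # evaluation_log has one row per boosting round for this fold
  scores[, k] <- bst$evaluation_log$test_rmse
}

# One summary row per round, averaged over the nfold held-out folds --
# the same shape as test_rmse_mean / test_rmse_std in xgb.cv's output
data.frame(iter           = 1:nrounds,
           test_rmse_mean = rowMeans(scores),
           test_rmse_std  = apply(scores, 1, sd))

(Setup assumed above: library(xgboost); X <- agaricus.train$data; y <- agaricus.train$label; nfold <- 5; nrounds <- 3; fold_id <- sample(rep(1:nfold, length.out = nrow(X))); scores <- matrix(NA_real_, nrow = nrounds, ncol = nfold).) Note that xgb.cv interleaves the work differently, growing all nfold models round by round rather than fold by fold, but the reported numbers summarise the same nfold-by-nrounds grid of scores.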

^1 Note that what I would call the 'validation' set is identified by XGBoost as the 'test' set in the evaluation log.
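For completeness, the per-round summaries printed in the question can also be pulled from the returned object; cv$evaluation_log holds the same table as a data.table, one row per round:

cv$evaluation_log                  # nrounds rows: means/stds per metric
cv$evaluation_log$test_rmse_mean   # e.g. the 'test' (validation) RMSE means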
