How to understand nfold and nrounds in the R package xgboost

Question

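I am trying to use the R package xgboost, but one thing confuses me. In the xgboost manual, under the xgb.cv function, it says:

The original sample is randomly partitioned into nfold equal size subsamples.

Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data.

The cross-validation process is then repeated nrounds times, with each of the nfold subsamples used exactly once as the validation data.

Here is the code from the manual: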

library(xgboost)
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5,
             metrics = list("rmse", "auc"),
             max_depth = 3, eta = 1, objective = "binary:logistic")
print(cv)
print(cv, verbose = TRUE)

The result is:

##### xgb.cv 5-folds
call:
  xgb.cv(data = dtrain, nrounds = 3, nfold = 5, metrics = list("rmse", 
    "auc"), nthread = 2, max_depth = 3, eta = 1, objective = "binary:logistic")
params (as set within xgb.cv):
  nthread = "2", max_depth = "3", eta = "1", objective = "binary:logistic", 
eval_metric = "rmse", eval_metric = "auc", silent = "1"
callbacks:
  cb.print.evaluation(period = print_every_n, showsd = showsd)
  cb.evaluation.log()
niter: 3
evaluation_log:
 iter train_rmse_mean train_rmse_std train_auc_mean train_auc_std test_rmse_mean test_rmse_std test_auc_mean test_auc_std
1       0.1623756    0.002693092      0.9871108  1.123550e-03      0.1625222   0.009134128     0.9870954 0.0045008818
2       0.0784902    0.002413883      0.9998370  1.317346e-04      0.0791366   0.004566554     0.9997756 0.0003538184
3       0.0464588    0.005172930      0.9998942  7.315846e-05      0.0478028   0.007763252     0.9998902 0.0001347032

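Suppose nfold = 5 and nrounds = 2. This means the data is split into 5 parts of equal size, and the algorithm will generate 2 trees.

My understanding is this: each subsample has to be used for validation once. When one subsample is held out for validation, 2 trees are generated. So we end up with 5 sets of trees (each set has 2 trees, because nrounds = 2). Then we check whether the evaluation metric varies a lot across the sets.

But the result does not look like that. There is one row of evaluation metrics per nrounds value, and each row looks like it already includes the "cross-validation" part. So if "the cross-validation process is then repeated nrounds times", how can it be that "each of the nfold subsamples [is] used exactly once as the validation data"?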

Answer 1

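Those are the means and standard deviations of the scores of the nfold fit-and-test procedures run at every round in nrounds. The XGBoost cross-validation process proceeds like this:

1. The dataset X is split into nfold subsamples, X_1, X_2, ..., X_nfold.
2. The XGBoost algorithm fits a boosted tree to a training dataset comprising X_1, X_2, ..., X_(nfold-1), while the last subsample (fold), X_nfold, is held back as a validation¹ (out-of-sample) dataset. The chosen evaluation metrics (RMSE, AUC, etc.) are calculated for both the training and the validation dataset and retained.
3. One subsample (fold) in the training dataset is now swapped with the validation subsample (fold), so the training dataset now comprises X_1, X_2, ..., X_(nfold-2), X_nfold and the validation (out-of-sample) dataset is X_(nfold-1). Once again, the algorithm fits a boosted tree to the training data, calculates the evaluation scores (for each chosen metric), and so on.
4. This process repeats nfold times, until every subsample (fold) has served both as part of the training set and as the validation set.
5. Now another boosted tree is added and the process outlined in steps 2-4 is repeated. This continues until the total number of boosted trees fitted to the training data equals nrounds.
6. There are now nfold calculated evaluation scores (times the number of distinct metrics chosen) for each round in nrounds, for both the training sets and the validation sets (scores on the validation sets naturally tend to be worse). The mean and standard deviation of the nfold scores are calculated for both the training and the validation sets (times the number of distinct metrics chosen) for each round in nrounds, and returned in a data frame with nrounds rows.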
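
The fold rotation described above can be sketched in base R. This is only an illustration of the bookkeeping, not the code xgb.cv runs internally; the names n_obs, fold_id, folds and times_validated are made up for the example, and the random assignment shown is just one simple way to build equal-size folds.

```r
set.seed(42)                       # reproducible shuffling
n_obs <- 20                        # hypothetical number of rows
nfold <- 5

# Assign every row to exactly one of nfold equal-size folds, at random.
fold_id <- sample(rep(seq_len(nfold), length.out = n_obs))
folds <- split(seq_len(n_obs), fold_id)   # list of validation row indices

# Count how often each row lands in a validation set across the nfold passes.
times_validated <- integer(n_obs)
for (k in seq_len(nfold)) {
  valid_idx <- folds[[k]]                           # the held-out fold
  train_idx <- setdiff(seq_len(n_obs), valid_idx)   # the other nfold - 1 folds
  times_validated[valid_idx] <- times_validated[valid_idx] + 1L
  cat(sprintf("pass %d: %d training rows, %d validation rows\n",
              k, length(train_idx), length(valid_idx)))
}

# Every row has been used for validation exactly once.
stopifnot(all(times_validated == 1L))
```

The point is that "used exactly once as validation data" is a statement about the nfold passes; nrounds only controls how many trees each of those passes has grown when it is scored.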

¹ Note that what I call the "validation" set is labelled the "test" set by XGBoost in the evaluation log.
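
To see where each row of the evaluation log comes from, the procedure can be reproduced by hand. The sketch below makes several assumptions: it uses the classic xgboost R interface (roughly versions 0.6 to 1.7, where xgb.train takes a `watchlist` and the fitted object exposes `$evaluation_log`; newer releases may name these differently), it tracks only RMSE, and the names fold_id, folds, test_rmse and manual_log are made up for the example. It also trains each fold's model for all nrounds in one go rather than interleaving the folds round by round; the per-round scores are the same because the fold models do not interact.

```r
library(xgboost)
library(Matrix)   # sparse-matrix row subsetting

data(agaricus.train, package = "xgboost")
X <- agaricus.train$data
y <- agaricus.train$label

nfold <- 5
nrounds <- 3
params <- list(objective = "binary:logistic", eval_metric = "rmse",
               max_depth = 3, eta = 1, nthread = 2)

# Random, (nearly) equal-size folds; each element holds validation row indices.
set.seed(1)
fold_id <- sample(rep(seq_len(nfold), length.out = nrow(X)))
folds <- split(seq_len(nrow(X)), fold_id)

# One row per boosting round, one column per fold.
test_rmse <- matrix(NA_real_, nrow = nrounds, ncol = nfold)
for (k in seq_len(nfold)) {
  valid_idx <- folds[[k]]
  dtrain_k <- xgb.DMatrix(X[-valid_idx, ], label = y[-valid_idx])
  dvalid_k <- xgb.DMatrix(X[valid_idx, ], label = y[valid_idx])
  fit <- xgb.train(params = params, data = dtrain_k, nrounds = nrounds,
                   watchlist = list(train = dtrain_k, test = dvalid_k),
                   verbose = 0)
  # Score of the model after 1, 2, ..., nrounds trees on the held-out fold.
  test_rmse[, k] <- fit$evaluation_log$test_rmse
}

# Collapse the nfold scores of each round into a mean and a spread.
manual_log <- data.frame(
  iter = seq_len(nrounds),
  test_rmse_mean = rowMeans(test_rmse),
  test_rmse_std = apply(test_rmse, 1, sd)   # sample sd (n - 1 denominator)
)
print(manual_log)

# For comparison: xgb.cv on the same predefined folds.
cv <- xgb.cv(params = params, data = xgb.DMatrix(X, label = y),
             nrounds = nrounds, nfold = nfold, folds = folds, verbose = FALSE)
print(cv$evaluation_log)
```

The means in manual_log should be close to the test_rmse_mean column from xgb.cv. The standard deviations may differ slightly, since sd() divides by n - 1 and xgb.cv's reported std is not guaranteed to use the same convention.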

Solution 1
5 Accepted 2019-02-07 04:38:56
