Missing data and stratified k-fold cross validation of a gbm in R
I have a relatively large data set on sales of homes in several markets in the US. For each market, I want to build a Gradient Boosting regression model to predict the sale price. Most of my independent variables (features) have missing values, which should be fine for a `gbm` in R.
The `gbm` algorithm in `caret` requires you to specify values for the hyperparameters (`n.trees`, `shrinkage`, `interaction.depth`, `n.minobsinnode`, etc.). I want to do a grid search in conjunction with cross validation to pick the best set of hyperparameters:
```r
# -------- A function to drop variables that are more than 80% missing or have no variance
Drop_Useless_Vars <- function(datan) {
  n <- nrow(datan)
  n_na <- colSums(is.na(datan))
  n_unique <- apply(datan, MARGIN = 2, function(x) length(na.omit(unique(x))))
  # Logical indexing avoids the edge case where -which() on an empty
  # match would drop every column
  keep <- !(n_na > 0.8 * n | n_unique < 2)
  as.data.frame(datan[, keep, drop = FALSE])
}
```
```r
# -------- load libraries
library(gbm)
library(caret)

# -------- prepare training scheme
control <- trainControl(method = "cv", number = 5)

# -------- design the parameter tuning grid
grid <- expand.grid(n.trees = 10000,
                    interaction.depth = seq(2, 10, 1),
                    n.minobsinnode = c(3, 4, 5),
                    shrinkage = c(0.1, 0.01, 0.001))

# -------- tune the parameters
tuner <- train(log(saleprice) ~ ., data = Drop_Useless_Vars(df), method = "gbm",
               distribution = "gaussian", trControl = control, verbose = FALSE,
               tuneGrid = grid, metric = "RMSE")

# -------- get the best combo
n_trees           <- tuner$bestTune$n.trees
interaction_depth <- tuner$bestTune$interaction.depth
shrinkage         <- tuner$bestTune$shrinkage
n_minobsinnode    <- tuner$bestTune$n.minobsinnode
```
The above code works fine except for some markets where missing values are much more frequent. There I get the error shown below:
```
Error in checkForRemoteErrors(val) :
  4 nodes produced errors; first error: variable 26: assessor_full_baths has only missing values.
```
`assessor_full_baths` is one of the features in my model. So what's happening is that when the algorithm samples the data for cross validation, one or more of the folds end up with variables that are completely missing.
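The failure mode can be reproduced in plain base R: assign rows to random folds and check which columns are entirely `NA` within each fold (the data frame and column names below are made up for illustration):

```r
set.seed(42)

# Toy data: a sparse feature that is observed in only 2 of 20 rows
df_toy <- data.frame(saleprice = runif(20, 1e5, 5e5),
                     sqft      = rnorm(20, 1800, 400),
                     rare_feat = c(rep(NA, 18), 2, 3))

# Random 5-fold assignment, as unstratified CV would do
fold <- sample(rep(1:5, length.out = nrow(df_toy)))

# For each fold, list the columns that are 100% missing in that resample
all_na_by_fold <- lapply(1:5, function(k) {
  resample <- df_toy[fold == k, ]
  names(resample)[colSums(!is.na(resample)) == 0]
})
all_na_by_fold
```

Since the two observed `rare_feat` values can land in at most two folds, at least three of the five folds see that variable as all-missing, which is exactly the situation that trips up `gbm`.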
How can I stratify the sampling scheme used by `caret`? That is, how can I force each fold to have the same distribution with respect to the missing values? Also, do you guys know how to make the `gbm` function ignore variables that are completely missing, without us telling it which ones they are?

I will be grateful for any help you can provide.
I think you need to take a step back and think about how to properly handle the data and fit the model. Your final modeling data set shouldn't have any missing values, let alone so many that features are 100% missing within CV folds(!). Instead, do a bit of data cleaning and feature engineering first.
If you don't want to impute the missing data, then you should use the other strategies mentioned. The missing node is not the same as encoding an NA level: it just means the tree gives the same prediction it would have given prior to the split where data was missing (see R gbm handling of missing values). You should never build a model on a feature with high missingness, and building a model on data with any missing values at all is poor practice; those should be cleaned up when you prepare the data set.
Having said that, you can still impute MNAR data. There are a large number of strategies, going back to Heckman's and Rubin's work in the 70s. A lot of people use `mice()` and the drawn-indicator method.

This may help: http://stefvanbuuren.nl/mi/docs/mnar.pdf
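As a rough sketch of the `mice()` route (assuming the `mice` package is installed; the toy data frame here stands in for one market's data):

```r
library(mice)

# Toy data with scattered missing values in one feature
set.seed(1)
df <- data.frame(saleprice = runif(50, 1e5, 5e5),
                 sqft      = rnorm(50, 1800, 400),
                 baths     = sample(c(1:4, NA), 50, replace = TRUE))

# Multiple imputation by chained equations; "pmm" (predictive mean
# matching) is a common default for numeric features
imp <- mice(df, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# Take one of the m = 5 completed data sets to feed into the model
df_complete <- complete(imp, 1)
```

In a full analysis you would fit the model on each of the `m` completed data sets and pool the results, rather than use a single completed copy.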
You can use `caret::createFolds` to create the CV folds yourself. You simply need to supply a different `y` variable than your outcome (i.e. `saleprice`). You'll want to define a new variable based on your stratification variables... see `?interaction`. This can then be passed to `caret::trainControl`. For example:
```r
library(caret)

y2 <- interaction(df$x1, df$x2)
# returnTrain = TRUE so the list holds training-row indices, which is
# what trainControl's `index` argument expects
cv_folds <- createFolds(y2, k = 5, returnTrain = TRUE)
control <- trainControl(index = cv_folds, method = "cv", number = 5)
...
```
Or you can avoid using `caret` and write your own stratified sampling code.
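A minimal base-R sketch of that manual route, stratifying on a missingness indicator so each fold gets a similar share of the rows where the sparse feature is observed (toy data; the column names are illustrative):

```r
set.seed(7)

# Toy data: `rare_feat` is observed in only about a quarter of the rows
df_toy <- data.frame(saleprice = runif(40, 1e5, 5e5),
                     rare_feat = ifelse(runif(40) < 0.25, rnorm(40), NA))

k <- 5
fold <- integer(nrow(df_toy))

# Assign folds separately within each stratum (missing vs observed),
# so every fold mirrors the overall missingness rate
for (stratum in split(seq_len(nrow(df_toy)), is.na(df_toy$rare_feat))) {
  fold[stratum] <- sample(rep(1:k, length.out = length(stratum)))
}

# Each fold's count of observed rare_feat values is now balanced to within 1
table(fold, observed = !is.na(df_toy$rare_feat))
```

The resulting `fold` vector can be converted into a list of training-row indices and passed to `trainControl(index = ...)` in the same way as the `createFolds` example above.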
That said, I agree with almost everything @Hack-R said about the practical use of features with missing values and about imputation.