Missing data and stratified k-fold cross validation of a gbm in R
I have a relatively large data set on sales of homes in several markets in the US. For each market, I want to build a Gradient Boosting regression model to predict the sale price. Most of my independent variables (features) have missing values, which should be fine for a `gbm` in R.
The `gbm` algorithm in `caret` requires you to specify values for the hyperparameters (`n.trees`, `shrinkage`, `interaction.depth`, `n.minobsinnode`, etc.). I want to do a grid search in conjunction with cross validation to pick the best set of hyperparameters:
```r
# -------- A function to drop variables that are more than 80% missing or have no variance
Drop_Useless_Vars <- function(datan) {
  n <- nrow(datan)
  n_na <- colSums(is.na(datan))
  n_unique <- apply(datan, MARGIN = 2, function(x) length(na.omit(unique(x))))
  # Logical indexing avoids the edge case where -which() on an empty
  # match would drop every column
  keep <- !(n_na > 0.8 * n | n_unique < 2)
  as.data.frame(datan[, keep, drop = FALSE])
}
```
```r
# -------- load libraries
library(gbm)
library(caret)

# -------- prepare training scheme
control <- trainControl(method = "cv", number = 5)

# -------- design the parameter tuning grid
grid <- expand.grid(n.trees = 10000,
                    interaction.depth = seq(2, 10, 1),
                    n.minobsinnode = c(3, 4, 5),
                    shrinkage = c(0.1, 0.01, 0.001))

# -------- tune the parameters
tuner <- train(log(saleprice) ~ ., data = Drop_Useless_Vars(df), method = "gbm",
               distribution = "gaussian", trControl = control, verbose = FALSE,
               tuneGrid = grid, metric = "RMSE")

# -------- get the best combo
n_trees           <- tuner$bestTune$n.trees
interaction_depth <- tuner$bestTune$interaction.depth
shrinkage         <- tuner$bestTune$shrinkage
n_minobsinnode    <- tuner$bestTune$n.minobsinnode
```
The above code works fine except for some markets where missing values are much more frequent. There I get the error shown below:
```
Error in checkForRemoteErrors(val) :
  4 nodes produced errors; first error: variable 26: assessor_full_baths has only missing values.
```
`assessor_full_baths` is one of the features in my model. So what's happening is that when the algorithm samples the data for cross validation, one or more of the folds end up with variables that are completely missing.
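The failure mode can be reproduced in plain base R: assign rows to random folds and check which columns are entirely `NA` within each fold (the data frame and column names below are made up for illustration):

```r
set.seed(42)

# Toy data: a sparse feature that is observed in only 2 of 20 rows
df_toy <- data.frame(saleprice = runif(20, 1e5, 5e5),
                     sqft      = rnorm(20, 1800, 400),
                     rare_feat = c(rep(NA, 18), 2, 3))

# Random 5-fold assignment, as unstratified CV would do
fold <- sample(rep(1:5, length.out = nrow(df_toy)))

# For each fold, list the columns that are 100% missing in that resample
all_na_by_fold <- lapply(1:5, function(k) {
  resample <- df_toy[fold == k, ]
  names(resample)[colSums(!is.na(resample)) == 0]
})
all_na_by_fold
```

Since the two observed `rare_feat` values can land in at most two folds, at least three of the five folds see that variable as all-missing, which is exactly the situation that trips up `gbm`.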
How can I stratify the sampling scheme used by `caret`? That is, how can I force each fold to have the same distribution with respect to the missing values? Also, do you guys know how to make the `gbm` function ignore variables that are completely missing, without us telling it which ones they are?

I will be grateful for any help you can provide.
I think you need to take a step back and think about how to properly handle the data and fit the model. Your final modeling data set shouldn't have any missing values, let alone so many that features are 100% missing within CV folds(!). Instead, do a bit of data cleaning and feature engineering first.
If you don't want to impute the missing data, then you should use the other strategies mentioned. The missing node is not the same as encoding an NA level: it just means the tree gives the same prediction it would have given prior to the split where data was missing (see R gbm handling of missing values). You should never build a model on a feature with high missingness, and building a model on data with any missing values at all is poor practice; those should be cleaned up when you prepare the data set.
Having said that, you can still impute MNAR data. There are a large number of strategies, going back to Heckman's and Rubin's work in the 70s. A lot of people use `mice()` and the drawn-indicator method.

This may help: http://stefvanbuuren.nl/mi/docs/mnar.pdf
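As a rough sketch of the `mice()` route (assuming the `mice` package is installed; the toy data frame here stands in for one market's data):

```r
library(mice)

# Toy data with scattered missing values in one feature
set.seed(1)
df <- data.frame(saleprice = runif(50, 1e5, 5e5),
                 sqft      = rnorm(50, 1800, 400),
                 baths     = sample(c(1:4, NA), 50, replace = TRUE))

# Multiple imputation by chained equations; "pmm" (predictive mean
# matching) is a common default for numeric features
imp <- mice(df, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# Take one of the m = 5 completed data sets to feed into the model
df_complete <- complete(imp, 1)
```

In a full analysis you would fit the model on each of the `m` completed data sets and pool the results, rather than use a single completed copy.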
You can use `caret::createFolds` to create the CV folds yourself. You simply need to supply a different `y` variable than your outcome (i.e. `saleprice`). You'll want to define a new variable based on your stratification variables... see `?interaction`. This can then be passed to `caret::trainControl`. For example:
```r
library(caret)

y2 <- interaction(df$x1, df$x2)
# returnTrain = TRUE so the list holds training-row indices, which is
# what trainControl's `index` argument expects
cv_folds <- createFolds(y2, k = 5, returnTrain = TRUE)
control <- trainControl(index = cv_folds, method = "cv", number = 5)
...
```
Or you can avoid using `caret` and write your own stratified sampling code.
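A minimal base-R sketch of that manual route, stratifying on a missingness indicator so each fold gets a similar share of the rows where the sparse feature is observed (toy data; the column names are illustrative):

```r
set.seed(7)

# Toy data: `rare_feat` is observed in only about a quarter of the rows
df_toy <- data.frame(saleprice = runif(40, 1e5, 5e5),
                     rare_feat = ifelse(runif(40) < 0.25, rnorm(40), NA))

k <- 5
fold <- integer(nrow(df_toy))

# Assign folds separately within each stratum (missing vs observed),
# so every fold mirrors the overall missingness rate
for (stratum in split(seq_len(nrow(df_toy)), is.na(df_toy$rare_feat))) {
  fold[stratum] <- sample(rep(1:k, length.out = length(stratum)))
}

# Each fold's count of observed rare_feat values is now balanced to within 1
table(fold, observed = !is.na(df_toy$rare_feat))
```

The resulting `fold` vector can be converted into a list of training-row indices and passed to `trainControl(index = ...)` in the same way as the `createFolds` example above.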
That said, I agree with almost everything @Hack-R said about the practical use of features with missing values and about imputation.