
Handling skip in rpart and random forest

I have a dataset containing 10 categorical variables. Each of these has missing values coded as (-9, -6, -3, -2, -1). I want to create one column that takes the mean of these 10 variables, excluding the negative values. I could collapse the negative values into NA and then median-impute them, but I need to retain -6, since -6 means that the person skipped the question because it does not apply to them. For instance, parental relationship quality does not apply to single parents. I ultimately want to use this variable as a predictor in my random forest model, so I am not sure how to handle -6 in this case. One way I could think of is to impute each of the 10 variables as follows (let's say the 10 variables are a1 to a10):

missing_categs <- c(-9, -3, -2, -1)

# Replace the missing codes in a1 with the median of its valid (non-negative) values
df$a1[df$a1 %in% missing_categs] <- median(df$a1[df$a1 >= 0])

After the above step, I calculate the average of a1 to a10. The rows that yield "-6" are the ones that pertain to single parents (which means the questions do not apply to them). Then I convert -6 to NA. So now I have average values and NA. Can rpart and random forest models handle NA? Better alternative solutions are most welcome. Thanks in advance!
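The steps described above could be applied to all the columns at once along the following lines. This is only a sketch: the toy data frame, the three columns a1 to a3 standing in for a1 to a10, and the extra `not_applicable` flag are assumptions made for illustration and are not part of the original question.

```r
# Toy data standing in for a1..a10 (only three columns here, values invented)
df <- data.frame(
  a1 = c(1, 2, -9, -6, 3),
  a2 = c(2, -3, 4, -6, 3),
  a3 = c(3, 2, -1, -6, -2)
)
vars <- c("a1", "a2", "a3")
missing_categs <- c(-9, -3, -2, -1)

# Impute the true missing codes with the column median of valid (non-negative) values
for (v in vars) {
  valid_median <- median(df[[v]][df[[v]] >= 0])
  df[[v]][df[[v]] %in% missing_categs] <- valid_median
}

# Hypothetical flag: TRUE when every question was skipped as not applicable (-6)
df$not_applicable <- rowSums(df[vars] == -6) == length(vars)

# Blank out the -6 cells so they do not enter the mean
scores <- df[vars]
scores[scores == -6] <- NA

# Row mean over the remaining values; rows with no values give NaN, recoded to NA
df$a_mean <- rowMeans(scores, na.rm = TRUE)
df$a_mean[is.nan(df$a_mean)] <- NA
```

Keeping a separate flag column is one way to preserve the "does not apply" information after -6 has been turned into NA; whether that suits the model is a judgment call.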

Can rpart and random forest models handle NA?

I do not know what you mean by "handle". If you mean that you can use NA in the predictors, then the answer is yes for rpart:

> library(rpart)
> df <- data.frame(c(1, 2, NA), c(4, 5, 6))
> rpart(df, na.action=na.pass)
n= 3 

node), split, n, deviance, yval
      * denotes terminal node

but no for randomForest:

> library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
> df <- data.frame(c(1, 2, NA), c(4, 5, 6))
> randomForest(df, na.action=na.pass)
Error in randomForest.default(df, na.action = na.pass) : 
  NA not permitted in predictors

If by "handle" you mean that they are able to deal with NA values in some manner, for example when you give them a function for that purpose, then the answer is yes for both.

rpart and randomForest both have an na.action parameter that you can use. See the documentation of rpart and of randomForest for details.

The default na.action for rpart is na.rpart, which deletes "all observations for which y is missing", while "those in which one or more predictors are missing" are kept.
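A small sketch of that default behaviour, assuming the rpart package is installed; the data frame and its column names are invented for illustration:

```r
library(rpart)

# Toy data (invented): row 2 lacks a predictor value, row 3 lacks the response
d <- data.frame(
  y  = c(1, 2, NA, 4, 5, 6),
  x1 = c(1, NA, 3, 4, 5, 6),
  x2 = c(6, 5, 4, 3, 2, 1)
)

# na.action is left at its default, na.rpart
fit <- rpart(y ~ x1 + x2, data = d)

# Number of observations that reached the root node of the fitted tree
fit$frame$n[1]
```

Comparing this count with `nrow(d)` shows how many rows the default action removed before fitting.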

The default na.action for randomForest is na.fail, which returns the given data structure unaltered if no NAs are found and "signals an error" if at least one NA is found.
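What na.fail does can be seen in base R without fitting a model. The sketch below reuses the toy data from the answer, with column names added for readability; na.omit is shown only as one function that might be supplied as na.action instead, not as a recommendation.

```r
# Same toy values as in the answer, with named columns (names are illustrative)
df <- data.frame(x = c(1, 2, NA), y = c(4, 5, 6))

# na.fail signals an error as soon as any NA is present
result <- tryCatch(na.fail(df), error = function(e) "error signalled")

# Without NAs, na.fail hands the data back unchanged
complete <- df[!is.na(df$x), ]
unchanged <- identical(na.fail(complete), complete)

# na.omit drops every row that contains an NA
dropped <- na.omit(df)
nrow(dropped)
```

Dropping rows loses information, so imputing before fitting, as the question proposes, may be preferable when many rows contain an NA.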
