randomForest in R reports missing values in object, but vector has zero NAs in it

I'm trying to use the randomForest package in R, but I've run into a problem where R tells me that there is missing data in the response vector.

> rf_blackcomb_earlyGame <- randomForest(max_cohort ~ ., data=blackcomb_earlyGame[-c(1,2), ])
Error in na.fail.default(list(max_cohort = c(47, 25, 20, 37, 1, 0, 23,  : 
missing values in object

The error itself is clear enough. I've encountered it before, and on those occasions there really were missing data, but this time there are none.

> class(blackcomb_earlyGame$max_cohort)
[1] "numeric"
> which(is.na(blackcomb_earlyGame$max_cohort))
integer(0)

I've tried using na.roughfix to see if that would help, but I get the following error.

Error in na.roughfix.data.frame(list(max_cohort = c(47, 25, 20, 37, 1,  : 
na.roughfix only works for numeric or factor

I've checked every vector to make sure that none of them contain any NAs, and none of them do.

Does anyone have any suggestions?

randomForest can fail because of several different kinds of problems in the data. Missing values (NA), NaN, Inf or -Inf values, and character columns that have not been converted to factors will all cause it to fail, with a variety of error messages.

Below are some examples of the error messages generated by each of these issues:

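library(randomForest)

# stringsAsFactors = TRUE makes b a factor; this was the default before R 4.0.0
my.df <- data.frame(a = 1:26, b = letters, c = (1:26) + rnorm(26), stringsAsFactors = TRUE)
rf <- randomForest(a ~ ., data=my.df)
# this works without issues, because b=letters is stored as a factor variable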

my.df$d <- LETTERS    # Now we add a character column
rf <- randomForest(a ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) : 
#   NA/NaN/Inf in foreign function call (arg 1)
# In addition: Warning message:
#   In data.matrix(x) : NAs introduced by coercion

rf <- randomForest(d ~ ., data=my.df)
# Error in y - ymean : non-numeric argument to binary operator
# In addition: Warning message:
#   In mean.default(y) : argument is not numeric or logical: returning NA

my.df$d <- c(NA, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)   # fails: NA in the predictor d
rf <- randomForest(d ~ ., data=my.df)   # fails as well: NA in the response
# Error in na.fail.default(list(a = 1:26, b = 1:26, c = c(3.14586293058335,  : 
#   missing values in object

my.df$d <- c(Inf, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) : 
#   NA/NaN/Inf in foreign function call (arg 1)

Interestingly, the error message you received, which was caused by having a character type in the data frame (see comments), is the error that I see when there is a numeric column with NA. This suggests either (1) that different versions of randomForest produce different errors, or (2) that the error message depends in more complex ways on the structure of the data. Either way, the advice for anyone receiving errors like these is to check the data for every one of the issues listed above in order to track down the cause.
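To make that check less tedious, something along these lines might help. It is only a rough sketch: check_rf_input and example.df are names I made up for illustration (they are not part of randomForest), and it only looks for the problems listed above.

```r
# Hypothetical helper (not part of randomForest): summarise, per column,
# the problems that commonly make randomForest() fail.
check_rf_input <- function(df) {
  data.frame(
    column       = names(df),
    class        = vapply(df, function(col) class(col)[1], character(1)),
    # is.na() is TRUE for both NA and NaN
    n_na         = vapply(df, function(col) sum(is.na(col)), integer(1)),
    # Inf / -Inf can only occur in numeric columns
    n_infinite   = vapply(df, function(col) {
      if (is.numeric(col)) sum(is.infinite(col)) else 0L
    }, integer(1)),
    is_character = vapply(df, is.character, logical(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data containing each kind of problem
example.df <- data.frame(
  a = 1:5,
  b = c(1.5, NA, 3, NaN, 5),
  c = c(1, 2, Inf, 4, -Inf),
  d = c("v", "w", "x", "y", "z"),
  stringsAsFactors = FALSE
)

report <- check_rf_input(example.df)
print(report)

# One possible fix for character columns: convert them to factors
is_chr <- vapply(example.df, is.character, logical(1))
example.df[is_chr] <- lapply(example.df[is_chr], factor)
```

Any column with a non-zero n_na or n_infinite, or with is_character equal to TRUE, is a candidate for the cause. How to handle NA or Inf values (dropping rows, imputing, and so on) depends on the data, so that part is left out here.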

Perhaps there are Inf or -Inf values? Note that is.na() does not flag them, whereas is.finite() does:

is.na(c(1, NA, Inf, NaN, -Inf))
#[1] FALSE  TRUE FALSE  TRUE FALSE

is.finite(c(1, NA, Inf, NaN, -Inf))
#[1]  TRUE FALSE FALSE FALSE FALSE
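
The same is.finite() check can be applied to a whole data frame to find out where the offending values are. A minimal sketch, assuming a toy data frame (toy.df, non_finite, bad_rows and bad_cols are just illustrative names):

```r
toy.df <- data.frame(
  x = c(1, 2, Inf, 4),
  y = c(0.5, NaN, 2.5, 3.5),
  z = c(10, 20, 30, -Inf)
)

# Restrict the check to numeric columns
num_cols <- vapply(toy.df, is.numeric, logical(1))

# Logical matrix: TRUE where a cell is NA, NaN, Inf or -Inf
non_finite <- !vapply(toy.df[num_cols], is.finite, logical(nrow(toy.df)))

bad_rows <- which(rowSums(non_finite) > 0)          # rows 2, 3 and 4 here
bad_cols <- names(which(colSums(non_finite) > 0))   # columns with at least one bad value

print(bad_rows)
print(bad_cols)
```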
