Strange classification in R and error in eval(predvars, data, env): object [...] not found (no typo!)

Question

I am very new to R and have spent hours trying to solve the following problems (which I sense may be interrelated). I have read other answers, which mainly suggest that there may be a typo in the DRB1 column. This is definitely not the case here, and I am wondering whether the error occurs earlier on, since the confusion matrix shows that everything is classified as "-" while about 20% of the values in "Response" are "+":

Call:
 randomForest(formula = Response ~ ., data = training, ntree = 21, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 21
No. of variables tried at each split: 1

OOB estimate of  error rate: 20.24%
Confusion matrix:
      - + class.error
- 17504 0           0
+  4443 0           1

I am trying to use the randomForest function to predict the response variable "Response" from the column DRB1:

library(randomForest)
#library(xlsx) 
library(xlsx)

# load data
path <- "C:/Users/[...].xlsx" #I have only removed the path for privacy reasons
data <- read.xlsx(path, sheetIndex = 2)

# show data [FINE]
head(data)

# make some data categorical
data$DRB1=as.factor(data$DRB1)
data$Response=as.factor(data$Response)

# to check data structure [FINE]
str(data)

# define 20% testing, 80% training data set etc. 
data_set_size=floor(nrow(data)*0.80)
index <- sample(1:nrow(data), size = data_set_size)

# training set up to first set of index 
training <- data[index,]

# testing set up to first set of index 
testing <- data[-index,]

head(testing,100)  #FINE!

# now apply random forest
rf <- randomForest(Response ~ ., data = training, ntree=21, importance=TRUE)
rf
plot(rf)

# testing$DRB1 [FINE]
str(testing)

# predict runs random forest
result <- data.frame(testing$Response, predict(rf, testing[,2], type = "response")) 
result
plot(result)

Here is a scaled-down version of the output I get when running the code:

> library(randomForest)
> #library(xlsx) 
> library(xlsx)
> 
> # load data
> path <- "C:/Users/[...].xlsx"
> data <- read.xlsx(path, sheetIndex = 2)
> 
> # show data
> head(data)
  Response  DRB1
1        - 08_01
2        - 11_01
3        - 07_01
4        - 11_01
5        - 04_04
6        - 07_01
> 
> # make some data categorical
> 
> data$DRB1=as.factor(data$DRB1)
> data$Response=as.factor(data$Response)
> 
> # to check data structure
> str(data)
'data.frame':   27438 obs. of  2 variables:
 $ Response: Factor w/ 2 levels "-","+": 1 1 1 1 1 1 1 1 1 1 ...
 $ DRB1    : Factor w/ 47 levels "0","01_01","01_02",..: 19 26 18 26 10 18 18 19 31 18 ...
> 
> # define 20% testing, 80% training data set etc. 
> data_set_size=floor(nrow(data)*0.80)
> index <- sample(1:nrow(data), size = data_set_size)
> 
> # training set up to first set of index 
> training <- data[index,]
> 
> # testing set up to first set of index 
> testing <- data[-index,]
> 
> head(testing,20)
    Response  DRB1
2          - 11_01
6          - 07_01
9          - 12_01
13         - 03_01
14         - 12_01
27         - 08_01
28         + 04_02
31         - 04_04
33         - 03_01
39         - 14_54
54         - 01_01
55         - 03_01
60         - 04_02
69         - 03_02
81         - 08_01
83         - 11_01
88         - 04_04
90         - 04_07
104        - 15_03
115        - 11_01
> 
> # now apply random forest
> rf <- randomForest(Response ~ ., data = training, ntree=21, importance=TRUE)
> rf

Call:
 randomForest(formula = Response ~ ., data = training, ntree = 21,      importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 21
No. of variables tried at each split: 1

        OOB estimate of  error rate: 20.24%
Confusion matrix:
      - + class.error
- 17504 0           0
+  4443 0           1

Question 1: Why are there no + predictions?

> plot(rf)
> 
> # testing$DRB1
> 
> str(testing)
'data.frame':   5488 obs. of  2 variables:
 $ Response: Factor w/ 2 levels "-","+": 1 1 1 1 1 1 2 1 1 1 ...
 $ DRB1    : Factor w/ 47 levels "0","01_01","01_02",..: 26 18 31 5 31 19 8 10 5 40 ...

Question 2: If DRB1 has some NA data, is NA counted as a category like any other, and does randomForest take it into account as a category or ignore it?

> # predict runs random forest
> result <- data.frame(testing$Response, predict(rf, testing[,2], type = "response")) 
Error in eval(predvars, data, env) : object 'DRB1' not found

Question 3: I have been trying so hard to understand this error message but cannot. Can you help me?

> result
     testing.Response predict.rf..testing...2.3...type....response..
7                   -                                              -
11                  -                                              -
13                  -                                              -
16                  -                                              -
24                  -                                              -
36                  -                                              -
56                  -                                              -
58                  -                                              -
59                  -                                              -
65                  -                                              -
72                  -                                              -
73                  -                                              -
75                  -                                              -
77                  -                                              -
87                  -                                              -
89                  -                                              -
95                  -                                              -
101                 -                                              -
108                 -                                              -
111                 +                                              -
118                 -                                              -
123                 -                                              -
124                 -                                              -
126                 -                                              -
127                 -                                              -
130                 -                                              -
143                 -                                              -
149                 -                                              -
154                 -                                              -
155                 -                                              -
167                 -                                              -
169                 -                                              -
174                 -                                              -
175                 -                                              -
177                 -                                              -
199                 -                                              -
201                 -                                              -
202                 -                                              -
213                 -                                              -
220                 -                                              -
229                 -                                              -
231                 +                                              -
256                 -                                              -
259                 -                                              -
268                 -                                              -
278                 -                                              -
281                 -                                              -
291                 -                                              -
296                 +                                              -
297                 +                                              -
298                 -                                              -
301                 +                                              -
302                 -                                              -
303                 +                                              -
[...]
 [ reached 'max' / getOption("max.print") -- omitted 2244 rows ]
> plot(result)

The result plot stays the same as the one from a previous run...

I am SUPER grateful for any help/answers to even just 1 of the 3 questions.

Thanks a lot!

Answer 1

You have only one variable, DRB1, for making the prediction, so random forest is most likely overkill.

I would check with simple plots whether there is any association between the categories of DRB1 and your response, for example:

library(ggplot2)
library(dplyr)
data %>% group_by(DRB1) %>% 
count(Response) %>% 
mutate(prop=n/sum(n)) %>% 
ggplot(aes(x=DRB1,y=prop,fill=Response)) + geom_col()
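
The same check can also be done numerically. What follows is a minimal base-R sketch on made-up data (the `toy` data frame and its values are hypothetical, not the asker's spreadsheet). It tabulates the share of each response within every DRB1 category. If "-" is the majority in every category, a classifier that sees only DRB1 has little reason to ever predict "+", which would be consistent with the confusion matrix in the question.

```r
# Hypothetical toy data standing in for the real spreadsheet
toy <- data.frame(
  Response = factor(c("-", "-", "+", "-", "+", "-", "-", "-"),
                    levels = c("-", "+")),
  DRB1     = factor(c("04_02", "04_02", "04_02",
                      "07_01", "07_01", "07_01",
                      "11_01", "11_01"))
)

# Counts of each response per DRB1 category (rows = DRB1, columns = Response)
counts <- table(toy$DRB1, toy$Response)

# Row-wise proportions: share of "-" and "+" within each DRB1 category
props <- prop.table(counts, margin = 1)
print(round(props, 2))

# Most frequent response in each category
majority <- colnames(props)[apply(props, 1, which.max)]
names(majority) <- rownames(props)
print(majority)
```

On the real data, replace `toy` with your own data frame. Any DRB1 category where "+" exceeds 50% would be a candidate for a "+" prediction; if there is none, an all-"-" result is not surprising.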

As for NAs: they are not used in the model, but I don't know how applicable this is, given that your data doesn't really need an ML model.
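
Whether DRB1 actually contains NAs, and whether NA counts as a category, can be checked directly. Here is a minimal base-R sketch on made-up values (the vector `drb1` is hypothetical). By default NA is not a factor level; `addNA()` turns it into an explicit level if that is what you want.

```r
# Hypothetical DRB1 values with two missing entries
drb1 <- factor(c("01_01", NA, "04_02", "01_01", NA))

# Count the missing values
n_missing <- sum(is.na(drb1))
print(n_missing)

# Frequency table that also reports NA when present
print(table(drb1, useNA = "ifany"))

# By default NA is not one of the factor levels
print(levels(drb1))

# addNA() makes NA an explicit level of the factor
drb1_with_na <- addNA(drb1)
print(levels(drb1_with_na))
```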
