Strange classification in R and Error in eval(predvars, data, env) : object […] not found error (no typo!)
I am very new to R and have spent hours trying to solve the following problems (which I sense may be interrelated). I have read other answers, which mainly suggest that there may be a typo in the DRB1 column name. That is definitely not the case here, and I am wondering whether the error occurs earlier, since the confusion matrix shows that everything is classified as "-" even though about 20% of the values in "Response" are "+":
Call:
randomForest(formula = Response ~ ., data = training, ntree = 21, importance = TRUE)
Type of random forest: classification
Number of trees: 21
No. of variables tried at each split: 1
OOB estimate of error rate: 20.24%
Confusion matrix:
- + class.error
- 17504 0 0
+ 4443 0 1
I am trying to use the randomForest function to predict the response variable "Response" from the column DRB1:
library(randomForest)
#library(xlsx)
library(xlsx)
# load data
path <- "C:/Users/[...].xlsx" #I have only removed the path for privacy reasons
data <- read.xlsx(path, sheetIndex = 2)
# show data [FINE]
head(data)
# make some data categorical
data$DRB1=as.factor(data$DRB1)
data$Response=as.factor(data$Response)
# to check data structure [FINE]
str(data)
# define 20% testing, 80% training data set etc.
data_set_size=floor(nrow(data)*0.80)
index <- sample(1:nrow(data), size = data_set_size)
# training set up to first set of index
training <- data[index,]
# testing set up to first set of index
testing <- data[-index,]
head(testing,100) #FINE!
# now apply random forest
rf <- randomForest(Response ~ ., data = training, ntree=21, importance=TRUE)
rf
plot(rf)
# testing$DRB1 [FINE]
str(testing)
# predict runs random forest
result <- data.frame(testing$Response, predict(rf, testing[,2], type = "response"))
result
plot(result)
Here is a scaled-down version of the output I get when running the code:
> library(randomForest)
> #library(xlsx)
> library(xlsx)
>
> # load data
> path <- "C:/Users/[...].xlsx"
> data <- read.xlsx(path, sheetIndex = 2)
>
> # show data
> head(data)
Response DRB1
1 - 08_01
2 - 11_01
3 - 07_01
4 - 11_01
5 - 04_04
6 - 07_01
>
> # make some data categorical
>
> data$DRB1=as.factor(data$DRB1)
> data$Response=as.factor(data$Response)
>
> # to check data structure
> str(data)
'data.frame': 27438 obs. of 2 variables:
$ Response: Factor w/ 2 levels "-","+": 1 1 1 1 1 1 1 1 1 1 ...
$ DRB1 : Factor w/ 47 levels "0","01_01","01_02",..: 19 26 18 26 10 18 18 19 31 18 ...
>
> # define 20% testing, 80% training data set etc.
> data_set_size=floor(nrow(data)*0.80)
> index <- sample(1:nrow(data), size = data_set_size)
>
> # training set up to first set of index
> training <- data[index,]
>
> # testing set up to first set of index
> testing <- data[-index,]
>
> head(testing,20)
Response DRB1
2 - 11_01
6 - 07_01
9 - 12_01
13 - 03_01
14 - 12_01
27 - 08_01
28 + 04_02
31 - 04_04
33 - 03_01
39 - 14_54
54 - 01_01
55 - 03_01
60 - 04_02
69 - 03_02
81 - 08_01
83 - 11_01
88 - 04_04
90 - 04_07
104 - 15_03
115 - 11_01
>
> # now apply random forest
> rf <- randomForest(Response ~ ., data = training, ntree=21, importance=TRUE)
> rf
Call:
randomForest(formula = Response ~ ., data = training, ntree = 21, importance = TRUE)
Type of random forest: classification
Number of trees: 21
No. of variables tried at each split: 1
OOB estimate of error rate: 20.24%
Confusion matrix:
- + class.error
- 17504 0 0
+ 4443 0 1
Question 1: Why are there no + predictions?
> plot(rf)
>
> # testing$DRB1
>
> str(testing)
'data.frame': 5488 obs. of 2 variables:
$ Response: Factor w/ 2 levels "-","+": 1 1 1 1 1 1 2 1 1 1 ...
$ DRB1 : Factor w/ 47 levels "0","01_01","01_02",..: 26 18 31 5 31 19 8 10 5 40 ...
Question 2: If DRB1 has some NA data, is NA counted as a category like any other, and does randomForest take it into account as a category or ignore it?
> # predict runs random forest
> result <- data.frame(testing$Response, predict(rf, testing[,2], type = "response"))
Error in eval(predvars, data, env) : object 'DRB1' not found
Question 3: I have been trying hard to understand this error message but cannot. Can you help me?
> result
testing.Response predict.rf..testing...2.3...type....response..
7 - -
11 - -
13 - -
16 - -
24 - -
36 - -
56 - -
58 - -
59 - -
65 - -
72 - -
73 - -
75 - -
77 - -
87 - -
89 - -
95 - -
101 - -
108 - -
111 + -
118 - -
123 - -
124 - -
126 - -
127 - -
130 - -
143 - -
149 - -
154 - -
155 - -
167 - -
169 - -
174 - -
175 - -
177 - -
199 - -
201 - -
202 - -
213 - -
220 - -
229 - -
231 + -
256 - -
259 - -
268 - -
278 - -
281 - -
291 - -
296 + -
297 + -
298 - -
301 + -
302 - -
303 + -
[...]
[ reached 'max' / getOption("max.print") -- omitted 2244 rows ]
> plot(result)
The result plot stays the same as from a previous run...
I am SUPER grateful for any help/answers to even just 1 of the 3 questions.
Thanks a lot!
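On Question 3, one likely cause (an assumption based on the code shown, not something confirmed in the thread): `testing[,2]` selects a single column, and R then drops the data frame and returns a bare factor. `predict()` on a model fitted with a formula looks up the column `DRB1` by name in the new data, and a bare factor has no such column. Because the assignment fails, `result` still holds the value from an earlier run, which would also explain why the plot does not change. A minimal base-R sketch with made-up rows standing in for the real data:

```r
# Toy stand-in for the question's `testing` data frame (values are made up)
testing <- data.frame(
  Response = factor(c("-", "+", "-")),
  DRB1     = factor(c("11_01", "04_02", "07_01"))
)

# Single-column selection drops the data frame and returns a plain factor
as_vector <- testing[, 2]
is.data.frame(as_vector)

# drop = FALSE keeps a one-column data frame that still carries the name DRB1
as_frame <- testing[, 2, drop = FALSE]
is.data.frame(as_frame)
names(as_frame)

# With the fitted model `rf` from the question, either of these calls would
# hand predict() a data frame that contains a DRB1 column (not run here):
# predict(rf, newdata = testing, type = "response")
# predict(rf, newdata = testing[, "DRB1", drop = FALSE], type = "response")
```

Passing the whole `testing` data frame is usually the simplest option, since `predict()` only uses the columns named in the formula.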
You have only one variable, DRB1, for making the prediction, so random forest is most likely overkill.
I would check with simple plots whether there is any association between the categories of DRB1 and your response, for example:
library(ggplot2)
library(dplyr)
data %>% group_by(DRB1) %>%
count(Response) %>%
mutate(prop=n/sum(n)) %>%
ggplot(aes(x=DRB1,y=prop,fill=Response)) + geom_col()
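The same check can be done numerically in base R. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the real spreadsheet) and computes the share of each Response within each DRB1 level. Roughly speaking, with a single categorical predictor a classifier that votes by majority can only predict "+" for levels where "+" is the more common outcome, so if no level passes 50% it is unsurprising that every prediction is "-".

```r
# Toy data standing in for the real spreadsheet (values are made up)
toy <- data.frame(
  Response = factor(c("-", "-", "+", "-", "-", "+", "-", "-"),
                    levels = c("-", "+")),
  DRB1     = factor(c("04_02", "04_02", "04_02", "07_01",
                      "07_01", "11_01", "11_01", "11_01"))
)

# Row-wise proportions: share of "-" and "+" within each DRB1 level
props <- prop.table(table(toy$DRB1, toy$Response), margin = 1)
props

# DRB1 levels in which "+" is the majority outcome
plus_majority <- rownames(props)[props[, "+"] > 0.5]
plus_majority
```

Running the same two steps on the real `data` (replacing `toy`) would show whether any DRB1 level has mostly "+" responses.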
As for NAs: they are not used in the model, but I don't know how applicable this is, given that your data doesn't really need an ML model.
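Either way, NA values in a factor are not a level of their own, so they are never treated as a category automatically. As far as I know, the formula interface of randomForest defaults to `na.action = na.fail`, which stops with an error when NAs are present instead of dropping them silently. If missing values should count as a category, one option is to recode them explicitly before fitting. A minimal base-R sketch, where the vector and the label "missing" are made up for illustration:

```r
# Made-up DRB1 values with two missing entries
drb1 <- factor(c("04_02", NA, "07_01", NA, "11_01"))

levels(drb1)       # NA is not among the levels
sum(is.na(drb1))   # number of missing entries

# Recode NA as an explicit category (the label "missing" is a hypothetical choice)
drb1_chr <- as.character(drb1)
drb1_chr[is.na(drb1_chr)] <- "missing"
drb1_explicit <- factor(drb1_chr)

levels(drb1_explicit)
table(drb1_explicit)
```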