Random Forest in R: New factor levels not present in the training data
OK, so another newbie question related to the Titanic Competition:
I am trying to run a Random Forest prediction against my test data. All my work has been done on the combined test and training data.
I have now split the two into testdata and trainingdata.
I have the following code:
library(randomForest)
trainingdata <- droplevels(data.combined[1:891, ])
testdata <- droplevels(data.combined[892:1309, ])
fitRF <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp +
                        Parch + Fare + Embarked +
                        new.title + family.size + FamilyID2,
                      data = trainingdata,
                      importance = TRUE,
                      ntree = 2000)
varImpPlot(fitRF)
# All works up to this point
Prediction <- predict(fitRF, testdata)
# The line above generates the error
# Take the IDs from testdata so they line up with the 418 predictions
submit <- data.frame(PassengerID = testdata$PassengerId, Survived = Prediction)
write.csv(submit, file = "14072017_1_RF", row.names = FALSE)
When I run the Prediction line I get the following error:
> Prediction <- predict(fitRF, testdata)
Error in predict.randomForest(fitRF, testdata) :
New factor levels not present in the training data
When I run str(testdata) and str(trainingdata), I can see 2 factors whose levels no longer match:
trainingdata
 $ Parch     : Factor w/ 7 levels
testdata
 $ Parch     : Factor w/ 8 levels
trainingdata
 $ FamilyID2 : Factor w/ 22 levels
testdata
 $ FamilyID2 : Factor w/ 18 levels
Are these differences causing my error? And if so, how do I resolve this?
Many thanks
Additional information: I have removed Parch and FamilyID2 from the randomForest call and the code now works, so it is definitely those 2 variables, with their mismatched levels, that are causing the issue.
Fellow newbie here; I have just been toying around with the Titanic data myself. I don't think it makes sense to treat Parch as a factor, so converting it to numeric in both data sets before fitting may solve the problem:
train$Parch <- as.numeric(as.character(train$Parch)) # as.character() first, in case Parch is already a factor
test$Parch <- as.numeric(as.character(test$Parch))
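One caveat, in case Parch is already a factor in your data (as it is in the question): as.numeric() applied directly to a factor returns the internal level codes rather than the original values, so it is safer to go through as.character() first. A minimal sketch with made-up values (parch here is a toy vector, not the Titanic column):

```r
# Toy factor standing in for Parch; the values are made up for illustration.
parch <- factor(c("0", "2", "9", "2"))

# Direct conversion gives the position of each value in levels(parch).
codes <- as.numeric(parch)

# Converting via character recovers the numbers that were actually stored.
values <- as.numeric(as.character(parch))

print(codes)
print(values)
```

If Parch is already an integer column, as it is in my copy of the data, the extra as.character() step is harmless.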
The underlying problem is that the test data contains 2 observations with a Parch value of 9, which never occurs in the training data:
> table(train$Parch)
  0   1   2   3   4   5   6
678 118  80   5   4   5   1
> table(test$Parch)
  0   1   2   3   4   5   6   9
324  52  33   3   2   1   1   2
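To check every factor column at once rather than reading through str() by eye, something along these lines might help. This is only a sketch: unseen_levels is a hypothetical helper name of my own, and the two data frames are toy stand-ins rather than the Titanic data.

```r
# Hypothetical helper: for every factor column shared by both data frames,
# list the levels that occur in `new` but not in `ref`.
unseen_levels <- function(ref, new) {
  shared <- intersect(names(ref), names(new))
  is_fac <- vapply(shared,
                   function(n) is.factor(ref[[n]]) && is.factor(new[[n]]),
                   logical(1))
  out <- lapply(shared[is_fac],
                function(n) setdiff(levels(new[[n]]), levels(ref[[n]])))
  names(out) <- shared[is_fac]
  Filter(length, out)  # keep only the columns with at least one unseen level
}

# Toy data (made up), mimicking a Parch value that exists only in the test set.
train <- data.frame(Parch = factor(c(0, 1, 2)), Sex = factor(c("m", "f", "m")))
test  <- data.frame(Parch = factor(c(0, 9)),    Sex = factor(c("f", "m")))

unseen <- unseen_levels(train, test)
print(unseen)
```

Note that a factor can have fewer levels in the test data and still contain a level the training data lacks, so comparing level counts alone (as with FamilyID2 in the question) is not enough.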
Alternatively, if you need the variable to be a factor, you could add the missing level to the training data before fitting the model:
train$Parch <- as.factor(train$Parch) # in my data, Parch is type int
train$Parch
levels(train$Parch) <- c(levels(train$Parch), "9")
train$Parch # now Parch has 8 levels
table(train$Parch) # level 9 is empty
test$Parch <- factor(test$Parch, levels = levels(train$Parch)) # give the test data the same levels
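The same idea can be applied to every factor column, which would also cover FamilyID2. In the question the levels diverge because droplevels() is called on each piece separately, so one option is simply to split without it; another is to re-level both pieces to the union of their levels. A rough sketch on toy data (combined is a made-up stand-in for data.combined, and the row ranges are illustrative only). I would expect predict() to get past the level check after this, but I have not run it against the real data:

```r
# Toy stand-in for data.combined (made-up values): rows 1-4 play the role
# of the training rows, rows 5-6 the test rows.
combined <- data.frame(
  Parch     = factor(c(0, 1, 2, 1, 0, 9)),
  FamilyID2 = factor(c("Small", "Smith_3", "Small", "Brown_4", "Small", "Small"))
)

# Option 1: split WITHOUT droplevels(); both pieces keep the full level set.
train <- combined[1:4, ]
test  <- combined[5:6, ]

# Option 2: if the pieces were already split and dropped, re-level both
# to the union of their levels, column by column.
train_d <- droplevels(train)
test_d  <- droplevels(test)
for (col in names(train_d)) {
  if (is.factor(train_d[[col]])) {
    all_levels <- union(levels(train_d[[col]]), levels(test_d[[col]]))
    train_d[[col]] <- factor(train_d[[col]], levels = all_levels)
    test_d[[col]]  <- factor(test_d[[col]],  levels = all_levels)
  }
}

print(levels(train_d$Parch))
print(levels(test_d$Parch))
```

Either way, the forest never sees a training row with Parch equal to 9, so the predictions for those test rows are not informed by that level.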