
Random Forest in R: New factor levels not present in the training data

OK, so another newbie question related to the Titanic competition:

I am trying to run a Random Forest prediction against my test data. All my feature engineering has been done on the combined test and training data.

I have now split the combined data back into testdata and trainingdata.

I have the following code:

library(randomForest)

trainingdata <- droplevels(data.combined[1:891, ])
testdata <- droplevels(data.combined[892:1309, ])

fitRF <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp +
                        Parch + Fare + Embarked +
                        new.title + family.size + FamilyID2,
                      data = trainingdata,
                      importance = TRUE,
                      ntree = 2000)

varImpPlot(fitRF)

# All works up to this point

Prediction <- predict(fitRF, testdata)
# The line above generates the error

# Use the test rows' IDs so both columns have 418 rows
submit <- data.frame(PassengerID = testdata$PassengerId, Survived = Prediction)
write.csv(submit, file = "14072017_1_RF", row.names = FALSE)

When I run the Prediction line I get the following error:

> Prediction <- predict(fitRF, testdata)
Error in predict.randomForest(fitRF, testdata) : 
  New factor levels not present in the training data

When I run str(testdata) and str(trainingdata) I can see 2 factors whose levels no longer match:

trainingdata
$ Parch            : Factor w/ 7 levels
$ FamilyID2        : Factor w/ 22 levels

testdata
$ Parch            : Factor w/ 8 levels
$ FamilyID2        : Factor w/ 18 levels

Are these differences causing my error? And if so, how do I resolve it?

Many thanks

Additional information: I have removed Parch and FamilyID2 from the randomForest formula, and the code now works, so it is definitely the mismatched levels of those 2 variables that are causing the issue.

Fellow newbie here, I have just been toying around with Titanic these days. I don't think it makes sense to have the Parch variable as a factor, so making it numeric may solve the problem:

# as.character() first, in case Parch is already a factor
train$Parch <- as.numeric(as.character(train$Parch))
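
One caveat with this conversion: if Parch is already stored as a factor (as it is in the question), calling as.numeric() directly returns the internal level codes rather than the original counts, so going through as.character() first is the safer route. A small sketch with made-up values (not the real Titanic data) shows the difference:

```r
# Made-up values for illustration; not the real Titanic data
parch <- factor(c("0", "2", "9", "0"))

# Direct conversion gives the internal level codes (position in levels())
codes <- as.numeric(parch)

# Converting via character recovers the original values
values <- as.numeric(as.character(parch))

print(codes)   # level codes
print(values)  # original values
```

The same conversion would need to be applied to the test data as well, so that both sets have Parch as the same type.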

The underlying issue is that the test data has 2 observations with a Parch value of 9, which does not occur in the training data:

> table(train$Parch)

  0   1   2   3   4   5   6 
678 118  80   5   4   5   1 

> table(test$Parch)

  0   1   2   3   4   5   6   9 
324  52  33   3   2   1   1   2 

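The same comparison can be run over every factor column at once, which would also catch FamilyID2 from the question. This is only a sketch on toy data frames; the names train and test and the values in them are made up for illustration:

```r
# Toy stand-ins for the train and test sets (made-up values)
train <- data.frame(Parch = factor(c(0, 1, 2, 0)),
                    Sex   = factor(c("male", "female", "male", "female")))
test  <- data.frame(Parch = factor(c(0, 9, 1)),
                    Sex   = factor(c("female", "male", "male")))

# For each factor column, list the levels found in test but not in train
factor_cols <- names(train)[sapply(train, is.factor)]
unseen <- lapply(factor_cols, function(col) {
  setdiff(levels(test[[col]]), levels(train[[col]]))
})
names(unseen) <- factor_cols

# Keep only the columns that actually have unseen levels
unseen <- unseen[lengths(unseen) > 0]
print(unseen)
```

Any column that shows up in the result is one that predict() is likely to complain about.
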
Alternatively, if you need the variable to be a factor, you could just add the missing level to it:

train$Parch <- as.factor(train$Parch)  # in my data, Parch is type int
levels(train$Parch)                    # 7 levels: "0" to "6"
levels(train$Parch) <- c(levels(train$Parch), "9")
levels(train$Parch)                    # now Parch has 8 levels
table(train$Parch)                     # level 9 is empty
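
In the question's setup, the mismatch appears to come from calling droplevels() on each half separately: data.combined already carries every level, and plain row subsetting keeps them. A rough sketch of both options follows, using a toy data.combined with made-up values; align_levels is a hypothetical helper written for this example, not part of any package:

```r
# Toy stand-in for the combined data set (made-up values)
data.combined <- data.frame(
  Parch     = factor(c(0, 1, 2, 0, 9, 1)),
  FamilyID2 = factor(c("Small", "Smith", "Small", "Brown", "Small", "Small"))
)

# Option 1: split WITHOUT droplevels(), so both halves keep the full level set
trainingdata <- data.combined[1:4, ]
testdata     <- data.combined[5:6, ]

# Option 2: if levels were already dropped, rebuild each factor column
# using the union of the levels found in both sets
align_levels <- function(a, b) {  # hypothetical helper
  for (col in names(a)[sapply(a, is.factor)]) {
    all_levels <- union(levels(a[[col]]), levels(b[[col]]))
    a[[col]] <- factor(a[[col]], levels = all_levels)
    b[[col]] <- factor(b[[col]], levels = all_levels)
  }
  list(a = a, b = b)
}

aligned <- align_levels(droplevels(trainingdata), droplevels(testdata))

# Both checks should report matching levels
print(identical(levels(trainingdata$Parch), levels(testdata$Parch)))
print(identical(levels(aligned$a$Parch), levels(aligned$b$Parch)))
```

Either way, the forest has never seen a training row with the extra level, so predictions for those test rows should be treated with some caution.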
