randomForest does not work when the training set has more factor levels than the test set

When I try to test my trained model on new test data that has fewer factor levels than my training data, predict() returns the following:

Type of predictors in new data do not match that of the training data.

My training data has a variable with 7 factor levels, and my test data has that same variable with 6 factor levels (all 6 ARE in the training data).

When I add an observation containing the "missing" 7th factor level to the test data, predict() runs, so I'm not sure why this happens or what the logic behind it is.

I could understand randomForest choking if the test set had more or different factor levels, but why does it fail when the training set is the one with "more" levels?

predict() expects the factor variables in the training and test data to have exactly the same levels (even if one of the sets has no observations for a given level or levels). In your case, since the test dataset is missing a level that the training set has, you can do

test$val <- factor(test$val, levels=levels(train$val))

to make sure it has all the same levels and that they are coded the same way.
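To see what "coded the same way" means, here is a minimal base-R sketch. The data frames `train` and `test` and the column `val` follow the names used above, but the values are made up for illustration, and no model is fitted. It only shows that re-creating the factor with the training levels keeps the labels and changes the underlying integer codes.

```r
# Hypothetical toy data; only the names train, test and val come from the answer.
train <- data.frame(val = factor(c("a", "b", "c", "d", "e", "f", "g")))
test  <- data.frame(val = factor(c("a", "c", "f")))

# The two factors disagree on their level sets before the fix.
levels_match_before <- identical(levels(test$val), levels(train$val))

# Remember the labels so we can confirm they survive the re-leveling.
labels_before <- as.character(test$val)

# The fix from the answer: rebuild the factor using the training levels.
test$val <- factor(test$val, levels = levels(train$val))

# Same level set now, same labels as before, but the integer codes
# refer to positions in the training levels.
levels_match_after <- identical(levels(test$val), levels(train$val))
labels_unchanged   <- identical(as.character(test$val), labels_before)
codes_after        <- as.integer(test$val)
```

If the test data contained a value that is not among the training levels, this call would turn it into NA, so it may be worth checking `anyNA(test$val)` afterwards.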

(reposted here to close out the question)
