randomForest does not work when training set has more different factor levels than test set
When I try to test my trained model on new test data that has fewer factor levels than my training data, predict() returns the following:
Type of predictors in new data do not match that of the training data.
My training data has a variable with 7 factor levels, and my test data has that same variable with 6 factor levels (all 6 ARE in the training data).
When I add an observation containing the "missing" 7th level to the test data, the model runs, so I'm not sure why this happens or what the logic behind it is.
I could understand randomForest choking if the test set had more or different factor levels, but why does it fail when the training set is the one with "more" levels?
R expects both the training and the test data to have exactly the same levels (even if one of the sets has no observations for a given level or levels). In your case, since the test dataset is missing a level that the training set has, you can do
test$val <- factor(test$val, levels=levels(train$val))
to make sure it has all the same levels and that they are coded the same way.
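For anyone who wants to see the effect without fitting a model, here is a minimal base-R sketch of the releveling step. The toy `train` and `test` data frames and the level names "a" through "g" are made up for illustration; only the column name `val` comes from the line above.

```r
# Toy data (hypothetical): `val` has 7 levels in train, but only 6 occur in test.
train <- data.frame(val = factor(c("a", "b", "c", "d", "e", "f", "g")))
test  <- data.frame(val = factor(c("a", "b", "c", "d", "e", "f")))

nlevels(train$val)  # 7
nlevels(test$val)   # 6, so the two factors are not declared the same way

# Re-declare the test factor using the training levels, in the same order.
test$val <- factor(test$val, levels = levels(train$val))

identical(levels(test$val), levels(train$val))  # TRUE
table(test$val)  # "g" is now a level of test$val with a count of 0
```

Note that this only works cleanly in the direction described in the question: any value in `test$val` that is not among the training levels would become NA after this call.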
(reposted here to close out the question)