
How can I use a machine learning model to predict on data whose features differ slightly?

I have a randomForest model trained on a bunch of NLP data (tf-idf values for each word). I want to use it to predict on a new dataset. The features in the model overlap with but don't quite match the features in the new data, such that when I predict on the new data I get:

Error in predict.randomForest(object = model, newdata = new_data) : 
  variables in the training data missing in newdata

I thought I could get around this error by excluding all the features from the model which do not appear in the new data, and all the features in the new data which do not appear in the model. Putting aside for the moment the impact on model accuracy (this would significantly pare down the number of features, but there would still be plenty to predict with), I did something like this:

model$forest$xlevels <- model$forest$xlevels[colnames(new_data)]
# and vice versa
new_data <- new_data[names(model$forest$xlevels)]

This worked, insofar as names(model$forest$xlevels) == colnames(new_data) returned TRUE for each feature name.

However, when I try to predict on the resulting new_data I still get the "variables in the training data missing in newdata" error. I am fairly certain that I'm amending the correct part of the model (model$forest$xlevels), so why isn't it working?

I think you should go the other way around: add the missing columns to the new data.

When you are working with bags of words, it is common for some words to be absent from a given batch of new data. Each of these missing words should just be encoded as a column of zeros.

# do something like this (also exclude the target variable, obviously)
names_missing <- names(traindata)[!names(traindata) %in% names(new_data)]
new_data[,names_missing] <- 0L

And then you should be able to predict.
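To make the idea concrete, here is a minimal sketch in base R. The helper name align_features and the toy column names are made up for illustration, and it assumes your features are numeric tf-idf columns in a data.frame. Besides zero-filling the missing words, it also drops words the model never saw and puts the columns in the training order, which may not be strictly required but keeps things tidy.

```r
# Hypothetical helper (not part of the randomForest package): make `new_data`
# carry exactly the feature columns listed in `feature_names`, in that order.
align_features <- function(new_data, feature_names) {
  # Words seen in training but absent from this batch get a tf-idf of 0.
  missing <- setdiff(feature_names, names(new_data))
  for (nm in missing) {
    new_data[[nm]] <- rep(0, nrow(new_data))
  }
  # Drop words the model never saw and match the training column order.
  new_data[, feature_names, drop = FALSE]
}

# Toy data standing in for the tf-idf tables; the column names are invented.
traindata <- data.frame(apple = c(0.1, 0), banana = c(0, 0.3),
                        cherry = c(0.2, 0.2), label = factor(c("a", "b")))
new_data <- data.frame(banana = c(0.5, 0), durian = c(0.4, 0.1))

# Exclude the target variable ("label" here) from the feature list.
feature_names <- setdiff(names(traindata), "label")
aligned <- align_features(new_data, feature_names)
print(aligned)

# With a fitted model, prediction would then look like:
# predict(model, newdata = aligned)
```

In this toy case "apple" and "cherry" are added as zero columns and "durian" is dropped, so aligned has the same feature columns as traindata.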


 