R: factor as new level when I predict with test data

I am getting an error on my own dataset; the code posted below reproduces the same logic. I have tried increasing the amount of training data, but that did not solve it. I have already excluded all NA values.

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor y has new levels L, X

set.seed(234)
d <- data.frame(w = abs(rnorm(50) * 1000),
                x = rnorm(50),
                y = sample(LETTERS[1:26], 50, replace = TRUE),
                stringsAsFactors = TRUE)  # keep y a factor on R >= 4.0.0



train_idx <- sample(1:nrow(d), floor(0.8*nrow(d)))
train <- d[train_idx,]
test  <- d[-train_idx,]



fit <- lm(w ~ x + y, data = train)
predict(fit, test)

As @jdobres has already explained why this error occurs, I'll jump straight to the solution:

Try adding the following line just before the predict() call:

#add all levels of 'y' in 'test' dataset to fit$xlevels[["y"]] in the fit object
fit$xlevels[["y"]] <- union(fit$xlevels[["y"]], levels(test[["y"]]))

Hope this resolves your problem!
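
One caveat worth keeping in mind: extending xlevels only stops the error. The model still has no coefficient for the levels it never saw, so predictions for those rows should be treated with caution. If you would rather get NA for such rows, a minimal sketch (the toy train/test data and the name test_masked are made up for illustration) is to blank out the unseen values before predicting:

```r
# Toy data, invented for this example: level "C" appears only in test.
train <- data.frame(
  w = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x = c(0.1, 0.4, 0.2, 0.9, 0.5, 0.7),
  y = factor(c("A", "B", "A", "B", "A", "B"))
)
test <- data.frame(
  w = c(3.0, 4.0),
  x = c(0.3, 0.6),
  y = factor(c("A", "C"))
)

fit <- lm(w ~ x + y, data = train)

# Mark values of y that the fitted model has no level for.
test_masked <- test
unseen <- !(as.character(test_masked$y) %in% fit$xlevels[["y"]])
test_masked$y[unseen] <- NA

# predict.lm() uses na.action = na.pass by default,
# so the masked row comes back as NA instead of raising an error.
pred <- predict(fit, test_masked)
print(pred)
```

This leaves the fit object untouched, at the cost of producing no prediction for the rows with unseen levels.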

Factor and character data are treated as categorical variables. As such, models cannot form predictions for category labels they've never seen before. If you built a model to predict things about "poodle" and "pit bull", the model would fail if you gave it "golden retriever".

More specific to your example, the error is telling you that labels "L" and "X", which are in your test set, do not appear in your training set. Since they weren't in the training set, the model doesn't know what to do when it encounters them in the test set.
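
To check this on your own data, you can compare the labels in the two sets directly. A small sketch with made-up train/test frames (the data and the name unseen are only for illustration):

```r
# Toy data, invented for this example: "C" occurs in test but not in train.
train <- data.frame(
  w = c(1, 2, 3, 4),
  x = c(0.1, 0.2, 0.3, 0.4),
  y = factor(c("A", "B", "A", "B"))
)
test <- data.frame(
  w = c(5, 6),
  x = c(0.5, 0.6),
  y = factor(c("A", "C"))
)

# Labels present in test but absent from train.
# as.character() makes this work for factor and character columns alike.
unseen <- setdiff(unique(as.character(test$y)),
                  unique(as.character(train$y)))
print(unseen)
```

Any label listed in unseen is one that predict() will complain about.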

Thanks Prem, and if you have many variables you can loop the line of code like this:

for (k in vars) {
  if (is.factor(shop_data[[k]])) {
    ols_fit$xlevels[[k]] <- union(ols_fit$xlevels[[k]], levels(shop_data[[k]]))
  }
}

Here vars holds the names of the variables used in the model, and shop_data is the main dataset that was split into train and test.
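
A variation on the same idea, as a rough sketch: instead of keeping a separate vars vector, you can loop over names(fit$xlevels), which already lists the categorical predictors of the fitted model. The helper name extend_xlevels and the toy data are my own, not from the answers above. It uses as.character() rather than levels(), because levels() returns NULL for character columns, which are what data.frame() produces by default since R 4.0.0.

```r
# Hypothetical helper: add every label found in newdata to the
# matching entry of fit$xlevels and return the modified fit.
extend_xlevels <- function(fit, newdata) {
  for (k in names(fit$xlevels)) {
    new_values <- unique(as.character(newdata[[k]]))
    new_values <- new_values[!is.na(new_values)]
    fit$xlevels[[k]] <- union(fit$xlevels[[k]], new_values)
  }
  fit
}

# Toy data, invented for this example: "C" occurs only in test.
train <- data.frame(
  w = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x = c(0.1, 0.4, 0.2, 0.9, 0.5, 0.7),
  y = factor(c("A", "B", "A", "B", "A", "B"))
)
test <- data.frame(
  w = c(3.0, 4.0),
  x = c(0.3, 0.6),
  y = c("A", "C"),
  stringsAsFactors = FALSE  # character column, as on R >= 4.0.0
)

fit <- lm(w ~ x + y, data = train)
fit_ext <- extend_xlevels(fit, test)
print(fit_ext$xlevels)

# With the extended levels, predict() no longer stops on the new label;
# it may still warn, since no coefficient exists for "C".
pred <- predict(fit_ext, test)
```

The original fit is left unchanged, so you can still compare it with fit_ext afterwards.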
