
Decision Tree party package prediction error - Levels do not match

I am building a CART regression tree model in R using the party package, but I get an error message saying the levels do not match when I try to apply the model to the testing dataset.

I have spent the past week reading through the threads on the forum, but still couldn't find the right solution to my problem. So I am reposting this question here using fake examples I made up. Can someone help explain the error message and provide a solution?

My training dataset has about 1000 records and the testing dataset has about 150. There are no NA or blank fields in either dataset.

My CART model using ctree from the party package is:

mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)

data_train example:

Rate  Bank  Product  Salary    
1.5    A     aaa     100000
0.6    B     abc      60000
3      C     bac      10000
2.1    D     cba      50000
1.1    E     cca      80000

data_test example:

Rate  Bank  Product   Salary
2.0    A     cba       80000
0.5    D     cca      250000
0.8    E     cba      120000
2.1    C     abc       65000

levels(data_train$Bank) : A, B, C, D, E

levels(data_test$Bank): A,D,E,C

I tried to set the same levels using the following code:

> is.factor(data_test$Bank)
[1] TRUE
(Made sure Bank and Product are factors in both datasets)
> levels(data_test$Bank) <- union(levels(data_test$Bank), levels(data_train$Bank))
> levels(data_test$Product) <- union(levels(data_test$Product), levels(data_train$Product))

However, when I try to run prediction on the testing dataset, I get the following error:

> fit1<- predict(mytree,newdata=data_test)

Error in checkData(oldData, RET) : 
  Levels in factors of new data do not match original data

I have also tried the following method, but it alters the values in my testing dataset:

levels(data_test$Bank) <- levels(data_train$Bank)

The data_test table is altered:

Rate  Bank(altered)  Bank (original)   
2.0    A              A      
0.5    B              D      
0.8    C              E      
2.1    D              C       

You might try rebuilding your factors using comparable levels instead of assigning new levels to existing factors. Here's an example:

# start the party
library(party)

# create training data sample
# (stringsAsFactors = TRUE makes Bank and Product factors; since R 4.0.0
#  the default is FALSE and levels() would return NULL on character columns)
data_train <- data.frame(Rate = c(1.5, 0.6, 3, 2.1, 1.1),
                         Bank = c("A", "B", "C", "D", "E"),
                         Product = c("aaa", "abc", "bac", "cba", "cca"),
                         Salary = c(100000, 60000, 10000, 50000, 80000),
                         stringsAsFactors = TRUE)

# create testing data sample
data_test <- data.frame(Rate = c(2.0, 0.5, 0.8, 2.1),
                        Bank = c("A", "D", "E", "C"),
                        Product = c("cba", "cca", "cba", "abc"),
                        Salary = c(80000, 250000, 120000, 65000),
                        stringsAsFactors = TRUE)

# get the union of levels between train and test for Bank and Product
bank_levels <- union(levels(data_test$Bank), levels(data_train$Bank))
product_levels <- union(levels(data_test$Product), levels(data_train$Product))

# rebuild Bank with union of levels
data_test$Bank <- with(data_test, factor(Bank, levels = bank_levels)) 
data_train$Bank <- with(data_train, factor(Bank, levels = bank_levels)) 

# rebuild Product with union of levels
data_test$Product <- with(data_test, factor(Product, levels = product_levels)) 
data_train$Product <- with(data_train, factor(Product, levels = product_levels)) 

# fit the model
mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)

# generate predictions
fit1 <- predict(mytree, newdata = data_test)

> fit1
     Rate
[1,] 1.66
[2,] 1.66
[3,] 1.66
[4,] 1.66
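
The difference between assigning levels and rebuilding a factor can be seen in a small base-R sketch. The vector names below are made up for illustration, and the test factor is assumed to have its levels in the order A, D, E, C as reported in the question. Assigning with `levels<-` only renames the existing levels by position, while `factor()` re-maps the values against the new level set.

```r
# Test-set factor with levels in the order reported in the question
bank_test <- factor(c("A", "D", "E", "C"), levels = c("A", "D", "E", "C"))
train_levels <- c("A", "B", "C", "D", "E")

# Assigning levels renames by position: A->A, D->B, E->C, C->D
relabelled <- bank_test
levels(relabelled) <- train_levels
as.character(relabelled)   # the values are silently changed

# Rebuilding re-maps the values against the new level set
rebuilt <- factor(as.character(bank_test), levels = train_levels)
as.character(rebuilt)      # the values are preserved
levels(rebuilt)            # all five training levels, in training order
```

The relabelled vector reproduces the "altered" column from the question, which is why the rebuild approach is the safer of the two.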

I am using ctree as the example, but this is basically about using factors carefully, so it applies to any algorithm (randomForest etc.) that relies strictly on factor levels.

This is all about understanding how R stores and uses factor levels. If we use the same factor levels (AND IN THE SAME ORDER) as in the train data (yes, even without combining it with the test data), we can do the prediction using pre-trained ctree models.
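
As a rough illustration of why the order matters, a factor is stored as integer codes plus a `levels` attribute, so the same labels with a different level order produce different codes. The vectors below are invented for the example.

```r
x <- c("low", "high", "mid")

f1 <- factor(x, levels = c("low", "mid", "high"))
f2 <- factor(x, levels = c("high", "low", "mid"))

# Same labels ...
identical(as.character(f1), as.character(f2))

# ... but different underlying integer codes
as.integer(f1)
as.integer(f2)

# A simple check that could be run on each factor column before predict()
same_levels <- identical(levels(f1), levels(f2))
same_levels
```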

Actually there is no need to combine the train and test data to predict with ctree (party package). You may not have the luxury of that much memory and processor power at run time in production when you are using pre-trained models. Pre-trained models relieve us of the burden of building models on huge training data in the production environment.

Step 1: While building the model, store the factor levels for each column of the train data (wherever applicable).

var_list <- colnames(dtrain)
for(var in var_list)
{
  if(class(dtrain[,var]) == 'character')
  {
    print(var)

    #Fill blanks with "None" to keep the factor levels consistent
    dtrain[dtrain[,var] == '',var] <- 'None'

    col_name_levels <- unique(dtrain[,var])

    #Make sure you have sorted the column levels     
    col_name_levels <- sort(col_name_levels, decreasing = FALSE)

    #Make as factors
    dtrain[,var] <- factor(dtrain[,var], levels = col_name_levels, ordered=TRUE)

    print(levels(dtrain[,var]))

    #This is the trick: Store the exact levels in a CSV which is much easier to load than the whole train data later in prediction phase    
    write.csv(levels(dtrain[,var]), paste0(getwd(),'/Output CSVs/',var,'_levels.csv'), row.names = FALSE)
  }
}


# also store the column names and data types for detecting later
for(col_name in colnames(dtrain))
{
  abc <- data.frame('col_name' = col_name,'class_colname' = paste(class(dtrain[,col_name]), collapse = ' '))

  if(!exists('col_name_type_list'))
  {
    col_name_type_list <- abc
  }else
  {
    col_name_type_list <- rbind(col_name_type_list, abc)
  }
}

#Store for checking later
write.csv(col_name_type_list, filepath, row.names = FALSE)
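
As an alternative to one CSV per column, the levels could also be kept in a single named list and serialised with `saveRDS()`. This is only a sketch under assumptions: the toy `dtrain`, the `train_levels` name, and the temporary file path are placeholders, and it stores plain (unordered) factor levels.

```r
# Toy training data; stringsAsFactors = TRUE so Bank is a factor
dtrain <- data.frame(Bank = c("B", "A", "C"),
                     Rate = c(1.5, 0.6, 3),
                     stringsAsFactors = TRUE)

# Named list: one character vector of levels per factor column
train_levels <- lapply(Filter(is.factor, dtrain), levels)

# Placeholder path; in practice this would sit next to the saved model
levels_file <- tempfile(fileext = ".rds")
saveRDS(train_levels, levels_file)

# Later, in the prediction phase
restored <- readRDS(levels_file)
identical(restored, train_levels)
```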

And then in the prediction phase (in the production environment), just read those levels for each column in the test data, discard the rows that have values not seen in training (ctree will not be able to predict for them anyway), and then use the remaining rows for prediction.

###############Now in test prediction ###########################


#Read the column list of train data (stored earlier)
col_name_type_list_dtrain <- read.csv( filepath, header = TRUE)


for(i in 1:nrow(col_name_type_list_dtrain))
{
  #as.character() guards against read.csv returning factors (R < 4.0)
  col_name <- as.character(col_name_type_list_dtrain[i,]$col_name)
  class_colname <- as.character(col_name_type_list_dtrain[i,]$class_colname)

  if(class_colname == 'numeric')
  {
    dtest[,col_name] <- as.numeric(dtest[,col_name])
  }

  if(class_colname == 'ordered factor')
  {

    #Now use the column factor levels from train
    #Read the levels file that was written for this column (col_name)
    col_name_levels <- read.csv( paste0(getwd(),'/Output CSVs/',col_name,'_levels.csv'), header = TRUE)
    factor_check_flag <- TRUE

    #write.csv stored the levels in a column named 'x'
    col_name_levels <- as.character(col_name_levels$x)
    print(col_name)
    print('Pre-Existing levels detected')
    print(NROW(col_name_levels))

    #Drop new rows which are not in train; the model cant predict for them
    rows_before_dropping <- nrow(dtest)
    print('Adjusting levels to train......')
    dtest <- dtest[dtest[,col_name] %in% col_name_levels,]
    rows_after_dropping <- nrow(dtest)

    cat('\nDropped Rows for adjusting ',col_name,': ',(rows_before_dropping - rows_after_dropping),'\n')

    #Convert to factors
    dtest[,col_name] <- factor(dtest[,col_name], levels=col_name_levels, ordered=TRUE)

    print(dtest[,col_name])
  }
}
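
The prediction-phase loop above can also be condensed into one function. The following is a minimal sketch, not the answer's exact method: `align_to_train_levels` and `train_levels` are hypothetical names, the levels are assumed to be available as a named list rather than CSV files, and plain (unordered) factors are built, so add `ordered = TRUE` if the model was trained on ordered factors.

```r
# Hypothetical helper: align factor columns of `newdata` to stored training levels.
# `train_levels` is a named list: column name -> character vector of levels.
align_to_train_levels <- function(newdata, train_levels) {
  for (col in names(train_levels)) {
    values <- as.character(newdata[[col]])
    keep <- values %in% train_levels[[col]]
    # Rows with values unseen in training are dropped
    newdata <- newdata[keep, , drop = FALSE]
    # Rebuild the factor with exactly the training levels, in training order
    newdata[[col]] <- factor(values[keep], levels = train_levels[[col]])
  }
  newdata
}

# Toy data for illustration only
dtrain <- data.frame(Bank = c("A", "B", "C"), Salary = c(1, 2, 3),
                     stringsAsFactors = TRUE)
dtest <- data.frame(Bank = c("C", "Z", "A"), Salary = c(4, 5, 6),
                    stringsAsFactors = FALSE)

# Collect the levels of every factor column in the training data
train_levels <- lapply(Filter(is.factor, dtrain), levels)

aligned <- align_to_train_levels(dtest, train_levels)
aligned
```

Here the row with Bank "Z" is removed because the level never appeared in training, and the remaining Bank values keep the training level set.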


 