
Decision Tree party package prediction error - Levels do not match

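I am building a CART regression tree model in R using the party package, but when I try to apply the model to the testing dataset I get an error message saying the levels do not match.

I have spent the past week reading threads on the forum, but I still can't find the right solution to my problem. So I am reposting the question here with a fake example I made up. Can someone help explain the error message and offer a solution?

My training dataset has about 1000 records and the testing dataset has about 150. There are no NA or blank fields in either dataset.

My CART model using ctree from the party package is: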

mytree<- ctree(Rate~Bank+Product+Salary, data=data_train)

data_train example:

Rate  Bank  Product  Salary    
1.5    A     aaa     100000
0.6    B     abc      60000
3      C     bac      10000
2.1    D     cba      50000
1.1    E     cca      80000

data_test example:

Rate  Bank  Product   Salary
2.0    A     cba       80000
0.5    D     cca      250000
0.8    E     cba      120000
2.1    C     abc       65000

levels(data_train$Bank) : A, B, C, D, E

levels(data_test$Bank): A,D,E,C

I tried to set the levels to be the same using the following code:

>is.factor(data_test$Bank)

 TRUE 
(Made sure Bank and Products are factors in both datasets)
>levels(data_test$Bank) <-union(levels(data_test$Bank), levels(data_train$Bank))

> levels(data_test$product)<-union(levels(data_test$product),levels(data_train$product))

However, when I try to run the prediction on the testing dataset, I get the following error:

> fit1<- predict(mytree,newdata=data_test)

Error in checkData(oldData, RET) : 
  Levels in factors of new data do not match original data

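I have also tried the following, but it alters the values of the field in my testing dataset:

levels(data_test$Bank) <- levels(data_train$Bank)

The data_test table is altered: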

Rate  Bank(altered)  Bank (original)   
2.0    A              A      
0.5    B              D      
0.8    C              E      
2.1    D              C       

You could try rebuilding your factors with comparable levels, instead of assigning new levels to the existing factors. Here is an example:

# start the party
library(party)

# create training data sample
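# stringsAsFactors = TRUE is needed on R >= 4.0, where character columns
# are no longer converted to factors by default and levels() would return NULL
data_train <- data.frame(Rate = c(1.5, 0.6, 3, 2.1, 1.1),
                         Bank = c("A", "B", "C", "D", "E"),
                         Product = c("aaa", "abc", "bac", "cba", "cca"),
                         Salary = c(100000, 60000, 10000, 50000, 80000),
                         stringsAsFactors = TRUE)

# create testing data sample
data_test <- data.frame(Rate = c(2.0, 0.5, 0.8, 2.1),
                         Bank = c("A", "D", "E", "C"),
                         Product = c("cba", "cca", "cba", "abc"),
                         Salary = c(80000, 250000, 120000, 65000),
                         stringsAsFactors = TRUE)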

# get the union of levels between train and test for Bank and Product
bank_levels <- union(levels(data_test$Bank), levels(data_train$Bank))
product_levels <- union(levels(data_test$Product), levels(data_train$Product))

# rebuild Bank with union of levels
data_test$Bank <- with(data_test, factor(Bank, levels = bank_levels)) 
data_train$Bank <- with(data_train, factor(Bank, levels = bank_levels)) 

# rebuild Product with union of levels
data_test$Product <- with(data_test, factor(Product, levels = product_levels)) 
data_train$Product <- with(data_train, factor(Product, levels = product_levels)) 

# fit the model
mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)

# generate predictions
fit1 <- predict(mytree, newdata = data_test)

> fit1
     Rate
[1,] 1.66
[2,] 1.66
[3,] 1.66
[4,] 1.66
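
To see why rebuilding the factor behaves differently from assigning levels, here is a minimal base R sketch. The vectors `train_bank` and `test_bank` are hypothetical stand-ins for the Bank columns in the question, with the test levels in the order the question reports (A, D, E, C). Assigning to `levels()` renames the existing levels position by position, which is what altered the test data, while `factor(x, levels = ...)` keeps each value and only changes the level set.

```r
# Hypothetical toy vectors mirroring the Bank column from the question
train_bank <- factor(c("A", "B", "C", "D", "E"))
test_bank  <- factor(c("A", "D", "E", "C"), levels = c("A", "D", "E", "C"))

# Assigning levels renames them by position:
# 1st level A -> A, 2nd level D -> B, 3rd level E -> C, 4th level C -> D
relabelled <- test_bank
levels(relabelled) <- levels(train_bank)
print(as.character(relabelled))

# Rebuilding the factor keeps the values and adopts the training levels,
# in the training order
rebuilt <- factor(test_bank, levels = levels(train_bank))
print(as.character(rebuilt))
print(levels(rebuilt))
```

The `union()` attempt in the question yields the same set of levels but in a different order (A, D, E, C, B), which is presumably why the level check in `predict` still failed.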

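I am using your example with ctree, but this is basically about using factors carefully, so it applies to any algorithm (RandomForest etc.) that relies strictly on factor levels.

It is all about understanding how R stores and uses factor levels. If we use the same factor levels (and in the same order) that were used in the training data, we can predict with a pre-trained ctree model, and yes, even without combining the training data with the test data.

In fact, there is no need to combine the training and test data in order to predict with the ctree (party) package. When you use a pre-trained model, you may not have the luxury of that much memory and processor power at run time in production. Pre-trained models relieve us of the burden of building models on huge training data in the production environment.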
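
As a rough illustration of that idea, the sketch below aligns a test frame to the training factor levels without binding the two frames together. The helper name `align_factor_levels` and the toy data are assumptions made up for this example, not part of party. Any test value that is not a training level becomes NA.

```r
# Hypothetical helper: rebuild every factor column of `newdata` with the
# levels (and level order) of the matching column in `train`
align_factor_levels <- function(newdata, train) {
  for (col in names(train)) {
    if (is.factor(train[[col]]) && col %in% names(newdata)) {
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(train[[col]]),
                               ordered = is.ordered(train[[col]]))
    }
  }
  newdata
}

# Toy data: Bank is a factor in train and a plain character column in test
train <- data.frame(Bank = factor(c("A", "B", "C", "D", "E")),
                    Salary = c(100000, 60000, 10000, 50000, 80000))
test <- data.frame(Bank = c("A", "D", "E", "C"),
                   Salary = c(80000, 250000, 120000, 65000),
                   stringsAsFactors = FALSE)

aligned <- align_factor_levels(test, train)
print(levels(aligned$Bank))
```

Only the levels of the training columns are needed here, which is why the following steps store them separately from the training data.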

Step 1: While building the model, store the factor levels of each column in the training data (wherever applicable).

var_list <- colnames(dtrain)
for(var in var_list)
{
  if(class(dtrain[,var]) == 'character')
  {
    print(var)

    #Fill blanks with "None" to keep the factor levels consistent
    dtrain[dtrain[,var] == '',var] <- 'None'

    col_name_levels <- unique(dtrain[,var])

    #Make sure you have sorted the column levels     
    col_name_levels <- sort(col_name_levels, decreasing = FALSE)

    #Make as factors
    dtrain[,var] <- factor(dtrain[,var], levels = col_name_levels, ordered=TRUE)

    print(levels(dtrain[,var]))

    #This is the trick: Store the exact levels in a CSV which is much easier to load than the whole train data later in prediction phase    
    write.csv(levels(dtrain[,var]), paste0(getwd(),'/Output CSVs/',var,'_levels.csv'), row.names = FALSE)
  }
}


# also store the column names and data types for detecting later
for(col_name in colnames(dtrain))
{
  abc <- data.frame('col_name' = col_name,'class_colname' = paste(class(dtrain[,col_name]), collapse = ' '))

  if(!exists('col_name_type_list'))
  {
    col_name_type_list <- abc
  }else
  {
    col_name_type_list <- rbind(col_name_type_list, abc)
  }
}

#Store for checking later
write.csv(col_name_type_list, filepath, row.names = FALSE)
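
A possibly more compact variant of the same bookkeeping, assuming a single file is acceptable, keeps the levels of all factor columns in one named list and stores it with `saveRDS`. The frame below is a hypothetical stand-in for `dtrain`, and `tempfile()` is used only to keep the sketch self-contained; a real path would replace it.

```r
# Hypothetical training frame standing in for dtrain
dtrain <- data.frame(Bank = factor(c("A", "B", "C")),
                     Product = factor(c("aaa", "abc", "bac")),
                     Salary = c(100000, 60000, 10000))

# Collect the levels of every factor column into one named list
factor_cols  <- names(dtrain)[vapply(dtrain, is.factor, logical(1))]
train_levels <- lapply(dtrain[factor_cols], levels)

# Save the list and read it back; level order is preserved
levels_path <- tempfile(fileext = ".rds")
saveRDS(train_levels, levels_path)
restored <- readRDS(levels_path)
print(restored)
```

Compared with one CSV per column, this avoids building file names from column names, at the cost of a file that is not human-readable.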

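Then, in the prediction phase (in the production environment), just read the stored levels for each column of the test data, discard the rows that contain new levels (ctree won't be able to predict for them anyway), and use the remaining rows for prediction.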

###############Now in test prediction ###########################


#Read the column list of train data (stored earlier)
col_name_type_list_dtrain <- read.csv( filepath, header = TRUE)


for(i in 1:nrow(col_name_type_list_dtrain))
{
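  # as.character() guards against read.csv returning factors (the default before R 4.0)
  col_name <- as.character(col_name_type_list_dtrain$col_name[i])
  class_colname <- as.character(col_name_type_list_dtrain$class_colname[i])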

  if(class_colname == 'numeric')
  {
    dtest[,col_name] <- as.numeric(dtest[,col_name])
  }

  if(class_colname == 'ordered factor')
  {

    #Now use the column factor levels from train
    #Read the levels file written for this column in step 1 (keyed by col_name, not var)
    col_name_levels <- read.csv( paste0(getwd(),'/Output CSVs/',col_name,'_levels.csv'), header = TRUE)
    factor_check_flag <- TRUE

    col_name_levels <- as.character(col_name_levels$x)
    print(col_name)
    print('Pre-Existing levels detected')
    print(NROW(col_name_levels))

    #Drop new rows which are not in train; the model cant predict for them
    rows_before_dropping <- nrow(dtest)
    print('Adjusting levels to train......')
    dtest <- dtest[dtest[,col_name] %in% col_name_levels,]
    rows_after_dropping <- nrow(dtest)

    cat('\nDropped Rows for adjusting ',col_name,': ',(rows_before_dropping - rows_after_dropping),'\n')

    #Convert to factors
    dtest[,col_name] <- factor(dtest[,col_name], levels=col_name_levels, ordered=TRUE)

    print(dtest[,col_name])
  }
}
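
The row-dropping step in the loop above can be looked at in isolation. This is a minimal sketch with made-up data, where `train_levels` stands for the levels read back from the stored file and "F" is a bank that was never seen during training. It builds an unordered factor; if the model was trained on ordered factors as in step 1, `ordered = TRUE` would be needed here too.

```r
# Levels as they would be read back from the file stored at training time
train_levels <- c("A", "B", "C", "D", "E")

# Hypothetical test data containing one bank ("F") unseen in training
dtest <- data.frame(Bank = c("A", "F", "C", "D"),
                    Salary = c(80000, 250000, 120000, 65000),
                    stringsAsFactors = FALSE)

# Keep only rows whose value is a known training level
known   <- dtest$Bank %in% train_levels
dropped <- sum(!known)
dtest   <- dtest[known, , drop = FALSE]

# Rebuild the column with the training levels, in the training order
dtest$Bank <- factor(dtest$Bank, levels = train_levels)

cat("Dropped rows:", dropped, "\n")
print(levels(dtest$Bank))
```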

