Error when I try to predict class probabilities in R - caret

Question

I built a model with caret. After training finished, I got the following warning:

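Warning message: In train.default(x, y, weights = w, ...) : At least one of the class levels is not a valid R variable name; this may cause errors if class probabilities are generated because the variable names will be converted to: X0, X1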

The variables in the training set are:

      str(train)
'data.frame':   7395 obs. of  30 variables:
 $ alchemy_category              : Factor w/ 13 levels "arts_entertainment",..: 2 8 6 6 11 6 1 6 3 8 ...
 $ alchemy_category_score        : num  3737 2052 4801 3816 3179 ...
 $ avglinksize                   : num  2.06 3.68 2.38 1.54 2.68 ...
 $ commonlinkratio_1             : num  0.676 0.508 0.562 0.4 0.5 ...
 $ commonlinkratio_2             : num  0.206 0.289 0.322 0.1 0.222 ...
 $ commonlinkratio_3             : num  0.0471 0.2139 0.1202 0.0167 0.1235 ...
 $ commonlinkratio_4             : num  0.0235 0.1444 0.0426 0 0.0432 ...
 $ compression_ratio             : num  0.444 0.469 0.525 0.481 0.446 ...
 $ embed_ratio                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ frameTagRatio                 : num  0.0908 0.0987 0.0724 0.0959 0.0249 ...
 $ hasDomainLink                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ html_ratio                    : num  0.246 0.203 0.226 0.266 0.229 ...
 $ image_ratio                   : num  0.00388 0.08865 0.12054 0.03534 0.05047 ...
 $ is_news                       : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 2 1 2 1 ...
 $ lengthyLinkDomain             : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 1 2 ...
 $ linkwordscore                 : num  24 40 55 24 14 12 21 5 17 14 ...
 $ news_front_page               : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ non_markup_alphanum_characters: num  5424 4973 2240 2737 12032 ...
 $ numberOfLinks                 : num  170 187 258 120 162 55 93 132 194 326 ...
 $ numwords_in_url               : num  8 9 11 5 10 3 3 4 7 4 ...
 $ parametrizedLinkRatio         : num  0.1529 0.1818 0.1667 0.0417 0.0988 ...
 $ spelling_errors_ratio         : num  0.0791 0.1254 0.0576 0.1009 0.0826 ...
 $ label                         : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
 $ isVideo                       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
 $ isFashion                     : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 2 1 ...
 $ isFood                        : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ hasComments                   : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 2 2 1 2 ...
 $ hasGoogleAnalytics            : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 2 2 1 ...
 $ hasInlineCSS                  : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
 $ noOfMetaTags                  : num  10 12 6 10 13 2 6 6 9 5 ...

My code is as follows:

ctrl <- trainControl(method = "CV",
                     number=10,
                     classProbs = TRUE,
                     allowParallel = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(476)
rfFit <- train(formula,
               data=train,
               method = "rf",
               tuneGrid = expand.grid(.mtry = seq(4,20,by=2)),
               ntrees=1000,
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl)


pred <- predict.train(rfFit, newdata = test, type = "prob")

I get the error: Error in `[.data.frame`(out, , obsLevels, drop = FALSE) : undefined columns selected

The variables in the test data set are:

str(test)
'data.frame':   3171 obs. of  29 variables:
 $ alchemy_category              : Factor w/ 13 levels "arts_entertainment",..: 8 4 12 4 10 12 12 8 1 2 ...
 $ alchemy_category_score        : num  5307 4825 1 6708 5416 ...
 $ avglinksize                   : num  2.56 3.77 2.27 2.52 1.85 ...
 $ commonlinkratio_1             : num  0.39 0.462 0.496 0.706 0.471 ...
 $ commonlinkratio_2             : num  0.257 0.205 0.385 0.346 0.161 ...
 $ commonlinkratio_3             : num  0.0441 0.0513 0.1709 0.123 0.0323 ...
 $ commonlinkratio_4             : num  0.0221 0 0.1709 0.0906 0 ...
 $ compression_ratio             : num  0.49 0.782 1.25 0.449 0.454 ...
 $ embed_ratio                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ frameTagRatio                 : num  0.0671 0.0429 0.0588 0.0581 0.093 ...
 $ hasDomainLink                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ html_ratio                    : num  0.23 0.366 0.162 0.147 0.244 ...
 $ image_ratio                   : num  0.19944 0.08 10 0.00596 0.03571 ...
 $ is_news                       : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 2 1 1 ...
 $ lengthyLinkDomain             : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
 $ linkwordscore                 : num  15 62 42 41 34 35 15 22 41 7 ...
 $ news_front_page               : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ non_markup_alphanum_characters: num  5643 382 2420 5559 2209 ...
 $ numberOfLinks                 : num  136 39 117 309 155 266 55 145 110 1 ...
 $ numwords_in_url               : num  3 2 1 10 10 7 1 9 5 0 ...
 $ parametrizedLinkRatio         : num  0.2426 0.1282 0.5812 0.0388 0.0968 ...
 $ spelling_errors_ratio         : num  0.0806 0.1765 0.125 0.0631 0.0653 ...
 $ isVideo                       : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 2 ...
 $ isFashion                     : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
 $ isFood                        : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ hasComments                   : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 1 2 2 1 ...
 $ hasGoogleAnalytics            : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 2 1 1 ...
 $ hasInlineCSS                  : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
 $ noOfMetaTags                  : num  3 6 5 9 16 22 6 9 7 0 ...

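If I omit the type = "prob" part, I get no error.

Any ideas?

Could it be the length of the variable "alchemy_category", which has the corresponding factor level appended to it, e.g. "alchemy_categoryarts_entertainment" in the model?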

Answer 1

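The answer is in bold at the top of your post =]

What are you modeling? Is it alchemy_category? The code only says formula, and we can't see it.

When you ask for class probabilities, the model predictions are a data frame with a separate column for each class/level. If the outcome (alchemy_category, or whatever the formula uses) has levels that are not valid column names, data.frame converts them to valid names. That creates a problem because the code is looking for a specific name, but the data frame has a different (though valid) name.

For example, if I have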

> test <- factor(c("level1", "level 2")) 
> levels(test)
[1] "level 2" "level1" 
> make.names(levels(test))
[1] "level.2" "level1"

The code will be looking for "level 2", but there is only "level.2".
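
The same mismatch can be reproduced in base R with the 0/1 levels from the question. This is only a sketch of the mechanism, not caret's actual internals: `out` and `obs_levels` are stand-in names borrowed from the error message, and the probabilities are made up.

```r
# Class levels as they appear in the question's outcome factor
obs_levels <- c("0", "1")

# data.frame() runs make.names() on column names by default
# (check.names = TRUE), so columns named "0" and "1" are renamed
out <- data.frame("0" = c(0.8, 0.3), "1" = c(0.2, 0.7))
names(out)  # "X0" "X1"

# Selecting columns by the original level names now fails
res <- tryCatch(out[, obs_levels, drop = FALSE], error = function(e) e)
inherits(res, "error")  # TRUE
```

The renamed columns match the `X0, X1` mentioned in the training warning.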

Answer 2

As stated above, the class values must be factors and their levels must be valid names. Another way to ensure this is:

levels(all.dat$target) <- make.names(levels(factor(all.dat$target)))
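
For a 0/1 outcome such as `label` in the question, `make.names()` yields `X0` and `X1`. If more readable class names are preferred, the levels can instead be relabeled by hand before calling `train()`. A minimal sketch on made-up data; the data frame `toy` and the names `no` and `yes` are arbitrary choices for illustration:

```r
toy <- data.frame(label = factor(c("0", "1", "1", "0")))

# Option A: let make.names() produce valid names automatically
auto <- toy$label
levels(auto) <- make.names(levels(auto))
levels(auto)  # "X0" "X1"

# Option B: map each original level to a hand-picked valid name
toy$label <- factor(toy$label, levels = c("0", "1"), labels = c("no", "yes"))
levels(toy$label)  # "no" "yes"
```

As far as I know, `twoClassSummary` treats the first factor level as the event of interest, so the order given in `levels` is worth checking when sensitivity and specificity are reported.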

Answer 3

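I read the answers above when I faced a similar problem. A more systematic solution is to do the following on both the train and test data sets. Make sure the response variable is included in feature.names as well.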

feature.names <- names(train)

for (f in feature.names) {
  if (is.factor(train[[f]])) {
    # Rename the levels in place so each label stays attached to its own observations
    levels(train[[f]]) <- make.names(levels(train[[f]]))
  }
}

This creates syntactically valid labels for all factors.
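
To apply the same renaming to both data sets, the loop can be wrapped in a small helper. This is a sketch on made-up data: `sanitize_factor_levels`, `train_toy` and `test_toy` are hypothetical names, and it assumes both data sets start out with the same level sets.

```r
# Rename the levels of every factor column to syntactically valid names
sanitize_factor_levels <- function(df) {
  for (f in names(df)) {
    if (is.factor(df[[f]])) {
      levels(df[[f]]) <- make.names(levels(df[[f]]))
    }
  }
  df
}

train_toy <- data.frame(x = c(1.5, 2.0, 3.5),
                        hasComments = factor(c("0", "1", "1")),
                        label = factor(c("1", "0", "1")))
# The test set has no outcome column, as in the question
test_toy <- data.frame(x = c(0.5, 4.0),
                       hasComments = factor(c("1", "0")))

train_toy <- sanitize_factor_levels(train_toy)
test_toy  <- sanitize_factor_levels(test_toy)

levels(train_toy$label)       # "X0" "X1"
levels(test_toy$hasComments)  # "X0" "X1"
```

One caveat: if two distinct levels map to the same valid name (for example "a b" and "a.b"), assigning the levels this way merges them.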

Answer 4

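Following the examples above, re-creating the outcome variable as a factor with valid labels usually solves the problem. It is best to change the original data set before splitting it into training and test sets: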

levels <- sort(unique(data$outcome))
data$outcome <- factor(data$outcome, labels = make.names(levels))

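As others have pointed out, this problem only occurs when classProbs = TRUE, which makes the train function produce additional statistics related to the outcome classes.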
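
Before training with classProbs = TRUE, the outcome can be checked up front. The helper below is a hypothetical sketch, not part of caret; it simply compares the levels against what `make.names()` would turn them into.

```r
# TRUE when every class level is already a syntactically valid R name
has_valid_class_levels <- function(y) {
  lv <- levels(factor(y))
  all(lv == make.names(lv))
}

has_valid_class_levels(factor(c("0", "1", "1")))       # FALSE
has_valid_class_levels(factor(c("no", "yes", "yes")))  # TRUE
```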

Answer 5

As @Sam Firke already pointed out in the comments (but I overlooked it), the levels TRUE/FALSE do not work either. So I converted them to yes/no.
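
A short sketch of why TRUE/FALSE fail and of the conversion, using a made-up logical vector (`outcome` and `outcome_f` are illustrative names): `TRUE` and `FALSE` are reserved words in R, so `make.names()` changes them too.

```r
# Reserved words are not valid names, so a dot is appended
make.names(c("TRUE", "FALSE"))  # "TRUE." "FALSE."

outcome <- c(TRUE, FALSE, TRUE, TRUE)

# Convert the logical outcome to a factor with valid level names
outcome_f <- factor(outcome, levels = c(TRUE, FALSE), labels = c("yes", "no"))
levels(outcome_f)  # "yes" "no"
```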

Solution 1
40 2013-09-02 23:31:01

Solution 2
15 2016-04-12 12:39:56

Solution 3
10 2016-01-17 01:04:59

Solution 4
0 2016-04-10 01:35:55

Solution 5
0 2017-08-29 09:25:25