Error with confusionMatrix: `data` and `reference` should be factors with the same levels

Question

I am using R for the first time (in RStudio), so apologies for any silly mistakes.

I am running a machine learning model. In my script, I get the following error:

Error: `data` and `reference` should be factors with the same levels. 
4. stop("`data` and `reference` should be factors with the same levels.", call. = FALSE) 
3. confusionMatrix.default(Y.pr, Y.ob)

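When I step into confusionMatrix, I get a bit confused.

The data variable (my Y.pr) is listed under the Data section, while reference (my Y.ob) is listed under Values. When I click on reference it shows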

num [1:8593] 0 0 1 1 1 0 0 0 1 1 ...

The data variable, when I expand it, looks like this.

Large matrix (8593 elements, 604.6 kb)
- attr(*, "dimnames")= List of 2
..$ : chr [1:8593] "34371" "34372" "34373" "34374" ...
..$ : NULL

None of this makes any sense to me. I am guessing the NULL is causing the problem?

Update

Using the same data, I can run a fully working model in Python.

End of update

Answer 1

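I will use the example from ?confusionMatrix to reproduce your error, then walk through one way to recover from it.

Up front

This answer works through the problem by assigning levels to a non-factor variable. If you are not certain what the numeric levels mean relative to pred, then your clinical study is over: any result is suspect and indefensible. The rest of the answer assumes you are certain of the levels (or that you are just experimenting, with no formal study or investigation riding on this data). Even if the original data were not factors, verifying what "1" and "2" (or whatever the numbers are) mean is a key step.

Demonstration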

library(caret)
lvs <- c("normal", "abnormal")
truth <- factor(rep(lvs, times = c(86, 258)),
                levels = rev(lvs))
pred <- factor(
  c(
    rep(lvs, times = c(54, 32)),
    rep(lvs, times = c(27, 231))),
  levels = rev(lvs))

head(truth)
# [1] normal normal normal normal normal normal
# Levels: abnormal normal
head(pred)
# [1] normal normal normal normal normal normal
# Levels: abnormal normal

Normal (ideal) execution:

confusionMatrix(pred, truth)
# Confusion Matrix and Statistics
#           Reference
# Prediction abnormal normal
#   abnormal      231     32
#   normal         27     54
#                                           
#                Accuracy : 0.8285          
#                  95% CI : (0.7844, 0.8668)
#     No Information Rate : 0.75            
#     P-Value [Acc > NIR] : 0.0003097       
#                                           
#                   Kappa : 0.5336          
#  Mcnemar's Test P-Value : 0.6025370       
#                                           
#             Sensitivity : 0.8953          
#             Specificity : 0.6279          
#          Pos Pred Value : 0.8783          
#          Neg Pred Value : 0.6667          
#              Prevalence : 0.7500          
#          Detection Rate : 0.6715          
#    Detection Prevalence : 0.7645          
#       Balanced Accuracy : 0.7616          
#                                           
#        'Positive' Class : abnormal

But what if the second argument is not a factor?

truth_num <- as.integer(truth)
head(truth_num)
# [1] 2 2 2 2 2 2
confusionMatrix(pred, truth_num)
# Error: `data` and `reference` should be factors with the same levels.
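
Before attempting a fix, it can help to confirm what the two arguments actually are. The following is a minimal base-R sketch; the helper name check_factor_pair is made up for illustration and is not part of caret, and the small vectors are stand-ins for real data.

```r
# Hypothetical helper (not part of caret): summarise whether two
# objects look like a usable data/reference pair.
check_factor_pair <- function(data, reference) {
  list(
    data_is_factor      = is.factor(data),
    reference_is_factor = is.factor(reference),
    same_level_set      = setequal(levels(data), levels(reference)),
    same_level_order    = identical(levels(data), levels(reference))
  )
}

# Stand-in data: a proper factor and a plain integer vector.
pred_demo  <- factor(c("normal", "abnormal", "abnormal"),
                     levels = c("abnormal", "normal"))
truth_demo <- c(2L, 1L, 1L)

str(check_factor_pair(pred_demo, truth_demo))
```

Any FALSE in the result points at what needs fixing first: a non-factor argument, a different set of levels, or the same levels in a different order.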

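The fix

What we need to do is turn truth_num back into a factor.

First, the theory: if it was once a factor and somehow got converted to integer, then it is a bunch of 1s and 2s (originally the indices of its levels). If it was never a factor, it could be any numbers, really, but the bottom line is: do we know which integer is which level? If you guess wrong, your test will give flat-out wrong results, with no error or warning.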

table(pred)
# pred
# abnormal   normal 
#      263       81 
table(truth_num)
# truth_num
#   1   2 
# 258  86

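Looking at the relative proportions alone suggests that the levels of truth_num should be the same as those of pred, that is, c("abnormal", "normal"). (But please re-read my note up front about chasing results: do not trust the proportions; go back to the source data to find out which is which.) That is how we will set it up here. There are several ways to go from indices to a factor; here are two: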

### one way
truth_num_fac <- factor(truth_num)
levels(truth_num_fac)
# [1] "1" "2"
head(truth_num_fac)
# [1] 2 2 2 2 2 2
# Levels: 1 2
levels(truth_num_fac) <- levels(pred)
head(truth_num_fac)
# [1] normal normal normal normal normal normal
# Levels: abnormal normal

### another way
dput(head(pred))
# structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("abnormal", "normal"
# ), class = "factor")
truth_num_fac <- structure(truth_num, .Label = levels(pred), class = "factor")
head(truth_num_fac)
# [1] normal normal normal normal normal normal
# Levels: abnormal normal
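
A third option, sketched here in base R, is to state the code-to-label mapping explicitly with the levels and labels arguments of factor(). The vector codes and the mapping itself are assumptions for illustration; the real mapping has to come from the source data.

```r
# Integer codes as they might arrive from a file or another tool.
codes <- c(2L, 2L, 1L, 1L, 2L)

# ASSUMPTION: 1 means "abnormal" and 2 means "normal". Confirm this
# against the source data; R cannot check it for you.
code_fac <- factor(codes,
                   levels = c(1L, 2L),
                   labels = c("abnormal", "normal"))

levels(code_fac)
as.character(code_fac)
```

Writing the mapping out in one call keeps the assumption visible in the script, which makes it easier to review later than a separate levels(...) <- assignment.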

Either way, the test now works.

confusionMatrix(pred, truth_num_fac)
# Confusion Matrix and Statistics
#           Reference
# Prediction abnormal normal
#   abnormal      231     32
#   normal         27     54
#                                           
#                Accuracy : 0.8285          
#                  95% CI : (0.7844, 0.8668)
#     No Information Rate : 0.75            
#     P-Value [Acc > NIR] : 0.0003097       
#                                           
#                   Kappa : 0.5336          
#  Mcnemar's Test P-Value : 0.6025370       
#                                           
#             Sensitivity : 0.8953          
#             Specificity : 0.6279          
#          Pos Pred Value : 0.8783          
#          Neg Pred Value : 0.6667          
#              Prevalence : 0.7500          
#          Detection Rate : 0.6715          
#    Detection Prevalence : 0.7645          
#       Balanced Accuracy : 0.7616          
#                                           
#        'Positive' Class : abnormal        
#
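
Applied to the objects in the question, the same idea might look like the sketch below. It rests on several assumptions that the question does not confirm: that Y.pr is a one-column matrix of predicted probabilities, that Y.ob holds 0/1 outcomes, and that 0.5 is a sensible cutoff. The values are simulated stand-ins, not the asker's data.

```r
set.seed(1)  # simulated stand-ins for the objects in the question

# ASSUMPTION: Y.pr is a one-column matrix of predicted probabilities
# with row names, and Y.ob is a numeric vector of 0/1 outcomes.
Y.pr <- matrix(runif(10), ncol = 1,
               dimnames = list(as.character(34371:34380), NULL))
Y.ob <- c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1)

# ASSUMPTION: 0.5 is the cutoff for calling a prediction "1".
# Y.pr[, 1] drops the matrix down to a plain vector first.
pr_class <- as.integer(Y.pr[, 1] > 0.5)

# Build both factors from the same explicit set of levels.
lv <- c(0, 1)
Y.pr.fac <- factor(pr_class, levels = lv)
Y.ob.fac <- factor(Y.ob,     levels = lv)

identical(levels(Y.pr.fac), levels(Y.ob.fac))
table(Prediction = Y.pr.fac, Reference = Y.ob.fac)
```

With both objects converted this way, they could then be passed as confusionMatrix(Y.pr.fac, Y.ob.fac). If Y.pr already holds class labels rather than probabilities, the cutoff step would not apply.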

What if ...

The levels are correct, but you see the following warning:

confusionMatrix(pred, truth_num_fac)
# Warning in confusionMatrix.default(pred, truth_num_fac):
# Levels are not in the same order for reference and data. Refactoring data to match.
# Confusion Matrix and Statistics
###...

This indicates that your levels are not in the same order. The fix is not hard:

levels(pred)
# [1] "abnormal" "normal"
levels(truth_num_fac)
# [1] "normal"   "abnormal"    <---- abnormal should be first, according to pred
truth_num_fac <- relevel(truth_num_fac, "abnormal")
confusionMatrix(pred, truth_num_fac)
# Confusion Matrix and Statistics
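
relevel() moves a single level to the front, which is enough with two levels. With more levels, one alternative is to re-declare the factor using the other factor's level order. A minimal base-R sketch, using small stand-in vectors rather than the data above:

```r
pred_demo  <- factor(c("normal", "abnormal", "normal"),
                     levels = c("abnormal", "normal"))
truth_demo <- factor(c("normal", "normal", "abnormal"),
                     levels = c("normal", "abnormal"))

# Re-declare the factor with the level order taken from pred_demo.
# The values stay the same; only the order of the levels changes.
truth_aligned <- factor(truth_demo, levels = levels(pred_demo))

levels(truth_aligned)
identical(as.character(truth_aligned), as.character(truth_demo))
```

This only reorders levels that already match by name; it does not fix a wrong code-to-label mapping.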

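What if the levels are incorrect? You will not get any error or warning, even though your test results will be completely different. That does not mean you should chase a desired result, but if the results are wildly off, it is worth a closer look: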

### setup for backwards data
truth_num_fac_backwards <- structure(truth_num, .Label = rev(levels(pred)), class = "factor")
truth_num_fac_backwards <- relevel(truth_num_fac_backwards, "abnormal")
head(truth_num_fac_backwards)
# [1] abnormal abnormal abnormal abnormal abnormal abnormal
# Levels: abnormal normal
confusionMatrix(pred, truth_num_fac_backwards)
# Confusion Matrix and Statistics
#           Reference
# Prediction abnormal normal
#   abnormal       32    231
#   normal         54     27
#
#                Accuracy : 0.1715     <----- OUCH
#                  95% CI : (0.1332, 0.2156)
#     No Information Rate : 0.75
#     P-Value [Acc > NIR] : 1
#
#                   Kappa : -0.3103
#  Mcnemar's Test P-Value : <2e-16
#
#             Sensitivity : 0.37209
#             Specificity : 0.10465
#          Pos Pred Value : 0.12167
#          Neg Pred Value : 0.33333
#              Prevalence : 0.25000
#          Detection Rate : 0.09302
#    Detection Prevalence : 0.76453
#       Balanced Accuracy : 0.23837
#
#        'Positive' Class : abnormal
#

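The right way to fix this is to go back and verify which level is which. It may be that you had it right and the results are telling you that things are not a great match. Any other fix would (in my opinion) be chasing results: make sure you get the data right the first time, and do not change the data to match your expected outcome.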
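
If the source records happen to carry both the numeric code and the original text label, one way to verify the mapping is to cross-tabulate the two. This is only a sketch: the data frame src and its columns code and label are hypothetical, and real source data may not store the label at all.

```r
# Hypothetical source records that carry both the numeric code and
# the original text label.
src <- data.frame(
  code  = c(1L, 2L, 1L, 1L, 2L),
  label = c("abnormal", "normal", "abnormal", "abnormal", "normal")
)

# Each code should line up with exactly one label.
mapping <- table(code = src$code, label = src$label)
mapping

# TRUE when every row of the table has a single non-zero cell.
all(rowSums(mapping > 0) == 1)
```

A row with more than one non-zero cell would mean a code is used for several labels, in which case no single mapping is safe to assume.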

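I tried to convert the numeric vector to a factor, but levels(...) returns NULL.

This is probably because your non-numeric vector is not a factor but a character vector. The fix for this should be easy: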

### setup for fake character data
pred_chr <- pred
pred_chr <- as.character(pred)
head(pred_chr)
# [1] "normal" "normal" "normal" "normal" "normal" "normal"
### the remedy
pred_chr_fac <- factor(pred_chr)
head(pred_chr_fac)
# [1] normal normal normal normal normal normal
# Levels: abnormal normal
levels(pred_chr_fac)
# [1] "abnormal" "normal"
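
By default factor() takes the sorted unique values as levels, which happens to give the order wanted above. When a different order matters, the levels can be stated explicitly. A short base-R sketch with a stand-in vector chr:

```r
chr <- c("normal", "abnormal", "normal")

# Default: levels are the sorted unique values.
levels(factor(chr))

# Explicit: state the order you want (here "normal" first).
chr_fac <- factor(chr, levels = c("normal", "abnormal"))
levels(chr_fac)

# A value missing from `levels` silently becomes NA, so check for it.
# "Normal" (capital N) is a deliberate mismatch for illustration.
anyNA(factor(c(chr, "Normal"), levels = c("normal", "abnormal")))
```

The NA check is worth running on real data, since a stray spelling or a difference in case would otherwise drop observations without any message.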

Solution 1
1 2020-05-20 15:28:19