Is there a faster way to make this confusion matrix table in R?

Question

I am trying to make a confusion matrix table in R using the following dataframe:

mydf <- structure(list(pred_class = c("dog", "dog", "fish", "cat", "cat", 
"dog", "fish", "cat", "dog", "fish"), true_class = c("cat", "cat", 
"dog", "cat", "cat", "dog", "dog", "cat", "dog", "fish")), row.names = c(NA, 
10L), class = "data.frame")

  pred_class true_class
1        dog        cat
2        dog        cat
3       fish        dog
4        cat        cat
5        cat        cat
6        dog        dog

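I have written code that does what I want: for each class (dog, cat, or fish), it says whether each row is a true positive, false positive, true negative, or false negative.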

library(dplyr)

conf_mat <- mydf %>%
    mutate(
        dog_conf = case_when(
            true_class == "dog" &  pred_class == "dog" ~ "TP",
            true_class == "dog" &  pred_class != "dog" ~ "FN",
            true_class != "dog" &  pred_class == "dog" ~ "FP",
            true_class != "dog" &  pred_class != "dog" ~ "TN"
        ),
        cat_conf = case_when(
            true_class == "cat" &  pred_class == "cat" ~ "TP",
            true_class == "cat" &  pred_class != "cat" ~ "FN",
            true_class != "cat" &  pred_class == "cat" ~ "FP",
            true_class != "cat" &  pred_class != "cat" ~ "TN"
        ),
        fish_conf = case_when(
            true_class == "fish" &  pred_class == "fish" ~ "TP",
            true_class == "fish" &  pred_class != "fish" ~ "FN",
            true_class != "fish" &  pred_class == "fish" ~ "FP",
            true_class != "fish" &  pred_class != "fish" ~ "TN"
        )
    )

However, this code is very repetitive and bulky, and I am not sure how to simplify it. Does anyone have any suggestions? Thank you.

Answer 1

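Here is an option with map, where we loop over the unique elements of the dataset, create a column with transmute inside the loop based on the conditions specified in the OP's post, and bind those columns to the original data.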

library(dplyr)
library(purrr)
library(stringr)

# Loop over every class label; each iteration returns one <class>_conf column
map_dfc(unique(unlist(mydf)), ~
    mydf %>%
        transmute(!! str_c(.x, '_conf') :=
            case_when(
                true_class == .x & pred_class == .x ~ "TP",
                true_class == .x & pred_class != .x ~ "FN",
                true_class != .x & pred_class == .x ~ "FP",
                true_class != .x & pred_class != .x ~ "TN"
            ))) %>%
    bind_cols(mydf, .)

-output

#     pred_class true_class dog_conf cat_conf fish_conf
#1         dog        cat       FP       FN        TN
#2         dog        cat       FP       FN        TN
#3        fish        dog       FN       TN        FP
#4         cat        cat       TN       TP        TN
#5         cat        cat       TN       TP        TN
#6         dog        dog       TP       TN        TN
#7        fish        dog       FN       TN        FP
#8         cat        cat       TN       TP        TN
#9         dog        dog       TP       TN        TN
#10       fish       fish       TN       TN        TP
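
The same per-class labelling can also be sketched in base R, without dplyr or purrr. This is only an illustration: label_conf is a hypothetical helper name (not from any package) and result is just a name for the copy being filled. It applies the same four conditions as the case_when above.

```r
# Example data from the question
mydf <- data.frame(
  pred_class = c("dog", "dog", "fish", "cat", "cat", "dog", "fish", "cat", "dog", "fish"),
  true_class = c("cat", "cat", "dog", "cat", "cat", "dog", "dog", "cat", "dog", "fish"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label every row as TP/FN/FP/TN with respect to one class `cls`
label_conf <- function(true, pred, cls) {
  is_true <- true == cls
  is_pred <- pred == cls
  ifelse(is_true & is_pred, "TP",
    ifelse(is_true & !is_pred, "FN",
      ifelse(!is_true & is_pred, "FP", "TN")))
}

# All class labels that occur in either column
classes <- unique(unlist(mydf[c("pred_class", "true_class")]))

# One new <class>_conf column per class, in the original row order
result <- mydf
result[paste0(classes, "_conf")] <- lapply(
  classes,
  function(cls) label_conf(mydf$true_class, mydf$pred_class, cls)
)
result
```

Because the helper works on plain vectors, it keeps the row order of mydf and can be reused for any number of classes.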

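Or use merge with a key/value dataset

# Key: is the prediction / the truth equal to the class in question?
keydat <- data.frame(pred_class = c(TRUE, TRUE, FALSE, FALSE),
                     true_class = c(TRUE, FALSE, TRUE, FALSE),
                     conf = c("TP", "FP", "FN", "TN"))

un1 <- unique(unlist(mydf))
mydf[paste0(un1, "_conf")] <- lapply(un1, function(x) {
    flags <- as.data.frame(mydf[c("pred_class", "true_class")] == x)
    flags$row_id <- seq_len(nrow(flags))
    # merge() sorts on the key columns by default, so restore the original row order
    out <- merge(flags, keydat, all.x = TRUE)
    out$conf[order(out$row_id)]
})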
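
A key/value lookup can also be done without merge, by indexing a named character vector. The sketch below is an assumption-laden illustration rather than part of the answer: key, labelled and lookup are names chosen here. Each key name encodes "<prediction equals class>.<truth equals class>".

```r
# Example data from the question
mydf <- data.frame(
  pred_class = c("dog", "dog", "fish", "cat", "cat", "dog", "fish", "cat", "dog", "fish"),
  true_class = c("cat", "cat", "dog", "cat", "cat", "dog", "dog", "cat", "dog", "fish"),
  stringsAsFactors = FALSE
)

# Names are "<pred is class>.<true is class>"
key <- c("TRUE.TRUE"   = "TP",
         "TRUE.FALSE"  = "FP",
         "FALSE.TRUE"  = "FN",
         "FALSE.FALSE" = "TN")

classes <- unique(c(mydf$pred_class, mydf$true_class))
labelled <- mydf
for (cls in classes) {
  # Build one lookup string per row, e.g. "TRUE.FALSE"
  lookup <- paste(mydf$pred_class == cls, mydf$true_class == cls, sep = ".")
  labelled[[paste0(cls, "_conf")]] <- unname(key[lookup])
}
labelled
```

Indexing a vector by name returns one value per row in the order of the lookup strings, so no re-sorting is needed afterwards.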

Answer 2

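In addition to @akrun's excellent answer: if you want to determine the status of each prediction (TP/TN/FP/FN) in order to calculate other statistics/metrics, many of them are provided by the caret package, e.g.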

library(caret)
mydf <- structure(list(pred_class = c("dog", "dog", "fish", "cat", "cat", 
"dog", "fish", "cat", "dog", "fish"), true_class = c("cat", "cat", 
"dog", "cat", "cat", "dog", "dog", "cat", "dog", "fish")), row.names = c(NA, 
10L), class = "data.frame")

conf_matrix <- confusionMatrix(factor(mydf$pred_class),
                               reference = factor(mydf$true_class),
                               mode = "everything")
conf_matrix
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction cat dog fish
#>       cat    3   0    0
#>       dog    2   2    0
#>      fish    0   2    1
#> 
#> Overall Statistics
#>                                          
#>                Accuracy : 0.6             
#>                  95% CI : (0.2624, 0.8784)
#>     No Information Rate : 0.5             
#>     P-Value [Acc > NIR] : 0.377           
#>                                          
#>                   Kappa : 0.3939          
#>                                          
#>  Mcnemar's Test P-Value : NA              
#>
#> Statistics by Class:
#>
#>                      Class: cat Class: dog Class: fish
#> Sensitivity              0.6000     0.5000      1.0000
#> Specificity              1.0000     0.6667      0.7778
#> Pos Pred Value           1.0000     0.5000      0.3333
#> Neg Pred Value           0.7143     0.6667      1.0000
#> Precision                1.0000     0.5000      0.3333
#> Recall                   0.6000     0.5000      1.0000
#> F1                       0.7500     0.5000      0.5000
#> Prevalence               0.5000     0.4000      0.1000
#> Detection Rate           0.3000     0.2000      0.1000
#> Detection Prevalence     0.3000     0.4000      0.3000
#> Balanced Accuracy        0.8000     0.5833      0.8889
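
If only the per-class TP/FP/FN/TN counts are needed, they can be derived from a plain cross-tabulation in base R. This is a minimal sketch that does not use caret; the names tab and counts are chosen here for illustration. Rows are predictions and columns are the reference, as in the printed matrix above.

```r
# Example data from the question
mydf <- data.frame(
  pred_class = c("dog", "dog", "fish", "cat", "cat", "dog", "fish", "cat", "dog", "fish"),
  true_class = c("cat", "cat", "dog", "cat", "cat", "dog", "dog", "cat", "dog", "fish"),
  stringsAsFactors = FALSE
)

# Use the same levels for both factors so the table is square
classes <- sort(unique(c(mydf$pred_class, mydf$true_class)))
tab <- table(Prediction = factor(mydf$pred_class, levels = classes),
             Reference  = factor(mydf$true_class, levels = classes))

TP <- unname(diag(tab))            # predicted as the class and truly the class
FP <- unname(rowSums(tab)) - TP    # predicted as the class but truly something else
FN <- unname(colSums(tab)) - TP    # truly the class but predicted as something else
TN <- sum(tab) - TP - FP - FN      # everything else

counts <- data.frame(class = classes, TP = TP, FP = FP, FN = FN, TN = TN)
counts
```

Each row of counts is the one-vs-rest 2x2 table for one class, which is what the per-class statistics are computed from.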

Further notes:

For a 2x2 table with the notation

            Reference   
Predicted   Event   No Event
Event           A        B
No Event        C        D

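where "A" = TP, "B" = FP, "C" = FN, "D" = TN, the formulas used by the package/function are:

Sensitivity = A/(A+C)
Specificity = D/(B+D)
Prevalence = (A+C)/(A+B+C+D)
PPV = (sensitivity * prevalence)/((sensitivity * prevalence) + ((1-specificity) * (1-prevalence)))
NPV = (specificity * (1-prevalence))/(((1-sensitivity) * prevalence) + ((specificity) * (1-prevalence)))
Detection Rate = A/(A+B+C+D)
Detection Prevalence = (A+B)/(A+B+C+D)
Balanced Accuracy = (sensitivity+specificity)/2
Precision = A/(A+B)
Recall = A/(A+C)
F1 = (1+beta^2) * precision * recall/((beta^2 * precision)+recall)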
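
As a worked check, the formulas can be evaluated by hand for the "dog" class. The counts A = 2, B = 2, C = 2, D = 4 are read off the confusion matrix printed above (dog row and dog column). beta = 1 is an assumption here; it reproduces the F1 value shown for dog in the output.

```r
# One-vs-rest counts for the "dog" class
A <- 2  # TP: predicted dog, truly dog
B <- 2  # FP: predicted dog, truly cat
C <- 2  # FN: truly dog, predicted fish
D <- 4  # TN: neither predicted nor truly dog
n <- A + B + C + D

sensitivity <- A / (A + C)
specificity <- D / (B + D)
prevalence  <- (A + C) / n

ppv <- (sensitivity * prevalence) /
  ((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))
npv <- (specificity * (1 - prevalence)) /
  (((1 - sensitivity) * prevalence) + (specificity * (1 - prevalence)))

detection_rate       <- A / n
detection_prevalence <- (A + B) / n
balanced_accuracy    <- (sensitivity + specificity) / 2

precision <- A / (A + B)
recall    <- A / (A + C)
beta      <- 1  # assumption: equal weight on precision and recall
f1 <- (1 + beta^2) * precision * recall / ((beta^2 * precision) + recall)

round(c(Sensitivity = sensitivity, Specificity = specificity,
        PPV = ppv, NPV = npv, F1 = f1,
        BalancedAccuracy = balanced_accuracy), 4)
```

The rounded values agree with the "Class: dog" column of the statistics above (0.5000, 0.6667, 0.5000, 0.6667, 0.5000, 0.5833).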
