How can I create a confusion matrix and determine the effect of a 50% cutoff in R?

I have the values of a confusion matrix that I want to analyze, and I want to determine the effect a cutoff will have. Let's say I have these vectors:

v1 <- c(200, 25)
v2 <- c(10, 400)

These are the values of a confusion matrix (transposed, row 1 would be (10, 200) and row 2 would be (400, 25)). I want to know how a 50% cutoff would affect the false negatives.

You cannot do this with just a confusion matrix. The cutoff is used to create the confusion matrix, so you need the data the confusion matrix was made from to assess the effects of different cutoffs. Here is an example. Let's say we have some data like the following:

data <- structure(list(response = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                    1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1), 
                       y = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 
                             4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4), 
                       z = c(4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 
                             4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2)), 
                  class = "data.frame", row.names = c(NA, -32L))

head(data)
  response y z
1        1 4 4
2        1 4 4
3        1 4 1
4        0 3 1
5        0 3 2
6        0 3 1

Let's say we fit a model to predict response based on y and z.

mod <- glm(response ~ y + z, data = data, family = "binomial")

Now we can predict the values of response and add them to the data.

data$fit <- predict(mod, type = "response")
head(data)
  response y z          fit
1        1 4 4 4.217892e-01
2        1 4 4 4.217892e-01
3        1 4 1 8.435784e-01
4        0 3 1 2.345578e-09
5        0 3 2 1.204047e-09
6        0 3 1 2.345578e-09

Our fit values cannot be compared with response directly, because they are continuous probabilities and response is binary. So, we choose a cutoff, say 0.5 (or 50%). When we do this, we lose information: we know whether each fit value is above or below the cutoff, but predicted no longer holds the original value.

data$predicted <- (data$fit >= 0.5) ^ 1 # TRUE ^ 1 = 1, FALSE ^ 1 = 0
head(data)
  response y z          fit predicted
1        1 4 4 4.217892e-01         0
2        1 4 4 4.217892e-01         0
3        1 4 1 8.435784e-01         1
4        0 3 1 2.345578e-09         0
5        0 3 2 1.204047e-09         0
6        0 3 1 2.345578e-09         0

The caret package has a function to generate a confusion matrix.

library(caret)
confusionMatrix(factor(data$predicted), factor(data$response), positive = "1")$table

          Reference
Prediction  0  1
         0 17  2
         1  2 11
# 2 false negatives, false negative rate = 2 / 13 = 15.4%
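
The false negative rate quoted above is FN / (FN + TP), taken from the Reference = 1 column of the table. As a minimal sketch, the same arithmetic in base R, with the table typed in by hand (the object name cm is only for illustration and is not part of the original answer):

```r
# Confusion matrix from the 0.5 cutoff, entered by hand.
# Rows are predictions, columns are the reference (true) values.
cm <- matrix(c(17, 2, 2, 11), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

fn <- cm["0", "1"]  # truly 1, predicted 0
tp <- cm["1", "1"]  # truly 1, predicted 1

fnr <- fn / (fn + tp)  # 2 / 13
round(100 * fnr, 1)    # false negative rate as a percentage
```

This is also 1 minus the sensitivity, so either number can be used to compare cutoffs.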
        

We cannot recreate the original data from this confusion matrix. If you want to choose a different cutoff, you will need to go back to the original data. Then you will get a new confusion matrix.

# cutoff = 0.25
data$predicted2 <- (data$fit >= 0.25) ^ 1 # TRUE ^ 1 = 1, FALSE ^ 1 = 0
confusionMatrix(factor(data$predicted2), factor(data$response), positive = "1")$table

          Reference
Prediction  0  1
         0 15  0
         1  4 13
# 0 false negatives, false negative rate = 0%
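
To compare several cutoffs at once, the thresholding step can be wrapped in a small function. The following is only a sketch: fnr_at_cutoff is a hypothetical helper, and prob and actual are made-up values standing in for the fitted probabilities and the true response.

```r
# Hypothetical helper: false negative rate at a given cutoff.
# prob   = predicted probabilities, actual = true 0/1 labels.
fnr_at_cutoff <- function(prob, actual, cutoff) {
  predicted <- as.integer(prob >= cutoff)
  fn <- sum(predicted == 0 & actual == 1)  # missed positives
  tp <- sum(predicted == 1 & actual == 1)  # detected positives
  fn / (fn + tp)
}

# Made-up probabilities and labels, for illustration only.
prob   <- c(0.90, 0.60, 0.40, 0.30, 0.70, 0.20, 0.10, 0.05)
actual <- c(1, 1, 1, 1, 0, 0, 0, 0)

# Evaluate the false negative rate at each candidate cutoff.
cutoffs <- c(0.25, 0.50, 0.75)
fnr <- sapply(cutoffs, fnr_at_cutoff, prob = prob, actual = actual)
data.frame(cutoff = cutoffs, fnr = fnr)
```

With the example data from this answer, the same call would presumably take data$fit and data$response as prob and actual. Lowering the cutoff can only reduce or keep the false negative rate, typically at the cost of more false positives.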
     
       

You already seem to have the confusion matrix. If you want additional statistics on it, you can use the caret package. Just create a matrix and set its class to table.

m = cbind(v2, v1)
dimnames(m) = list(G1 = c("A", "B"), G2 = c("A", "B"))
attr(m, "class") = "table"
CM = caret::confusionMatrix(m)
CM

As for the effect of different cutoffs, the other answer provides more information.
