Error: `data` and `reference` should be factors with the same levels. Confusion matrix for Logistic Regression
I have seen many answers about this particular error, but none that address my specific issue, so here is my problem.
This is what I do:
shortness_breath_data <- data_categ_nosev %>%
dplyr::select(shortness_breath, obesity, asthma, diabetes_type_one, diabetes_type_two, obesity, hypertension, heart_disease, lung_condition, liver_disease, kidney_disease, Covid_tested, Gender)
And this is the output of dput(head(shortness_breath_data)):
structure(list(shortness_breath = structure(c(1L, 2L, 1L, 1L,
1L, 2L), .Label = c("No", "Yes"), class = "factor"), obesity = structure(c(1L,
1L, 2L, 2L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
asthma = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), diabetes_type_one = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
diabetes_type_two = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), hypertension = structure(c(1L,
1L, 1L, 2L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
heart_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), lung_condition = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
liver_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), kidney_disease = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
Covid_tested = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("negative",
"positive"), class = "factor"), Gender = structure(c(2L,
1L, 2L, 1L, 1L, 2L), .Label = c("Female", "Male", "Other"
), class = "factor")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), problems = structure(list(row = c(2910L,
35958L), col = c("how_unwell", "how_unwell"), expected = c("a double",
"a double"), actual = c("How Unwell", "How Unwell"), file = c("'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'",
"'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)))
I then split this into training and testing datasets.
shortness_breath_data$shortness_breath <- as.factor(shortness_breath_data$shortness_breath)
n <- nrow(shortness_breath_data)
set.seed(22)
trainingdx <- sample(1:n, 0.7 * n)
train <- shortness_breath_data[trainingdx,]
validate <- shortness_breath_data[-trainingdx,]
train %>% distinct(shortness_breath)
validate %>% distinct(shortness_breath)
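A quick way to confirm that both splits carry the same outcome levels is to tabulate the outcome in each one. The sketch below uses a small invented data frame (`toy`, `toy_train` and `toy_validate` are hypothetical names, not the real data) and base R only:

```r
# Toy stand-in for shortness_breath_data (values are invented for illustration)
set.seed(22)
toy <- data.frame(
  shortness_breath = factor(
    sample(c("No", "Yes"), 100, replace = TRUE, prob = c(0.8, 0.2)),
    levels = c("No", "Yes")
  ),
  asthma = factor(sample(c("No", "Yes"), 100, replace = TRUE),
                  levels = c("No", "Yes"))
)

# Same 70/30 split idea as above
n <- nrow(toy)
idx <- sample(seq_len(n), size = round(0.7 * n))
toy_train <- toy[idx, ]
toy_validate <- toy[-idx, ]

# Subsetting a factor keeps its levels, so both splits report the same levels
# even if one class happens to be rare or absent in a split
train_counts <- table(toy_train$shortness_breath)
validate_counts <- table(toy_validate$shortness_breath)
print(train_counts)
print(validate_counts)
```

If the levels match here, the split itself is unlikely to be the source of a "same levels" complaint.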
To make it easier to find the issue, here is the output of dput(head(train)) and dput(head(validate)).
train dataset:
structure(list(shortness_breath = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("No", "Yes"), class = "factor"), obesity = structure(c(2L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
asthma = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), diabetes_type_one = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
diabetes_type_two = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), hypertension = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
heart_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), lung_condition = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
liver_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), kidney_disease = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
Covid_tested = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("negative",
"positive"), class = "factor"), Gender = structure(c(1L,
1L, 1L, 2L, 1L, 2L), .Label = c("Female", "Male", "Other"
), class = "factor")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), problems = structure(list(row = c(2910L,
35958L), col = c("how_unwell", "how_unwell"), expected = c("a double",
"a double"), actual = c("How Unwell", "How Unwell"), file = c("'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'",
"'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)))
validate dataset:
structure(list(shortness_breath = structure(c(1L, 2L, 2L, 1L,
1L, 1L), .Label = c("No", "Yes"), class = "factor"), obesity = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
asthma = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), diabetes_type_one = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
diabetes_type_two = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), hypertension = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
heart_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), lung_condition = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
liver_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), kidney_disease = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
Covid_tested = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("negative",
"positive"), class = "factor"), Gender = structure(c(2L,
1L, 2L, 2L, 1L, 1L), .Label = c("Female", "Male", "Other"
), class = "factor")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), problems = structure(list(row = c(2910L,
35958L), col = c("how_unwell", "how_unwell"), expected = c("a double",
"a double"), actual = c("How Unwell", "How Unwell"), file = c("'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'",
"'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)))
Then I build my logistic regression model using forward stepwise selection.
null_model <- glm(shortness_breath ~ 1, data = train, family = "binomial")
fm_shortness_breath <- glm(shortness_breath ~., data = train, family = "binomial")
stepmodel <- step(null_model, scope = list(lower = null_model, upper = fm_shortness_breath), direction = "forward")
Then I print the model summary and store the predictions in the source data frames.
summary(stepmodel)
validate$pred <- predict(stepmodel, validate, type = "response")
validate$real <- validate$shortness_breath
train$pred <- predict(stepmodel, train, type = "response")
train$real <- train$shortness_breath
Then I plot my ROC curve with no problem:
plot.roc(validate$real, validate$pred, col = "red", main = "ROC Validation Set", percent = TRUE, print.auc = TRUE)
Yet when I try to get my confusion matrix, I get an error. This is my code:
cm_stepmodel <- confusionMatrix(stepmodel, validate)
And then the error comes in:
Error: `data` and `reference` should be factors with the same levels.
With Show Traceback:
3. stop("`data` and `reference` should be factors with the same levels.", call. = FALSE)
2. confusionMatrix.default(stepmodel, validate)
1. confusionMatrix(stepmodel, validate)
I simply do not see the problem. I tried several other options, but none of them worked. I have reproduced, step by step, the exact approach I am taking, and I still do not have an answer. I have also tagged this question with RMarkdown, alongside caret and R, just in case.
The libraries used are:
library(tidyverse)
library(conflicted)
library(tidymodels)
library(ggrepel)
library(corrplot)
library(dplyr)
library(corrr)
library(themis)
library(rsample)
library(caret)
library(forcats)
library(rcompanion)
library(MASS)
library(pROC)
library(ROCR)
library(data.table)
Try converting your predicted probabilities to labels, and then run confusionMatrix on those labels:
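validate$pred <- predict(stepmodel, validate, type = "response")
# glm models the probability of the second factor level ("Yes"); reuse the levels of the real labels
validate$pred_label <- factor(ifelse(validate$pred >= 0.5, "Yes", "No"), levels = levels(validate$real))
confusionMatrix(validate$pred, validate$real) # Error: pred holds probabilities, not a factor
confusionMatrix(validate$pred_label, validate$real) # This will work: data = predicted labels, reference = actual labels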
Check that the labels you assign in the validate$pred_label statement match the labels in your original dataset.
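To check which label a predicted probability refers to, it may help to recall that for a factor response `glm` with `family = "binomial"` treats the first level as failure, so the fitted probability is for the second level. A minimal sketch on simulated data (the variables `x`, `y`, `fit`, `prob` and `pred_label` are invented for illustration, not taken from the question):

```r
set.seed(1)
x <- rnorm(200)
# Simulated outcome: "Yes" is more likely when x is large
y <- factor(ifelse(x + rnorm(200) > 0, "Yes", "No"), levels = c("No", "Yes"))

fit <- glm(y ~ x, family = "binomial")
prob <- predict(fit, type = "response")

# The first level ("No") is treated as failure, so prob estimates
# the probability of levels(y)[2], here "Yes"
positive <- levels(y)[2]
negative <- levels(y)[1]
pred_label <- factor(ifelse(prob >= 0.5, positive, negative), levels = levels(y))

# Mean predicted probability within each actual class
mean_prob <- tapply(prob, y, mean)
print(mean_prob)
```

If the mean probability is higher among the actual "Yes" rows, the threshold is being applied in the right direction.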
I'm not particularly familiar with confusionMatrix, but the general idea is that you predict labels and compare them with the actual labels of your data. It threw an error because it was not given two sets of labels to compare: you need to turn the predicted probabilities into labels first. Please correct me if I made a conceptual error or coding mistake above.
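One pitfall when turning probabilities into labels: if every probability falls on the same side of the threshold, `as.factor()` returns a factor with a single level, and the predicted and actual labels no longer share the same levels. The sketch below uses made-up values (`real`, `pred`, `lost` and `pred_label` are hypothetical stand-ins) and base R's `table()` in place of caret:

```r
# Made-up labels and probabilities, standing in for validate$real and validate$pred
real <- factor(c("No", "No", "Yes", "No", "Yes", "No"), levels = c("No", "Yes"))
pred <- c(0.10, 0.20, 0.40, 0.05, 0.30, 0.15)  # every value is below 0.5

# as.factor() keeps only the labels that actually occur, so "Yes" is lost here
lost <- as.factor(ifelse(pred >= 0.5, "Yes", "No"))
print(levels(lost))

# Passing the levels explicitly keeps both classes
pred_label <- factor(ifelse(pred >= 0.5, "Yes", "No"), levels = levels(real))
print(levels(pred_label))

# Base R cross-tabulation: rows are predictions, columns are actual labels
cm <- table(Prediction = pred_label, Reference = real)
print(cm)
```

With caret installed, the same two factors can be passed as `confusionMatrix(data = pred_label, reference = real)`.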