Error: `data` and `reference` should be factors with the same levels. Confusion matrix for Logistic Regression

I have seen lots of answers regarding this particular error, but none that addresses my specific issue. Hence this question.

This is what I do:

shortness_breath_data <- data_categ_nosev %>%
  dplyr::select(shortness_breath, obesity, asthma, diabetes_type_one, diabetes_type_two, hypertension, heart_disease, lung_condition, liver_disease, kidney_disease, Covid_tested, Gender)

And this is the output of dput(head(shortness_breath_data)):

structure(list(shortness_breath = structure(c(1L, 2L, 1L, 1L, 
1L, 2L), .Label = c("No", "Yes"), class = "factor"), obesity = structure(c(1L, 
1L, 2L, 2L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    asthma = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), diabetes_type_one = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    diabetes_type_two = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), hypertension = structure(c(1L, 
    1L, 1L, 2L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    heart_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), lung_condition = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    liver_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), kidney_disease = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    Covid_tested = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("negative", 
    "positive"), class = "factor"), Gender = structure(c(2L, 
    1L, 2L, 1L, 1L, 2L), .Label = c("Female", "Male", "Other"
    ), class = "factor")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"), problems = structure(list(row = c(2910L, 
35958L), col = c("how_unwell", "how_unwell"), expected = c("a double", 
"a double"), actual = c("How Unwell", "How Unwell"), file = c("'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'", 
"'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)))

I then split this into training and validation datasets.

shortness_breath_data$shortness_breath <- as.factor(shortness_breath_data$shortness_breath)

n <- nrow(shortness_breath_data)
set.seed(22)
trainingdx <- sample(1:n, 0.7 * n)

train <- shortness_breath_data[trainingdx,]
validate <- shortness_breath_data[-trainingdx,]

train %>% distinct(shortness_breath)
validate %>% distinct(shortness_breath)

To make it easier to find the issue, I have also provided dput(head(train)) and dput(head(validate)).

train dataset:

structure(list(shortness_breath = structure(c(1L, 1L, 1L, 1L, 
1L, 1L), .Label = c("No", "Yes"), class = "factor"), obesity = structure(c(2L, 
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    asthma = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), diabetes_type_one = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    diabetes_type_two = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), hypertension = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    heart_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), lung_condition = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    liver_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), kidney_disease = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    Covid_tested = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("negative", 
    "positive"), class = "factor"), Gender = structure(c(1L, 
    1L, 1L, 2L, 1L, 2L), .Label = c("Female", "Male", "Other"
    ), class = "factor")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"), problems = structure(list(row = c(2910L, 
35958L), col = c("how_unwell", "how_unwell"), expected = c("a double", 
"a double"), actual = c("How Unwell", "How Unwell"), file = c("'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'", 
"'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)))

validate dataset:

structure(list(shortness_breath = structure(c(1L, 2L, 2L, 1L, 
1L, 1L), .Label = c("No", "Yes"), class = "factor"), obesity = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    asthma = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), diabetes_type_one = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    diabetes_type_two = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), hypertension = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    heart_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), lung_condition = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    liver_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), kidney_disease = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    Covid_tested = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("negative", 
    "positive"), class = "factor"), Gender = structure(c(2L, 
    1L, 2L, 2L, 1L, 1L), .Label = c("Female", "Male", "Other"
    ), class = "factor")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"), problems = structure(list(row = c(2910L, 
35958L), col = c("how_unwell", "how_unwell"), expected = c("a double", 
"a double"), actual = c("How Unwell", "How Unwell"), file = c("'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'", 
"'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)))

Then I build my logistic regression model using forward stepwise selection.

null_model <- glm(shortness_breath ~ 1, data = train, family = "binomial")

fm_shortness_breath <- glm(shortness_breath ~., data = train, family = "binomial")

stepmodel <- step(null_model, scope = list(lower = null_model, upper = fm_shortness_breath), direction = "forward")

Then I get the model summary and store the predictions in the source data frames.

summary(stepmodel)

validate$pred <- predict(stepmodel, validate, type = "response")

validate$real <- validate$shortness_breath

train$pred <- predict(stepmodel, train, type = "response")
train$real <- train$shortness_breath

Then I plot my ROC curve with no problem:

plot.roc(validate$real, validate$pred, col = "red", main = "ROC Validation Set", percent = TRUE, print.auc = TRUE)

Yet when I try to get my confusion matrix, I get an error. This is my code:

cm_stepmodel <- confusionMatrix(stepmodel, validate)

And then the error appears:

Error: `data` and `reference` should be factors with the same levels.

With Show Traceback:

3. stop("`data` and `reference` should be factors with the same levels.", call. = FALSE)
2. confusionMatrix.default(stepmodel, validate)
1. confusionMatrix(stepmodel, validate)
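The traceback shows the call being dispatched to confusionMatrix.default with the model object and the data frame as its first two arguments, and the error message asks for two factors. A minimal base-R sketch for checking what is actually being passed might look like the following. The data here is simulated, and the names `toy`, `fit` and `prob` are hypothetical stand-ins, not the original survey data.

```r
# Simulated stand-in for the real data (hypothetical names and values)
set.seed(22)
toy <- data.frame(
  asthma = factor(sample(c("No", "Yes"), 200, replace = TRUE),
                  levels = c("No", "Yes"))
)
# The outcome depends weakly on asthma so the model has something to fit
p <- ifelse(toy$asthma == "Yes", 0.6, 0.3)
toy$shortness_breath <- factor(ifelse(runif(200) < p, "Yes", "No"),
                               levels = c("No", "Yes"))

fit <- glm(shortness_breath ~ asthma, data = toy, family = "binomial")

# Neither the model nor the data frame is a factor
class(fit)        # "glm" "lm"
is.factor(fit)    # FALSE
is.factor(toy)    # FALSE

# With type = "response" the predictions are numeric probabilities, not labels
prob <- predict(fit, toy, type = "response")
is.numeric(prob)  # TRUE
is.factor(prob)   # FALSE
```

If these checks give the same results on the real `stepmodel` and `validate` objects, then neither argument is a factor, which is consistent with the error message.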

I simply do not see the problem. I tried several other options, but none worked. I have reproduced, step by step, the exact approach I am taking, and I still have no answer. I have also tagged this question with RMarkdown, alongside caret and R, just in case.

The libraries used are:

library(tidyverse)
library(conflicted)
library(tidymodels)
library(ggrepel)
library(corrplot)
library(dplyr)
library(corrr) 
library(themis)
library(rsample)
library(caret)
library(forcats)
library(rcompanion)
library(MASS)
library(pROC)
library(ROCR)
library(data.table)

Try converting your predicted probabilities to labels, and then run confusionMatrix on those:

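validate$pred <- predict(stepmodel, validate, type = "response")
# Convert probabilities to labels, reusing the levels of the true outcome
validate$pred_label <- factor(ifelse(validate$pred >= 0.5, "Yes", "No"), levels = levels(validate$real))
# confusionMatrix() expects the predictions first (data), then the truth (reference)
confusionMatrix(validate$pred, validate$real) # Error: pred is numeric, not a factor
confusionMatrix(data = validate$pred_label, reference = validate$real) # This will work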

Check that the labels you assign in the validate$pred_label statement match those in your original dataset.
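One way the labels can drift from the original ones: as.factor() only keeps the labels that actually occur in the predictions, so a model that never predicts "Yes" yields a one-level factor. A small sketch of that case, using made-up probabilities (the vectors `truth` and `prob` are hypothetical):

```r
truth <- factor(c("No", "Yes", "No", "No"), levels = c("No", "Yes"))
prob  <- c(0.10, 0.35, 0.20, 0.05)   # hypothetical probabilities, all below 0.5

# as.factor() keeps only the labels that actually occur
pred_bad <- as.factor(ifelse(prob >= 0.5, "Yes", "No"))
levels(pred_bad)                           # "No"

# Passing the levels explicitly keeps both, even if one never occurs
pred_ok <- factor(ifelse(prob >= 0.5, "Yes", "No"), levels = levels(truth))
levels(pred_ok)                            # "No" "Yes"
identical(levels(pred_ok), levels(truth))  # TRUE

# Cross-tabulation: rows are predictions, columns are the true labels
table(Prediction = pred_ok, Reference = truth)
```

Taking the levels from the true outcome, rather than typing them again, also guards against spelling or case differences such as "yes" versus "Yes".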

I'm not particularly familiar with confusionMatrix, but the general idea is that you predict labels and compare them to the actual labels in your data. It threw an error because you were comparing labels with probabilities; you needed to assign the labels first. Please correct me if I made a conceptual error or coding mistake above.
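To illustrate the idea end to end, here is a self-contained sketch on simulated data. Everything in it is an assumption made for illustration: the data frame `toy`, the coefficients, and the 0.5 cut-off are not taken from the original dataset. The confusion matrix is built with base R's table(), and the caret call only runs if caret is installed.

```r
set.seed(22)
n <- 300

# Simulated predictors and outcome (hypothetical stand-in for the survey data)
toy <- data.frame(
  asthma  = factor(sample(c("No", "Yes"), n, replace = TRUE), levels = c("No", "Yes")),
  obesity = factor(sample(c("No", "Yes"), n, replace = TRUE), levels = c("No", "Yes"))
)
lin <- -1 + 1.5 * (toy$asthma == "Yes") + 1 * (toy$obesity == "Yes")
toy$shortness_breath <- factor(ifelse(runif(n) < plogis(lin), "Yes", "No"),
                               levels = c("No", "Yes"))

# 70/30 split, as in the question
idx <- sample(seq_len(n), floor(0.7 * n))
train_toy <- toy[idx, ]
valid_toy <- toy[-idx, ]

fit  <- glm(shortness_breath ~ ., data = train_toy, family = "binomial")
prob <- predict(fit, valid_toy, type = "response")   # P(outcome == "Yes")

# Turn probabilities into labels with the same levels as the true outcome
pred_label <- factor(ifelse(prob >= 0.5, "Yes", "No"),
                     levels = levels(valid_toy$shortness_breath))

# Base-R confusion matrix: rows are predictions, columns are true labels
cm <- table(Prediction = pred_label, Reference = valid_toy$shortness_breath)
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)

# Same comparison through caret, if it is available
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = pred_label,
                               reference = valid_toy$shortness_breath,
                               positive = "Yes"))
}
```

The 0.5 threshold is only a default choice; with an imbalanced outcome a different cut-off, for example one read off the ROC curve, may be more appropriate.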
