"Factors with the same levels" in Confusion Matrix

Question

I am trying to build a decision tree, but I get this error when I create the confusion matrix in the last line:

Error : `data` and `reference` should be factors with the same levels

Here is my code:

library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)

#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)

#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)

#making sure the data is in the right format 
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))

#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)

#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')

#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)

So I tried doing it this way, as suggested in another thread:

confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))

But I still get an error:

Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid

Answer 1

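Try to keep the factor levels of train and test the same as those of df. confusionMatrix() requires both of its arguments to be factors, and in your code `Customer type` is a character column.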

train$`Customer type` <- factor(train$`Customer type`, unique(df$`Customer type`))
test$`Customer type` <- factor(test$`Customer type`, unique(df$`Customer type`))
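
To illustrate why the conversion matters, here is a minimal sketch on made-up labels. The values "Member" and "Normal" and all variable names are hypothetical and not taken from the question's data. It assumes that confusionMatrix() from caret rejects a reference that is not a factor, which is what the error message suggests.

```r
library(caret)

# Hypothetical toy labels (not from the question's data)
truth_chr <- c("Member", "Normal", "Member", "Normal", "Member")
pred <- factor(c("Member", "Member", "Member", "Normal", "Member"),
               levels = c("Member", "Normal"))

# A character reference is expected to trigger the error from the question;
# try() captures it so the script keeps running
res <- try(confusionMatrix(pred, truth_chr), silent = TRUE)
print(inherits(res, "try-error"))

# Converting the reference to a factor with the same levels as the predictions
truth <- factor(truth_chr, levels = levels(pred))
print(identical(levels(pred), levels(truth)))

cm <- confusionMatrix(pred, truth)
print(cm$table)
```

Checking identical(levels(pred), levels(truth)) before calling confusionMatrix() is a cheap way to catch this kind of mismatch early.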

Answer 2

I made a toy data set and went through your code. There are a few issues:

R is easier to work with when variable names follow a certain style. Your "Customer type" variable contains a space, and coding is generally easier when you avoid spaces, so I renamed it to "Customer_type". For your data.frame, you can either rename the column in the source file or use names(df) <- gsub("Customer type", "Customer_type", names(df)).
I coded "Customer_type" as a factor. For you this would look like df$Customer_type <- factor(df$Customer_type).
The documentation for sample.split() says that its first argument, Y, should be a vector of labels, but your code passes the variable name as a string. The labels are the values of the outcome column, one per row. In my example they take the levels High, Med and Low, which you can list with levels(df$Customer_type). Pass the whole column, df$Customer_type, to sample.split() so that it returns one TRUE/FALSE value per row.
Adjust the rpart() call as shown below, so that the formula names the column and the data come from train.

With these adjustments, your code might be fine.

# toy data
df <- data.frame(City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = TRUE)),
                 Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = TRUE)),
                 Gender = factor(sample(c("Female", "Male"), 100, replace = TRUE)),
                 Quantity = sample(1:10, 100, replace = TRUE),
                 Total = sample(1:10, 100, replace = TRUE),
                 Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 100),
                 Rating = factor(sample(1:5, 100, replace = TRUE)))

library(rpart)
library(caret)
library(dplyr)
library(caTools)
library(data.tree)
library(e1071)

#Splitting into training and testing data
set.seed(123)
sample = sample.split(df$Customer_type, SplitRatio = .70) # PASS THE LABEL COLUMN ITSELF, ONE LABEL PER ROW
train = subset(df, sample == TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(Customer_type ~ ., data = train) # USE THE COLUMN NAME IN THE FORMULA, NOT df$...

#Predictions
tree.customertype.predicted <- predict(tree, test, type = 'class')

#Confusion matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$Customer_type)
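
As a rough check of the sample.split() point, the sketch below compares the mask returned for a single string with the mask returned for a full label column. The vector y and the mask names are hypothetical, and the sketch assumes that sample.split() returns a logical vector of the same length as its first argument.

```r
library(caTools)

set.seed(123)
# Hypothetical outcome column with 100 rows
y <- factor(sample(c("High", "Med", "Low"), 100, replace = TRUE))

# Passing the column name as a string: the "label vector" has length one
mask_name <- sample.split("Customer_type", SplitRatio = 0.7)

# Passing the label column itself: one TRUE/FALSE per row
mask_labels <- sample.split(y, SplitRatio = 0.7)

print(length(mask_name))
print(length(mask_labels))

# Class counts in the TRUE (train) and FALSE (test) parts
print(table(y, mask_labels))
```

A mask of length one cannot describe a row-wise split, so subset() ends up with either all rows or none in each part. That would explain why the test set in the question did not behave as intended.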


Answer 1
1 2021-02-26 03:54:52

Answer 2
1 Accepted 2021-02-26 12:53:07
