"Factors with the same levels" error in confusion matrix

Question

I'm trying to build a decision tree, but this error comes up when I create a confusion matrix in the last line:

Error : `data` and `reference` should be factors with the same levels

Here's my code:

library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)

#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)

#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)

#making sure the data is in the right format 
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))

#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)

#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')

#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)

So I tried this instead, as suggested in another topic:

confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))

But I still get an error:

Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid

Answer 1

Try to keep the factor levels of train and test the same as those in df.

train$`Customer type` <- factor(train$`Customer type`, unique(df$`Customer type`))
test$`Customer type` <- factor(test$`Customer type`, unique(df$`Customer type`))
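To see why this helps, here is a minimal sketch using only base R and made-up labels. The vectors `reference` and `predicted` and the values "Member" and "Normal" are illustrative assumptions, not taken from the question's data. It shows that a character vector and a factor do not share levels until both are converted with the same `levels` argument, which is the condition the error message describes.

```r
# Hypothetical labels, for illustration only
all_labels <- c("Member", "Normal")

# A character column, as produced by as.character() in the question
reference <- c("Member", "Normal", "Member", "Member")
# Predictions that happen to contain only one class
predicted <- factor(c("Member", "Member", "Member", "Member"))

is.factor(reference)                              # FALSE: not a factor at all
identical(levels(predicted), all_labels)          # FALSE: only one level present

# Give both vectors the same, explicit set of levels
reference <- factor(reference, levels = all_labels)
predicted <- factor(predicted, levels = all_labels)

identical(levels(predicted), levels(reference))   # TRUE
cm <- table(predicted, reference)                 # square 2 x 2 table
print(cm)
```

Passing an explicit `levels` vector also keeps classes that never occur in one of the two vectors, so the resulting table stays square.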

Answer 2

I made a toy data set and examined your code. There were a couple of issues:

R has an easier time with variable names that follow a certain style. Your 'Customer type' variable has a space in it. In general, coding is easier when you avoid spaces. So I renamed it 'Customer_type'. For your data.frame you could simply edit the source file, or use names(df) <- gsub("Customer type", "Customer_type", names(df)).
I coded 'Customer_type' as a factor. For you this will look like df$Customer_type <- factor(df$Customer_type).
The documentation for sample.split() says the first argument 'Y' should be a vector of labels, which means one label per observation. But in your code you gave the variable name as a string, so the function received a single label rather than the column. Pass the column itself, df$Customer_type. In my example its levels are High, Med and Low; to see the levels of your variable you could use levels(df$Customer_type).
Adjust the rpart() call as shown below. Use the bare column name in the formula so that the response is taken from train rather than from the full df.

With these adjustments, your code might be OK.

# toy data
df <- data.frame(City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = TRUE)),
                 Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = TRUE)),
                 Gender = factor(sample(c("Female", "Male"), 100, replace = TRUE)),
                 Quantity = sample(1:10, 100, replace = TRUE),
                 Total = sample(1:10, 100, replace = TRUE),
                 Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 100),
                 Rating = factor(sample(1:5, 100, replace = TRUE)))

library(rpart)
library(caret)
library(dplyr)
library(caTools)
library(data.tree)
library(e1071)

#Splitting into training and testing data
set.seed(123)
sample = sample.split(df$Customer_type, SplitRatio = .70) # ADJUST YOUR CODE TO PASS YOUR LABEL COLUMN (ONE LABEL PER ROW)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(Customer_type ~., data = train) # ADJUST YOUR CODE SO IT'S LIKE THIS

#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')

#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$Customer_type)
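As a rough illustration of why the first argument to sample.split() matters, the sketch below uses a base R stand-in that draws about 70% of the rows within each label. The helper `split_by_label` is hypothetical and is not the caTools implementation, and the toy labels are made up. The point is only that the returned logical vector needs one entry per row of the data.

```r
# Hypothetical base R stand-in for a stratified split; NOT caTools::sample.split()
split_by_label <- function(labels, ratio = 0.7) {
  labels <- as.character(labels)
  in_train <- logical(length(labels))
  for (lab in unique(labels)) {
    idx <- which(labels == lab)
    n_train <- round(length(idx) * ratio)
    # sample positions within idx to avoid sample()'s special case for length-one input
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(123)
labels <- factor(rep(c("High", "Med", "Low"), times = c(40, 30, 30)))

mask <- split_by_label(labels)
length(mask) == length(labels)   # TRUE: one TRUE/FALSE per row
table(labels[mask])              # roughly 70% of each label

# Passing only a column name gives a single label, so the mask has length 1
short_mask <- split_by_label("Customer type")
length(short_mask)               # 1
```

A length-one mask is recycled across all rows when used for subsetting, so every row ends up on the same side of the split and the other side is empty. That would be consistent with the errors reported in the question, although this has not been checked against the original data.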


Answer 1
1 2021-02-26 03:54:52

Answer 2
1 Accepted 2021-02-26 12:53:07
