I am not able to generate the confusion matrix of a one-class classification in R

I am trying to understand and implement one-class classification in R on a Kaggle dataset ( https://www.kaggle.com/uciml/breast-cancer-wisconsin-data ).

When I try to print a confusion matrix, I get this error:

Error in !all.equal(nrow(data), ncol(data)) : invalid argument type

What am I doing wrong?

library(caret)
library(dplyr)
library(e1071)
library(NLP)
library(tm)
library(data.table)

ds = read.csv('C:/Users/hugos/Desktop/FS Dataset/Health/data_cancer.csv', 
              header = TRUE)

mycols <- c("id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean",              
             "smoothness_mean","compactness_mean","concavity_mean",         
             "concave.points_mean","symmetry_mean","fractal_dimension_mean", 
             "radius_se","texture_se","perimeter_se",           
             "area_se","smoothness_se","compactness_se",         
             "concavity_se","concave.points_se","symmetry_se",            
             "fractal_dimension_se","radius_worst","texture_worst",          
             "perimeter_worst","area_worst","smoothness_worst",       
             "compactness_worst","concavity_worst","concave.points_worst",   
             "symmetry_worst","fractal_dimension_worst")

#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]

#Convert classification to logical
data <- ds[,.(id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave.points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave.points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave.points_worst,symmetry_worst,fractal_dimension_worst,diagnosis = ds$diagnosis == "TRUE")]

dataclean <- na.omit(data)

#Separating train and test
inTrain<-createDataPartition(1:nrow(dataclean),p=0.7,list=FALSE)
train<- dataclean[inTrain]
test <- dataclean[-inTrain]


svm.model<-svm(diagnosis ~ id+radius_mean+texture_mean+perimeter_mean+area_mean+smoothness_mean+compactness_mean+concavity_mean+concave.points_mean+symmetry_mean+fractal_dimension_mean+radius_se+texture_se+perimeter_se+area_se+smoothness_se+compactness_se+concavity_se+concave.points_se+symmetry_se+fractal_dimension_se+radius_worst+texture_worst+perimeter_worst+area_worst+smoothness_worst+compactness_worst+concavity_worst+concave.points_worst+symmetry_worst+fractal_dimension_worst, data = train,
               type='one-classification',
               trControl = fitControl,
               nu=0.10,
               scale=TRUE,
               kernel="radial",
               metric = "ROC")

#Perform predictions 
svm.predtrain<-predict(svm.model,train)
svm.predtest<-predict(svm.model,test)

confTrain <- table(Predicted=svm.predtrain,
                   Reference=train$diagnosis[as.integer(names(svm.predtrain))])
confTest <- table(Predicted=svm.predtest,
                  Reference=test$diagnosis[as.integer(names(svm.predtest))])

confusionMatrix(confTest,positive='TRUE')

print(confTrain)
print(confTest)

Your problem is on this line:

#Convert classification to logical
data <- ds[, .(id, radius_mean, ..., diagnosis = ds$diagnosis == "TRUE")]

I'm assuming you are using R version 4.0 or later, where read.csv no longer converts character columns to factors by default (stringsAsFactors now defaults to FALSE). This command:

#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]

will then convert every diagnosis to NA, because the values are the strings "M" and "B" (malignant and benign, respectively), which cannot be parsed as numbers.
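The coercion can be reproduced in isolation. This is a minimal sketch on a made-up four-element vector standing in for the diagnosis column (the names diagnosis_chr, diagnosis_fct, as_num_chr and as_num_fct are illustrative, not from the original code):

```r
# Toy stand-in for the diagnosis column (values assumed for illustration)
diagnosis_chr <- c("M", "B", "M", "B")

# Character input: the letters cannot be parsed as numbers, so every value
# becomes NA (R also warns "NAs introduced by coercion")
as_num_chr <- suppressWarnings(as.numeric(diagnosis_chr))
print(as_num_chr)

# Factor input: as.numeric() returns the integer level codes instead
diagnosis_fct <- factor(diagnosis_chr)
print(levels(diagnosis_fct))
as_num_fct <- as.numeric(diagnosis_fct)
print(as_num_fct)
```

Factor levels are sorted alphabetically by default, so "B" gets code 1 and "M" gets code 2.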

So, make sure that you convert strings to factors when importing the data.

ds = read.csv('.../data_cancer.csv', header = TRUE, stringsAsFactors = TRUE)
str(ds)
'data.frame':   569 obs. of  33 variables:
 $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 ...
 $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
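To see the effect of the argument without the real file, here is a small sketch that reads the same in-memory CSV text twice. The three-row csv_text and the names as_chr and as_fct are made up for illustration; stringsAsFactors is passed explicitly so the result does not depend on the R version:

```r
# Tiny in-memory CSV standing in for the real file (contents assumed)
csv_text <- "id,diagnosis\n1,M\n2,B\n3,M"

# Same data read with both settings of stringsAsFactors
as_chr <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

print(class(as_chr$diagnosis))   # character column
print(class(as_fct$diagnosis))   # factor column
print(levels(as_fct$diagnosis))  # level order determines the numeric codes
```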

I guess it will take some people a while to get used to this new behaviour of R. With diagnosis read as a factor, the as.numeric conversion yields the level codes (1 for "B", 2 for "M") instead of NA, so your command to convert the classification to logical should then be:

data <- ds[, .(id, radius_mean, ..., diagnosis = diagnosis == 2)] # 2 is "M" (malignant); use == 1 to flag "B" (benign) instead

With that change, all your remaining commands work.

confusionMatrix(confTest, positive='TRUE')

Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE    10    8  # Note these numbers may change
    TRUE    100   50

               Accuracy : 0.3571          
                 95% CI : (0.2848, 0.4346)
    No Information Rate : 0.6548          
    P-Value [Acc > NIR] : 1               

                  Kappa : -0.0342         

 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.86207         
            Specificity : 0.09091         
         Pos Pred Value : 0.33333         
         Neg Pred Value : 0.55556         
             Prevalence : 0.34524         
         Detection Rate : 0.29762         
   Detection Prevalence : 0.89286         
      Balanced Accuracy : 0.47649         

       'Positive' Class : TRUE
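
If the error from the question reappears, one thing worth checking is whether the table is square. The error message shows confusionMatrix comparing nrow(data) with ncol(data), and table() only lists values that actually occur, so a one-class model that predicts TRUE for every row yields a one-row table. A minimal sketch with made-up prediction and reference vectors (pred, ref, lv, conf_raw and conf_sq are illustrative names, not model output) that forces both dimensions to carry both levels:

```r
# Made-up vectors: the model predicts TRUE for every observation
pred <- c(TRUE, TRUE, TRUE, TRUE)
ref  <- c(TRUE, FALSE, TRUE, FALSE)

# table() keeps only observed values, so this table has 1 row and 2 columns
conf_raw <- table(Predicted = pred, Reference = ref)
print(dim(conf_raw))

# Declaring both levels on each side guarantees a 2 x 2 table
lv <- c("FALSE", "TRUE")
conf_sq <- table(Predicted = factor(pred, levels = lv),
                 Reference = factor(ref, levels = lv))
print(dim(conf_sq))
print(conf_sq)
```

A table built this way can be passed to confusionMatrix(conf_sq, positive = 'TRUE') in the same way as confTest above.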
