Unable to generate the confusion matrix for a one-class classification in R
I am trying to understand and implement one-class classification in R on a Kaggle dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).
When I try to print a confusion matrix, I get this error:
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type
What am I doing wrong?
library(caret)
library(dplyr)
library(e1071)
library(NLP)
library(tm)
library(data.table)
ds = read.csv('C:/Users/hugos/Desktop/FS Dataset/Health/data_cancer.csv',
header = TRUE)
mycols <- c("id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean",
"smoothness_mean","compactness_mean","concavity_mean",
"concave.points_mean","symmetry_mean","fractal_dimension_mean",
"radius_se","texture_se","perimeter_se",
"area_se","smoothness_se","compactness_se",
"concavity_se","concave.points_se","symmetry_se",
"fractal_dimension_se","radius_worst","texture_worst",
"perimeter_worst","area_worst","smoothness_worst",
"compactness_worst","concavity_worst","concave.points_worst",
"symmetry_worst","fractal_dimension_worst")
#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]
#Convert classification to logical
data <- ds[,.(id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave.points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave.points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave.points_worst,symmetry_worst,fractal_dimension_worst,diagnosis = ds$diagnosis == "TRUE")]
dataclean <- na.omit(data)
#Separating train and test
inTrain<-createDataPartition(1:nrow(dataclean),p=0.7,list=FALSE)
train<- dataclean[inTrain]
test <- dataclean[-inTrain]
svm.model<-svm(diagnosis ~ id+radius_mean+texture_mean+perimeter_mean+area_mean+smoothness_mean+compactness_mean+concavity_mean+concave.points_mean+symmetry_mean+fractal_dimension_mean+radius_se+texture_se+perimeter_se+area_se+smoothness_se+compactness_se+concavity_se+concave.points_se+symmetry_se+fractal_dimension_se+radius_worst+texture_worst+perimeter_worst+area_worst+smoothness_worst+compactness_worst+concavity_worst+concave.points_worst+symmetry_worst+fractal_dimension_worst, data = train,
type='one-classification',
trControl = fitControl,
nu=0.10,
scale=TRUE,
kernel="radial",
metric = "ROC")
#Perform predictions
svm.predtrain<-predict(svm.model,train)
svm.predtest<-predict(svm.model,test)
confTrain <- table(Predicted=svm.predtrain,
Reference=train$diagnosis[as.integer(names(svm.predtrain))])
confTest <- table(Predicted=svm.predtest,
Reference=test$diagnosis[as.integer(names(svm.predtest))])
confusionMatrix(confTest,positive='TRUE')
print(confTrain)
print(confTest)
Your problem is on this line:
#Convert classification to logical
data <- ds[, .(id, radius_mean, ..., diagnosis = ds$diagnosis == "TRUE")]
I'm assuming you are using R version 4.0 or later, where the read.csv function no longer converts character columns into factors by default.
This command:
#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]
will then convert all diagnoses to NA, since they are either "M" or "B", representing malignant and benign, respectively.
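The difference between character and factor input can be seen on a toy vector. This is a minimal sketch in base R; the vector `diag_chr` is made up for illustration and is not taken from the dataset.

```r
# Hypothetical toy vector standing in for the diagnosis column
diag_chr <- c("M", "B", "B", "M")

# Character input: "M" and "B" cannot be parsed as numbers, so every value
# becomes NA (R also emits a coercion warning, silenced here)
num_from_chr <- suppressWarnings(as.numeric(diag_chr))

# Factor input: as.numeric() returns the integer level codes instead
diag_fct <- factor(diag_chr)          # levels sorted alphabetically: "B", "M"
num_from_fct <- as.numeric(diag_fct)  # "B" -> 1, "M" -> 2

print(num_from_chr)
print(num_from_fct)
```
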
So, make sure that you convert strings to factors when importing the data.
ds = read.csv('.../data_cancer.csv', header = TRUE, stringsAsFactors = TRUE)
str(ds)
'data.frame': 569 obs. of 33 variables:
$ id : int 842302 842517 84300903 84348301 84358402 843786 844359 ...
$ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
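To check how the column is read without touching the real file, a small inline CSV can be used. This is only a sketch: the three-row `csv_text` is invented for illustration, and `stringsAsFactors` is set explicitly in both calls so the result does not depend on the R version.

```r
# Hypothetical three-row CSV, not the real Kaggle file
csv_text <- "id,diagnosis,radius_mean
1,M,17.99
2,B,13.54
3,M,20.57"

# Same data read twice, once per setting
as_chr <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

print(class(as_chr$diagnosis))   # character column
print(class(as_fct$diagnosis))   # factor column
print(levels(as_fct$diagnosis))  # factor levels, sorted alphabetically

# A possible alternative that avoids level codes altogether: compare against
# the label itself, before any numeric conversion of the column
is_malignant <- as_fct$diagnosis == "M"
print(is_malignant)
```

Comparing against the label works for both the character and the factor version of the column, as long as it is done before the column is converted to numeric.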
It will probably take some people a while to get used to this new behaviour of R. With diagnosis read as a factor, as.numeric returns its integer level codes (1 for "B", 2 for "M"), so your command to convert the classification to logical should then be:
data <- ds[, .(id, radius_mean, ..., diagnosis = diagnosis == 2)] # 2 = "M" (malignant); use == 1 to flag "B" instead
This then makes all your remaining commands work.
confusionMatrix(confTest, positive='TRUE')
Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE    10    8   # Note these numbers may change
    TRUE    100   50

               Accuracy : 0.3571
                 95% CI : (0.2848, 0.4346)
    No Information Rate : 0.6548
    P-Value [Acc > NIR] : 1

                  Kappa : -0.0342

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.86207
            Specificity : 0.09091
         Pos Pred Value : 0.33333
         Neg Pred Value : 0.55556
             Prevalence : 0.34524
         Detection Rate : 0.29762
   Detection Prevalence : 0.89286
      Balanced Accuracy : 0.47649

       'Positive' Class : TRUE
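The original error message itself appears to come from a squareness check on the table: when one of the two vectors contains only a single value, table() produces a non-square result, all.equal() returns a character description of the mismatch, and negating that string fails. This is an inference from the message text, not a confirmed reading of caret's source. The sketch below reproduces the mechanism in base R with made-up vectors `pred` and `ref`, and shows one way to force a square table by giving both sides the same factor levels.

```r
# Hypothetical predictions and reference labels; the reference has one class only
pred <- c(TRUE, FALSE, TRUE, TRUE)
ref  <- c(FALSE, FALSE, FALSE, FALSE)

# Two predicted values but one reference value give a non-square table
tab <- table(Predicted = pred, Reference = ref)
print(dim(tab))

# all.equal() returns a character string when the two values differ ...
cmp <- all.equal(nrow(tab), ncol(tab))
print(is.character(cmp))

# ... and applying ! to a character string raises an error
err <- tryCatch(!cmp, error = function(e) conditionMessage(e))
print(err)

# Giving both vectors the same explicit levels keeps the table square
lv <- c(FALSE, TRUE)
tab_sq <- table(Predicted = factor(pred, levels = lv),
                Reference = factor(ref, levels = lv))
print(dim(tab_sq))
```

A square table only removes the error; if the reference column really contains a single class, the underlying data conversion still needs fixing as described above.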