Issue with Dictionary() function in R

I have been following an example of Bayesian classifiers from Lantz's book "Machine Learning with R". The example is a spam classifier that uses the data at the following link:

http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

In the code I have a problem in this part:

sms_train <- DocumentTermMatrix(sms_corpus_train, control = list(dictionary = sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_test, control = list(dictionary = sms_dict))

because the book says that I should use the following instruction:

sms_dict <- Dictionary(findFreqTerms(sms_dtm_train, 5))

The problem is that the Dictionary() function has been deprecated in newer versions of tm. What should I do to accomplish what the book says:

A dictionary is a data structure allowing us to specify which words should appear in a document term matrix. To limit our training and test matrixes to only the words in the preceding dictionary, use the following command

I have done the following:

sms_dict <- findFreqTerms(sms_dtm_train, 5)
sms_train <- DocumentTermMatrix(sms_corpus_train, control = list(dictionary = sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_test, control = list(dictionary = sms_dict))

But I am sure that I am not limiting the test matrices as it says in the book. Even though the code runs, it does not give me the right results. What can I modify in this case?
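One way to check whether the restriction actually takes effect is to compare the column terms of the resulting matrices against the dictionary. The following is a minimal sketch on a toy corpus, not the SMS data: the messages and the names train_text, test_text, train_limited and test_limited are made up for illustration, and it assumes a tm version in which the dictionary control option accepts a plain character vector.

```r
library(tm)

# Toy messages standing in for the SMS data (hypothetical, not from the data set)
train_text <- c("free prize call now", "call me later today", "free entry prize draw")
test_text  <- c("win a free holiday", "see you later")

train_corpus <- VCorpus(VectorSource(train_text))
test_corpus  <- VCorpus(VectorSource(test_text))

# Terms that occur at least twice across the training documents
train_dtm <- DocumentTermMatrix(train_corpus)
dict <- findFreqTerms(train_dtm, 2)

# Pass the character vector directly as the dictionary, without Dictionary()
train_limited <- DocumentTermMatrix(train_corpus, control = list(dictionary = dict))
test_limited  <- DocumentTermMatrix(test_corpus,  control = list(dictionary = dict))

# Both checks should hold if the matrices are limited to the dictionary terms
print(all(Terms(test_limited) %in% dict))
print(identical(Terms(train_limited), Terms(test_limited)))
```

If both checks hold on the real sms_train and sms_test matrices as well, the dictionary step is probably not the source of the unexpected results.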

The complete code for tracking purposes is the following:

# Load the raw SMS data
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
install.packages("tm")
library(tm)

# Build the corpus and clean it
sms_corpus <- Corpus(VectorSource(sms_raw$text))
corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
sms_dtm <- DocumentTermMatrix(corpus_clean)

# Split the raw data, the matrix and the corpus into training and test parts
sms_raw_train <- sms_raw[1:4169, ]
sms_raw_test <- sms_raw[4170:5559, ]
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
sms_corpus_train <- corpus_clean[1:4169]
sms_corpus_test <- corpus_clean[4170:5559]

# Keep only terms that occur at least 5 times in the training matrix
sms_dict <- findFreqTerms(sms_dtm_train, 5)
sms_train <- DocumentTermMatrix(sms_corpus_train, control = list(dictionary = sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_test, control = list(dictionary = sms_dict))

# Turn term counts into "No"/"Yes" indicators
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
sms_train <- apply(sms_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_test, MARGIN = 2, convert_counts)

# Train the naive Bayes classifier and predict on the test set
library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_raw_train$type)
sms_test_pred <- predict(sms_classifier, sms_test)

# Compare predicted and actual labels
install.packages("gmodels")
library(gmodels)
CrossTable(sms_test_pred, sms_raw_test$type, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
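The convert_counts step in the code above can be tried on a small matrix to see what apply() hands to the classifier. This is a minimal sketch in base R; the matrix counts and its term names are made up for illustration. Because apply() coerces factor results with as.vector before setting dimensions, the result here should be a character matrix of "No"/"Yes" values rather than a matrix of factors.

```r
# Same conversion as in the code above: counts become "No"/"Yes" indicators
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

# Hypothetical count matrix: 3 documents (rows) by 2 terms (columns)
counts <- matrix(c(0, 2, 1,
                   3, 0, 0),
                 nrow = 3,
                 dimnames = list(Docs = c("1", "2", "3"),
                                 Terms = c("call", "free")))

# Apply the conversion column by column, as done for sms_train and sms_test
indicators <- apply(counts, MARGIN = 2, convert_counts)

print(indicators)
print(is.character(indicators))
```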

Thanks

I had the same problem and solved it by doing this:

CrossTable(sms_test_pred[["class"]], sms_raw_test$Type, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted','actual'))
