
Error creating TermDocumentMatrix in tm package

I am new to the tm package and have run into an obstacle when trying to apply the TermDocumentMatrix function.

I have used the following code up to the point where the function fails:

myCorpus <- Corpus(VectorSource(posts$message))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

myCorpus <- tm_map(myCorpus, removeURL)

myStopwords <- c(stopwords("english"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

myCorpusCopy <- myCorpus 
myCorpus <- tm_map(myCorpus, stemDocument)

Upon inspection, the list of documents looks as it should:

> for(i in 1:5) {
+   cat(paste("[[", i, "]] ", sep =""))
+   writeLines(myCorpus[[i]])
+ }
[[1]] syntel recruitment drive   week  freshers  newregistrationlink    passout graduates
qualification   graduatebebtechmcamemtech
syntel registration link  
limited referrals available 
comment  emailids  reference  future job upd
[[2]] dont miss  opportunity   get placed  one   best mnc companies   world ebay freshers  week  january 
qualification   graduate can apply
ebay registration link  
comment  emailids fast beacuse    referrals left
[[3]] recent passouts      eligible  apply  wipro  go   updated link  lastday reference drive jan  apply link  fresher referral
apply link 
go   link  apply asap
[[4]] robertbosch recruitment drive   week  freshers  newregistrationlink    passout graduates
qualification   graduatebebtechmcamemtech
robertbosch registration link  
limited referrals available 
comment  emailids  reference  future job upd
[[5]] mega job openings   year
mphasis recruitment  freshers january 
qualification   btech bsc bca  graduates mca mba  mtech post graduates
mphasis registration link  
comment  emailids  comment box  reference  future job updates   emailbox    

However, after creating a copy of the corpus for stem completion, the problem arises:

myCorpus <- tm_map(myCorpus, stemCompletion,
                   dictionary = myCorpusCopy, lazy = TRUE)
> tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
Error in UseMethod("meta", x) : 
  no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning messages:
1: In mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) :
  all scheduled cores encountered errors in user code
2: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code

Any ideas for a workaround?

I think you have to rebuild the corpus by calling

myCorpus <- Corpus(VectorSource(myCorpus))

before using TermDocumentMatrix. Your final piece of code will be:

myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

If no error occurred up to the stemming of the documents, the previous instructions will solve your problem.

Otherwise, you might first try:

myCorpus <- tm_map(myCorpus, PlainTextDocument)

before you use

myCorpus <- Corpus(VectorSource(myCorpus))
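
For comparison, here is a minimal sketch of a different approach that is sometimes used for this kind of error. It is not the method given above. It rests on the assumption that the failure comes from tm_map receiving functions that return plain character vectors instead of documents (stemCompletion takes a character vector of stems, and a bare gsub wrapper returns characters), so every custom step is wrapped in content_transformer. The toy messages and the helper name completeStems are hypothetical, and the example assumes the SnowballC package is installed for stemDocument.

```r
library(tm)         # corpus handling and TermDocumentMatrix
library(SnowballC)  # stemmer backend used by stemDocument()

# Toy data standing in for posts$message (hypothetical).
messages <- c("Recruitment drive for freshers",
              "Registration link http://example.com/apply for graduates",
              "Limited referrals available")

corpus <- VCorpus(VectorSource(messages))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Custom text functions are wrapped so that tm_map gets documents back.
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))

corpusCopy <- corpus                    # unstemmed copy, used as dictionary
corpus <- tm_map(corpus, stemDocument)

# Hypothetical helper: complete each stem separately and fall back to the
# stem itself when stemCompletion() finds no completion.
completeStems <- function(text, dictionary) {
  stems <- unlist(strsplit(text, "[[:space:]]+"))
  stems <- stems[nzchar(stems)]
  completed <- stemCompletion(stems, dictionary = dictionary)
  missing <- is.na(completed) | !nzchar(completed)
  completed[missing] <- stems[missing]
  paste(completed, collapse = " ")
}

corpus <- tm_map(corpus, content_transformer(completeStems),
                 dictionary = corpusCopy)

tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
```

Because each step returns a document, the corpus does not need to be rebuilt before TermDocumentMatrix is called. On a large corpus it may be worth extracting the dictionary words once instead of passing the whole corpus copy for every document.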

