TermDocumentMatrix error in R

Question

I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create the matrix. The error is:

Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code

For example, here is code from Jon Starkweather's text mining example. Apologies in advance for such long code, but it does produce a reproducible example. Please note that the error occurs at the very end, when tdm is created with TermDocumentMatrix.

#Read in data
policy.HTML.page <- readLines("http://policy.unt.edu/policy/3-5")

#Obtain text and remove mark-up
policy.HTML.page[186:202]
id.1 <- 3 + which(policy.HTML.page == "                    TOTAL UNIVERSITY        </div>")
id.2 <- id.1 + 5
text.data <- policy.HTML.page[id.1:id.2]
td.1 <- gsub(pattern = "<p>", replacement = "", x = text.data, 
     ignore.case = TRUE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

td.2 <- gsub(pattern = "</p>", replacement = "", x = td.1, ignore.case = TRUE,
     perl = FALSE, fixed = FALSE, useBytes = FALSE)

text.d <- td.2; rm(text.data, td.1, td.2)

#Create corpus and clean 
library(tm)
library(SnowballC)
txt <- VectorSource(text.d); rm(text.d)
txt.corpus <- Corpus(txt)
txt.corpus <- tm_map(txt.corpus, tolower)
txt.corpus <- tm_map(txt.corpus, removeNumbers)
txt.corpus <- tm_map(txt.corpus, removePunctuation)
txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("english"))
txt.corpus <- tm_map(txt.corpus, stripWhitespace); #inspect(docs[1])
txt.corpus <- tm_map(txt.corpus, stemDocument)

# NOTE ERROR WHEN CREATING TDM
tdm <- TermDocumentMatrix(txt.corpus)

Answer 1

The link provided by jazzurro points to the solution. The following line of code

 txt.corpus <- tm_map(txt.corpus, tolower)

must be changed to

 txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
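As a minimal sketch of where this change sits in the cleaning pipeline, the example below builds a tiny corpus and creates the matrix. The `docs` vector is made-up sample text standing in for the scraped policy paragraphs, and `VCorpus` is used here (an assumption, not part of the original code) so that the document class is explicit.

```r
library(tm)

# Hypothetical sample text; the original question scrapes its text from a web page.
docs <- c("The University SHALL review 3 policies.",
          "Policies are reviewed every 5 years by the university.")

corpus <- VCorpus(VectorSource(docs))

# Base R functions such as tolower return bare character vectors,
# so wrap them in content_transformer() to keep the document class.
corpus <- tm_map(corpus, content_transformer(tolower))

# Transformations shipped with tm can be passed to tm_map directly.
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

Only functions that are not tm transformations need the wrapper; `removeNumbers`, `removePunctuation` and the others above already return documents.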

Answer 2

There are two possible causes of this issue in tm v0.6.

1. If you apply term-level transformations such as tolower, tm_map returns a character vector instead of a PlainTextDocument.
   Solution: call tolower through content_transformer, or call tm_map(corpus, PlainTextDocument) immediately after tolower.
2. The error can also occur if you try to stem the documents while the SnowballC package is not installed.
   Solution: install.packages('SnowballC')
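Both causes can be checked in a few lines. The sketch below is illustrative only: the sample sentence and the `is_doc` helper variable are assumptions, and it uses the content_transformer variant of the first solution. It installs SnowballC only when it is missing, runs the two transformations, and then confirms that every document is still a PlainTextDocument rather than a bare character vector.

```r
library(tm)

# stemDocument() relies on SnowballC; install it only if it is missing.
if (!requireNamespace("SnowballC", quietly = TRUE)) {
  install.packages("SnowballC")
}

# Hypothetical one-document corpus for the check.
corpus <- VCorpus(VectorSource(c("Running RUNS quickly")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)

# TRUE for each document that kept its PlainTextDocument class.
is_doc <- vapply(seq_along(corpus),
                 function(i) inherits(corpus[[i]], "PlainTextDocument"),
                 logical(1))
print(is_doc)
```

If any entry of `is_doc` is FALSE, a transformation has stripped the document class, and TermDocumentMatrix is likely to fail with the 'meta' error shown in the question.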

Answer 3

There is no need to apply content_transformer.

Create the corpus in this way:

trainData_corpus <- Corpus(VectorSource(trainData$Comments))

Try it.

Answer 1: score 27, accepted, posted 2014-08-28 15:05:15
Answer 2: score 5, posted 2015-04-16 16:25:52
Answer 3: score 1, posted 2017-04-17 05:31:51
