Problems with the tm package in R

Question

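I have been trying to follow a Udemy tutorial on text mining tweets with the tm package in R.

So far, many of the functions described in the tutorial (and in the tm PDF on cran.org) have produced a series of errors, and it is not clear to me how to resolve them. I am coding in RStudio version 1.0.143 on macOS Sierra. The code and errors below come from my attempt to make a wordcloud from a set of tweets: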

nyttweets <- searchTwitter("#NYT", n=1000)
nytlist <- sapply(nyttweets, function(x) x$getText())
nytcorpus <- Corpus(VectorSource(nytlist))

This is the first error I ran into:

nytcorpus <- tm_map(nytcorpus, tolower)
**Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code**

I saw a suggestion to do the following, which led to a different error:

nytcorpus <- tm_map(nytcorpus, tolower, mc.cores=1)
**Error in FUN(X[[1L]], ...) : invalid multibyte string 1**

If I instead pass lazy=TRUE to tolower and the subsequent functions, I get no error message. However, when I finally try to build the wordcloud, I get a lot of errors:

library("twitteR")
library('wordcloud')
library('SnowballC')
library('tm')
nytcorpus <- tm_map(nytcorpus, tolower, lazy=TRUE)
nytcorpus <- tm_map(nytcorpus, removePunctuation, lazy=TRUE)
nytcorpus <- tm_map(nytcorpus, function(x) removeWords(x, stopwords()), 
lazy=TRUE)
nytcorpus <- tm_map(nytcorpus, PlainTextDocument)
wordcloud(nytcorpus, min.freq=4, scale=c(5,1), random.color=F, max.word=45, 
random.order=F)
**Warning messages:
1: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
'removewords' could not be fit on page. It will not be plotted.
2: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
"try-error" could not be fit on page. It will not be plotted.
3: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
applicable could not be fit on page. It will not be plotted.
4: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
object could not be fit on page. It will not be plotted.
5: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
usemethod("removewords", could not be fit on page. It will not be plotted.**

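I am not sure why wordcloud is trying to plot actual function words such as "removewords" or "try-error" rather than words from the NYT tweets. I have also seen suggestions to wrap the function in content_transformer, for example: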

nytcorpus <- tm_map(nytcorpus, content_transformer(tolower))

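However, I again get the error "all scheduled cores encountered errors in user code".

This is all very frustrating, and I am not sure whether I should scrap the tm package altogether, particularly if there is something better out there. Any advice would be appreciated.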

Answer 1

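tm has recently been trying to improve its speed, and there appears to have been a major overhaul involving Rcpp, which the package was not originally built with. Perhaps the tutorial you are following was based on an older version of tm, which may be one reason you are running into problems.

I would give quanteda a try.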
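Separately from the version issue, the "invalid multibyte string" error in the question is the kind of error base R's tolower() raises when a string contains bytes that are not valid in the current encoding, which is plausible for tweets containing emoji or mangled characters. Below is a minimal sketch of one workaround, assuming it is acceptable to drop every non-ASCII character. clean_tweets is a hypothetical helper name, not part of tm or twitteR, and the sample strings stand in for real tweet text.

```r
# Hypothetical helper: strip bytes that cannot be represented in ASCII,
# then lower-case. This discards emoji and accented characters, which is
# an assumption about what is acceptable for a simple word cloud.
clean_tweets <- function(x) {
  # sub = "" replaces every non-convertible byte with nothing
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  tolower(x)
}

# Stand-in for nytlist; the second string contains a byte that is not valid UTF-8
sample_tweets <- c("Hello NYT", "Caf\xe9 News")
cleaned <- clean_tweets(sample_tweets)
print(cleaned)
```

The cleaned character vector could then be passed to Corpus(VectorSource(...)) as in the question, so that later transformations no longer see invalid bytes.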

http://quanteda.io/

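The main reason is that it is orders of magnitude faster (though, as noted above, this may have changed recently). quanteda is built on top of stringi and data.table, which are highly optimized in C++ and C. Essentially, quanteda builds on some of the fastest R code available to date. In my experience it is also more stable, which comes down to the maturity of the packages it depends on.

As you will soon discover, speed really matters when building and analyzing document-term matrices, especially if you create n-grams of various lengths. So it is best to go with the fastest package you can find.

Justin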
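For a sense of what the equivalent preprocessing might look like, here is a rough sketch using quanteda, assuming a reasonably recent version is installed. The tweets vector is made-up sample text standing in for nytlist from the question, and the choice of unigrams plus bigrams is only for illustration.

```r
library(quanteda)

# Made-up sample text standing in for the tweet texts (nytlist in the question)
tweets <- c("The NYT reports on the election",
            "Breaking news from the NYT today",
            "NYT election coverage continues")

# Tokenize, drop punctuation, lower-case, and remove English stopwords
toks <- tokens(tweets, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Build unigrams and bigrams, then a document-feature matrix
# (quanteda's name for a document-term matrix)
toks_ng <- tokens_ngrams(toks, n = 1:2)
dtm <- dfm(toks_ng)

# Most frequent features across all documents
print(topfeatures(dtm, 10))
```

The frequencies returned by topfeatures() are a named numeric vector, so they could be handed to wordcloud() as words and freq if you want to keep using that package for plotting.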
