
Problems with R's tm package

I've been trying to follow along with a Udemy tutorial, using the tm package in R to do text mining on tweets.

So far, many of the functions specified in the tutorial (and in the tm PDF on CRAN) result in a series of errors, and I'm unclear how to resolve them. I'm coding in RStudio Version 1.0.143 on macOS Sierra. The code and errors below are from my attempt to make a wordcloud from a series of tweets:

nyttweets <- searchTwitter("#NYT", n=1000)
nytlist <- sapply(nyttweets, function(x) x$getText())
nytcorpus <- Corpus(VectorSource(nytlist))

Here's where I encounter the first error

nytcorpus <- tm_map(nytcorpus, tolower)
**Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code**

I saw the suggestion to do the following, which results in another error

nytcorpus <- tm_map(nytcorpus, tolower, mc.cores=1)
**Error in FUN(X[[1L]], ...) : invalid multibyte string 1**

If I instead use 'lazy=TRUE' after tolower and in the subsequent functions I run, I don't receive an error. However, when I finally try to construct the wordcloud, I run into a large number of warnings:

library("twitteR")
library('wordcloud')
library('SnowballC')
library('tm')
nytcorpus <- tm_map(nytcorpus, tolower, lazy=TRUE)
nytcorpus <- tm_map(nytcorpus, removePunctuation, lazy=TRUE)
nytcorpus <- tm_map(nytcorpus, function(x) removeWords(x, stopwords()), 
lazy=TRUE)
nytcorpus <- tm_map(nytcorpus, PlainTextDocument)
wordcloud(nytcorpus, min.freq=4, scale=c(5,1), random.color=F, max.word=45, 
random.order=F)
**Warning messages:
1: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
'removewords' could not be fit on page. It will not be plotted.
2: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
"try-error" could not be fit on page. It will not be plotted.
3: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
applicable could not be fit on page. It will not be plotted.
4: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
object could not be fit on page. It will not be plotted.
5: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
usemethod("removewords", could not be fit on page. It will not be plotted.**

I'm not sure why the wordcloud function is trying to plot the names of the functions themselves, like 'removewords' or 'try-error', rather than words from the NYT tweets. I've seen suggestions to wrap the functions in content_transformer, for example

nytcorpus <- tm_map(nytcorpus, content_transformer(tolower))

However, I again just get the warning 'all scheduled cores encountered errors in user code'.

This is all exceedingly frustrating, and I'm not sure if I should scrap using the tm package altogether, especially if there's something better out there. Any suggestions are greatly appreciated.

tm has recently been trying to improve its speed, and this appears to involve a major overhaul using Rcpp, which the package was not originally built with. Perhaps the tutorial you're viewing is based on an older version of tm, which may be part of why you're running into problems.
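If you would rather stay with tm for now, here is a minimal sketch of what I would try first. It assumes the "invalid multibyte string" error comes from emoji or other non-ASCII bytes in the tweets (I have not checked your data), and that you are on a tm version where base functions such as tolower need to be wrapped in content_transformer. The tweets vector is a made-up stand-in for your nytlist.

```r
library(tm)

# Hypothetical stand-in for nytlist; the first tweet ends with an emoji
tweets <- c("Breaking NEWS from the #NYT \U0001F600",
            "Read THE latest on politics, and more!")

# Assumption: non-ASCII characters are what trips up tolower().
# sub = "" replaces anything that cannot be converted to ASCII with nothing.
clean <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

nytcorpus <- VCorpus(VectorSource(clean))

# tolower is a base R function, so wrap it to keep the documents intact
nytcorpus <- tm_map(nytcorpus, content_transformer(tolower))

# removePunctuation and removeWords are tm transformations, no wrapper needed
nytcorpus <- tm_map(nytcorpus, removePunctuation)
nytcorpus <- tm_map(nytcorpus, removeWords, stopwords("english"))
```

Dropping every non-ASCII character is a blunt choice, since it also strips accented letters, but it is a quick way to find out whether encoding is the real cause of the errors.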

I would give quanteda a try.

http://quanteda.io/

The main reason is that it is faster by orders of magnitude (although this may have changed recently, as mentioned above). Quanteda is built on packages that have been highly optimized in C++ and C. Essentially, it leverages the work of some of the fastest R programming available to date. In my experience, it's also more stable, which makes sense given the maturity of the packages it depends on.

As you will soon discover, speed really matters when constructing and analyzing document-term matrices, especially if you are creating n-grams of various lengths. So it's best to work with the fastest packages you can find.
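To give you a feel for it, here is a rough sketch of building a document-feature matrix with unigrams and bigrams in quanteda. The tweets vector is invented for illustration, and the cleaning steps are just my guess at what you want, mirroring the lowercase, punctuation and stopword steps from your tm code.

```r
library(quanteda)

# Hypothetical stand-in for the tweet text
tweets <- c("Breaking news from the New York Times",
            "The New York Times reports breaking news on politics")

# Tokenize, dropping punctuation, then lowercase and remove English stopwords
toks <- tokens(tweets, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Build unigrams and bigrams in one pass (bigrams are joined with "_")
toks_ng <- tokens_ngrams(toks, n = 1:2)

# Document-feature matrix: one row per tweet, one column per n-gram
dtm <- dfm(toks_ng)

# Most frequent features across all tweets
topfeatures(dtm, 10)
```

For the wordcloud itself, recent quanteda versions keep the plotting functions in the separate quanteda.textplots package, so something like textplot_wordcloud(dtm, min_count = 4, max_words = 45) should be close to your original call, though I would check the argument names against the version you have installed.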

Justin

