I'm analyzing a corpus of emails, some of which contain URLs. When I apply the removePunctuation function from the tm library, a URL collapses to httpwww and I lose the information that it was a web address. What I would like to do is replace "://" with " " across the whole corpus. I tried gsub, but then the data type of the corpus changes and I can't continue processing it with the tm package.
Here is an example. As you can see, gsub changes the class of the corpus to a character vector, causing tm_map to fail:
> corpus
# A corpus with 4257 text documents
> corpus1 <- gsub("http://","http ",corpus)
> class(corpus1)
# [1] "character"
> class(corpus)
# [1] "VCorpus" "Corpus" "list"
> cleanSW <- tm_map(corpus1,removeWords, stopwords("english"))
# Error in UseMethod("tm_map", x) :
# no applicable method for 'tm_map' applied to an object of class "character"
> cleanSW <- tm_map(corpus,removeWords, stopwords("english"))
> cleanSW
# A corpus with 4257 text documents
How can I get around this? Is there a way to convert the character vector back into a corpus?
I found a solution to this problem here: Removing non-English text from Corpus in R using tm(). Corpus(VectorSource(dat1)) worked for me.
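To illustrate the round trip, here is a minimal sketch. The sample texts are made up, and it assumes the raw email text is still available as a character vector. VCorpus is used to match the class shown in the question; Corpus(VectorSource(...)) follows the same pattern. The second option relies on content_transformer, which I believe is only available in tm 0.6 and later, so treat it as an assumption about your installed version.

```r
library(tm)

# Hypothetical sample data standing in for the email texts
texts <- c("See http://www.example.com for details.",
           "No link in this message.")
corpus <- VCorpus(VectorSource(texts))

# Option 1: substitute on the plain character vector, then rebuild the corpus
replaced <- gsub("://", " ", texts, fixed = TRUE)
corpus1 <- VCorpus(VectorSource(replaced))

# Option 2 (assumes tm >= 0.6): wrap gsub so tm_map returns a corpus
corpus2 <- tm_map(corpus, content_transformer(gsub),
                  pattern = "://", replacement = " ", fixed = TRUE)

# Either result is still a corpus, so later tm_map steps keep working
class(corpus1)
cleanSW <- tm_map(corpus2, removeWords, stopwords("english"))
```

Option 1 mirrors the Corpus(VectorSource(dat1)) approach above. Option 2 avoids leaving the corpus at all, which may be preferable if the documents carry metadata you want to keep.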