Remove characters in text corpus
I'm analyzing a corpus of emails. Some emails contain URLs. When I apply the removePunctuation function from the tm library, I get httpwww, and I lose the information that it was a web address. What I would like to do is replace the "://" with " " across the whole corpus. I tried gsub, but then the datatype of the corpus changes and I can't continue to process it with the tm package.
Here is an example:
As you can see, gsub changes the class of the corpus to a character vector, causing tm_map to fail.
> corpus
# A corpus with 4257 text documents
> corpus1 <- gsub("http://","http ",corpus)
> class(corpus1)
# [1] "character"
> class(corpus)
# [1] "VCorpus" "Corpus" "list"
> cleanSW <- tm_map(corpus1,removeWords, stopwords("english"))
# Error in UseMethod("tm_map", x) :
# no applicable method for 'tm_map' applied to an object of class "character"
> cleanSW <- tm_map(corpus,removeWords, stopwords("english"))
> cleanSW
# A corpus with 4257 text documents
How can I bypass this? Maybe there's a way to convert it back to a corpus from a character vector?
Found a solution to this problem here: Remove non-English text from Corpus in R using tm(). Corpus(VectorSource(dat1)) worked for me.
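Putting that together, a minimal sketch of the round trip, assuming the tm package is loaded and corpus is the existing corpus of 4257 documents (the variable names docs and corpus2 are my own, not from the question):

```r
library(tm)

# gsub() coerces the corpus to a plain character vector,
# so extract the text explicitly, do the substitution,
# and then rebuild a corpus that tm_map() accepts.
docs <- sapply(corpus, as.character)
docs <- gsub("://", " ", docs, fixed = TRUE)  # fixed = TRUE: treat "://" literally

# Rebuild the corpus from the character vector
corpus2 <- Corpus(VectorSource(docs))
class(corpus2)  # a corpus again, not "character"

# tm_map() now works as before
cleanSW <- tm_map(corpus2, removeWords, stopwords("english"))
```

Note that rebuilding with Corpus(VectorSource(...)) drops any document metadata the original corpus carried. In tm 0.6 and later you can instead stay inside the corpus the whole time with tm_map(corpus, content_transformer(function(x) gsub("://", " ", x, fixed = TRUE))), which avoids the character-vector detour entirely.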