Removing characters from a text corpus

Question

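I am analyzing a corpus of emails. Some of the emails contain URLs. When I apply the removePunctuation function from the tm library, I get httpwww and lose the URL information. What I would like to do is replace "://" with " " throughout the corpus. I tried gsub, but then the data type of my corpus changed, so I could no longer process it with the tm package.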

Here is an example:

As you can see, gsub changes the class of the corpus to a character array, which causes tm_map to fail.

> corpus
# A corpus with 4257 text documents
> corpus1 <- gsub("http://","http ",corpus)
> class(corpus1)
# [1] "character"
> class(corpus)
# [1] "VCorpus" "Corpus"  "list"   
> cleanSW <- tm_map(corpus1,removeWords, stopwords("english"))
# Error in UseMethod("tm_map", x) : 
# no applicable method for 'tm_map' applied to an object of class "character"
> cleanSW <- tm_map(corpus,removeWords, stopwords("english"))
> cleanSW
# A corpus with 4257 text documents

How can I get around this? Perhaps there is a way to convert the character array back into a corpus?

Answer 1

I found a solution to this problem here: Removing non-English text from Corpus in R using tm(). Corpus(VectorSource(dat1)) worked for me.
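A minimal sketch of what that could look like, using made-up sample data (the `emails` vector and the variable names below are hypothetical, not from the original post). gsub coerces its input to a character vector, which is why the corpus class is lost. Option 1 pulls the text out, applies gsub, and rebuilds the corpus with VectorSource. Option 2 is an alternative that assumes tm 0.6 or later, where content_transformer() is available, and keeps the object a corpus throughout.

```r
library(tm)

# Hypothetical sample data standing in for the e-mail corpus.
emails <- c("See http://www.example.com for details.",
            "No link in this message.")
corpus <- VCorpus(VectorSource(emails))

# Option 1: extract the text of each document, run gsub on the plain
# character vector, then rebuild a corpus from the result.
texts <- vapply(corpus,
                function(doc) paste(as.character(doc), collapse = "\n"),
                character(1), USE.NAMES = FALSE)
texts1 <- gsub("://", " ", texts, fixed = TRUE)
corpus1 <- VCorpus(VectorSource(texts1))

# Option 2 (assumes tm >= 0.6): wrap gsub in content_transformer() so
# that tm_map applies it to each document and returns a corpus.
replace_scheme <- content_transformer(
  function(x) gsub("://", " ", x, fixed = TRUE)
)
corpus2 <- tm_map(corpus, replace_scheme)

# Either result is still a corpus, so further tm_map steps work.
cleanSW <- tm_map(corpus2, removeWords, stopwords("english"))
```

Note that Option 1 rebuilds the documents from scratch, so any document-level metadata on the original corpus would need to be copied over separately.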

Answer 1 (2, accepted, 2014-07-31 15:19:21)
