
tm_map and stopwords failed to remove unwanted words from the corpus created in R

I have a resulting data frame that contains the following data:

                   word freq
credit           credit  790
account         account  451
xxxxxxxx       xxxxxxxx  430
report           report  405
information information  368
reporting     reporting  345
consumer       consumer  331
accounts       accounts  300
debt               debt  170
company         company  152
xxxxxx         xxxxxx    147

I want to do the following:

  • Remove all words containing two or more x's, such as xx, xxx, xxxx, and so on. Since these words can appear in upper or lower case, they first have to be converted to lower case and then removed (illustrated just below).
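
As a quick illustration (a minimal sketch with a made-up sample vector, not data from the question), "two or more x's" can be detected after lower-casing with a simple regular expression:

words <- c("credit", "XXXX", "xx", "xxxxxxxx", "account")    # hypothetical sample
grepl("x.*x", tolower(words))                                # TRUE when a word has two or more x's
#> [1] FALSE  TRUE  TRUE  TRUE FALSE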

I used tm_map to remove the stop words, but it does not seem to have worked; I still get the unwanted words shown above in the data frame.

myCorpus <- Corpus(VectorSource(df$txt))
myStopwords <- c(stopwords('english'),"xxx", "xxxx", "xxxxx", 
                 "XXX", "XXXX", "XXXXX", "xxxx", "xxx", "xx", "xxxxxxxx",
                 "xxxxxxxx", "XXXXXX", "xxxxxx", "XXXXXXX", "xxxxxxx", "XXXXXXXX", "xxxxxxxx")
myCorpus <- tm_map(myCorpus, tolower)
myCorpus<- tm_map(myCorpus,removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

myTdm <- as.matrix(TermDocumentMatrix(myCorpus))
v <- sort(rowSums(myTdm), decreasing=TRUE)
FreqMat <- data.frame(word = names(v), freq=v, row.names = F)
head(FreqMat, 10)

The code above did not work for me to remove the unwanted words from the corpus.

Is there any other way to solve this problem?

One possibility involving dplyr and stringr is:

library(dplyr)
library(stringr)

df %>%
 mutate(word = tolower(word)) %>%                 # lower-case every word first
 filter(str_count(word, fixed("x")) <= 1)         # keep only words with at most one "x"

         word freq
1      credit  790
2     account  451
3      report  405
4 information  368
5   reporting  345
6    consumer  331
7    accounts  300
8        debt  170
9     company  152

Or a base R possibility using similar logic:

# keep rows whose word contains at most one "x" (counted case-insensitively)
df[sapply(df[, 1], 
          function(x) length(grepRaw("x", tolower(x), all = TRUE, fixed = TRUE)) <= 1, 
          USE.NAMES = FALSE), ]
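
If the filtering needs to happen inside the tm pipeline itself rather than on the frequency data frame, a minimal sketch along these lines could work (this assumes tm's content_transformer and a regex substitution; it is not code from the original answer). Note that removeWords only deletes the exact strings listed in myStopwords, which is one likely reason longer runs of x's survive:

library(tm)

myCorpus <- Corpus(VectorSource(df$txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))       # lower-case without breaking the corpus
myCorpus <- tm_map(myCorpus, content_transformer(function(x)
  gsub("\\bx{2,}\\b", " ", x)))                                  # drop standalone runs of two or more x's
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))

The term-document matrix and frequency table can then be rebuilt exactly as in the question.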
