
tm_map and stopwords failed to remove unwanted words from the corpus created in R

I have a resulting data frame that contains the following data:

                   word freq
credit           credit  790
account         account  451
xxxxxxxx       xxxxxxxx  430
report           report  405
information information  368
reporting     reporting  345
consumer       consumer  331
accounts       accounts  300
debt               debt  170
company         company  152
xxxxxx         xxxxxx    147

I want to do the following:

  • Remove all words containing two or more x's, such as xx, xxx, xxxx, and so on. Since these words can appear in upper or lower case, they first have to be converted to lower case and then removed (illustrated just below).
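
As a quick illustration (a minimal sketch with a made-up sample vector, not data from the question), "two or more x's" can be detected after lower-casing with a simple regular expression:

words <- c("credit", "XXXX", "xx", "xxxxxxxx", "account")    # hypothetical sample
grepl("x.*x", tolower(words))                                # TRUE when a word has two or more x's
#> [1] FALSE  TRUE  TRUE  TRUE FALSE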

I used tm_map to remove the stop words, but it does not seem to have worked; I still get the unwanted words shown above in the data frame.

myCorpus <- Corpus(VectorSource(df$txt))
myStopwords <- c(stopwords('english'),"xxx", "xxxx", "xxxxx", 
                 "XXX", "XXXX", "XXXXX", "xxxx", "xxx", "xx", "xxxxxxxx",
                 "xxxxxxxx", "XXXXXX", "xxxxxx", "XXXXXXX", "xxxxxxx", "XXXXXXXX", "xxxxxxxx")
myCorpus <- tm_map(myCorpus, tolower)
myCorpus<- tm_map(myCorpus,removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

myTdm <- as.matrix(TermDocumentMatrix(myCorpus))
v <- sort(rowSums(myTdm), decreasing=TRUE)
FreqMat <- data.frame(word = names(v), freq=v, row.names = F)
head(FreqMat, 10)

The code above did not work for me to remove the unwanted words from the corpus.

Is there any other way to solve this problem?

One possibility involving dplyr and stringr is:

library(dplyr)
library(stringr)

df %>%
 mutate(word = tolower(word)) %>%                 # lower-case every word first
 filter(str_count(word, fixed("x")) <= 1)         # keep only words with at most one "x"

         word freq
1      credit  790
2     account  451
3      report  405
4 information  368
5   reporting  345
6    consumer  331
7    accounts  300
8        debt  170
9     company  152

Or a base R possibility using similar logic:

# keep rows whose word contains at most one "x" (counted case-insensitively)
df[sapply(df[, 1], 
          function(x) length(grepRaw("x", tolower(x), all = TRUE, fixed = TRUE)) <= 1, 
          USE.NAMES = FALSE), ]
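
If the filtering needs to happen inside the tm pipeline itself rather than on the frequency data frame, a minimal sketch along these lines could work (this assumes tm's content_transformer and a regex substitution; it is not code from the original answer). Note that removeWords only deletes the exact strings listed in myStopwords, which is one likely reason longer runs of x's survive:

library(tm)

myCorpus <- Corpus(VectorSource(df$txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))       # lower-case without breaking the corpus
myCorpus <- tm_map(myCorpus, content_transformer(function(x)
  gsub("\\bx{2,}\\b", " ", x)))                                  # drop standalone runs of two or more x's
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))

The term-document matrix and frequency table can then be rebuilt exactly as in the question.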
