tm_map and stopwords failed to remove unwanted words from the corpus created in R
I have a result data frame that contains the following data:
            word        freq
credit      credit       790
account     account      451
xxxxxxxx    xxxxxxxx     430
report      report       405
information information  368
reporting   reporting    345
consumer    consumer     331
accounts    accounts     300
debt        debt         170
company     company      152
xxxxxx      xxxxxx       147
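Here is what I am trying to do:
I used tm_map to remove stopwords, but it does not seem to work: I still get unwanted words such as "xxxxxxxx" and "xxxxxx" in the data frame, as shown above.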
myCorpus <- Corpus(VectorSource(df$txt))
myStopwords <- c(stopwords('english'), "xxx", "xxxx", "xxxxx",
                 "XXX", "XXXX", "XXXXX", "xxxx", "xxx", "xx", "xxxxxxxx",
                 "xxxxxxxx", "XXXXXX", "xxxxxx", "XXXXXXX", "xxxxxxx", "XXXXXXXX", "xxxxxxxx")
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myTdm <- as.matrix(TermDocumentMatrix(myCorpus))
v <- sort(rowSums(myTdm), decreasing=TRUE)
FreqMat <- data.frame(word = names(v), freq=v, row.names = F)
head(FreqMat, 10)
The code above does not remove the unwanted words from the corpus for me.
Is there another way to solve this problem?
One possibility, involving dplyr and stringr:
df %>%
  mutate(word = tolower(word)) %>%
  filter(str_count(word, fixed("x")) <= 1)
word freq
1 credit 790
2 account 451
3 report 405
4 information 368
5 reporting 345
6 consumer 331
7 accounts 300
8 debt 170
9 company 152
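Note that the filter above keeps only words with at most one "x", so it would also drop ordinary words that happen to contain two or more. If only the masking tokens should go, a stricter variant is to drop words made up entirely of "x". This is a minimal sketch, assuming `df` has the `word` and `freq` columns shown in the question; the sample rows are copied from the question, except for "exxon", which is a hypothetical row added purely for illustration.

```r
library(dplyr)
library(stringr)

# Sample rows copied from the question's data frame; "exxon" is a
# hypothetical extra row, added only to show the difference.
df <- data.frame(
  word = c("credit", "account", "xxxxxxxx", "report", "xxxxxx", "exxon"),
  freq = c(790, 451, 430, 405, 147, 1),
  stringsAsFactors = FALSE
)

# Drop only the words that consist of two or more "x" and nothing else.
cleaned <- df %>%
  mutate(word = tolower(word)) %>%
  filter(!str_detect(word, "^x{2,}$"))

print(cleaned)
```

With this pattern "exxon" survives, whereas the count-based filter would remove it.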
Alternatively, a base R possibility using the same at-most-one-"x" logic:
df[sapply(df[, 1],
          function(x) length(grepRaw("x", tolower(x), all = TRUE, fixed = TRUE)) <= 1,
          USE.NAMES = FALSE), ]
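Another option in the same spirit is to strip the masks from the raw text before the corpus is built, so that the stopword list does not have to enumerate every length. This is only a sketch in base R: `txt` stands in for `df$txt` from the question, and the sample strings are made up.

```r
# Made-up sample text; in the question this would be df$txt.
txt <- c("Credit report XXXX shows account XXXXXXXX",
         "Debt reported on XX/XX/XXXX by the company")

# Lower-case first, then remove every run of two or more "x" characters
# that forms a whole word (\\b marks a word boundary).
clean_txt <- gsub("\\bx{2,}\\b", "", tolower(txt), perl = TRUE)

# Collapse the whitespace left behind by the removed tokens.
clean_txt <- trimws(gsub("\\s+", " ", clean_txt))

print(clean_txt)
```

The cleaned vector could then be passed to VectorSource() in place of df$txt. Any leftover punctuation, such as the slashes from a masked date, would still be handled by removePunctuation in the question's pipeline.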