tm_map and stopwords failed to remove unwanted words from the corpus created in R
I have a resulting data frame which has the following data:
word freq
credit credit 790
account account 451
xxxxxxxx xxxxxxxx 430
report report 405
information information 368
reporting reporting 345
consumer consumer 331
accounts accounts 300
debt debt 170
company company 152
xxxxxx xxxxxx 147
I want to do the following:
I am using tm_map to remove the stopwords, but it doesn't seem to work: I still get the unwanted words in the data frame, as shown above.
myCorpus <- Corpus(VectorSource(df$txt))
myStopwords <- c(stopwords('english'),"xxx", "xxxx", "xxxxx",
"XXX", "XXXX", "XXXXX", "xxxx", "xxx", "xx", "xxxxxxxx",
"xxxxxxxx", "XXXXXX", "xxxxxx", "XXXXXXX", "xxxxxxx", "XXXXXXXX", "xxxxxxxx")
myCorpus <- tm_map(myCorpus, tolower)
myCorpus<- tm_map(myCorpus,removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myTdm <- as.matrix(TermDocumentMatrix(myCorpus))
v <- sort(rowSums(myTdm), decreasing=TRUE)
FreqMat <- data.frame(word = names(v), freq=v, row.names = F)
head(FreqMat, 10)
The code above didn't remove the unwanted words from the corpus. Is there any other alternative to deal with this issue?
One possibility involving dplyr and stringr could be:
df %>%
mutate(word = tolower(word)) %>%
filter(str_count(word, fixed("x")) <= 1)
word freq
1 credit 790
2 account 451
3 report 405
4 information 368
5 reporting 345
6 consumer 331
7 accounts 300
8 debt 170
9 company 152
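If the masked tokens are always runs of the letter x (an assumption based on the sample data), the count-based filter above could also be written with a regular expression. A sketch, rebuilding a small version of the question's data frame for illustration:

```r
library(dplyr)
library(stringr)

# A few rows rebuilt from the question's data, for illustration only
df <- data.frame(word = c("credit", "account", "xxxxxxxx", "report", "xxxxxx"),
                 freq = c(790, 451, 430, 405, 147))

df %>%
  mutate(word = tolower(word)) %>%
  filter(!str_detect(word, "^x+$"))  # drop tokens consisting only of x's
```

This keeps legitimate words that merely contain an x, while removing placeholders of any length.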
Or a base R possibility using a similar logic:
df[sapply(df[, 1],
function(x) length(grepRaw("x", tolower(x), all = TRUE, fixed = TRUE)) <= 1,
USE.NAMES = FALSE), ]
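If staying within tm is preferred: a likely reason the original pipeline misbehaved is that, in tm 0.6 and later, passing tolower straight to tm_map returns bare character vectors instead of documents, which can break the transformations that follow. A sketch of the adjusted pipeline (the input text here is invented for illustration):

```r
library(tm)

# Hypothetical input standing in for the asker's df$txt
txt <- c("Credit report for account XXXXXXXX",
         "Consumer debt xxxxxx reporting")

myCorpus <- Corpus(VectorSource(txt))

# Wrap base functions in content_transformer() so the corpus
# elements remain proper documents in tm >= 0.6
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

# After lowercasing, one lowercase pattern per filler length suffices
myStopwords <- c(stopwords("english"),
                 "xx", "xxx", "xxxx", "xxxxx", "xxxxxx", "xxxxxxx", "xxxxxxxx")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
```

Lowercasing before removeWords also avoids having to list every casing variant of the fillers, since removeWords matches case-sensitively.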