tm_map and stopwords failed to remove unwanted words from the corpus created in R

I have a data frame of word frequencies that contains the following data:

                   word freq
credit           credit  790
account         account  451
xxxxxxxx       xxxxxxxx  430
report           report  405
information information  368
reporting     reporting  345
consumer       consumer  331
accounts       accounts  300
debt               debt  170
company         company  152
xxxxxx           xxxxxx  147

I want to do the following:

  • Remove all the words that consist of two or more x characters, such as xx, xxx, xxxx and so forth. These words can be in lower or upper case, so the text has to be converted to lower case first and then filtered.

I am using tm_map to remove the stopwords, but it does not seem to work: the unwanted words still appear in the data frame shown above.

library(tm)

myCorpus <- Corpus(VectorSource(df$txt))
myStopwords <- c(stopwords('english'), "xxx", "xxxx", "xxxxx",
                 "XXX", "XXXX", "XXXXX", "xxxx", "xxx", "xx", "xxxxxxxx",
                 "xxxxxxxx", "XXXXXX", "xxxxxx", "XXXXXXX", "xxxxxxx", "XXXXXXXX", "xxxxxxxx")
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

myTdm <- as.matrix(TermDocumentMatrix(myCorpus))
v <- sort(rowSums(myTdm), decreasing=TRUE)
FreqMat <- data.frame(word = names(v), freq=v, row.names = F)
head(FreqMat, 10)

The code above did not remove the unwanted words from the corpus.

Is there an alternative way to deal with this issue?

One possibility involving dplyr and stringr could be:

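library(dplyr)
library(stringr)
# Lower-case each word, then keep only the words that contain at most one "x"
df %>%
 mutate(word = tolower(word)) %>%
 filter(str_count(word, fixed("x")) <= 1)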

         word freq
1      credit  790
2     account  451
3      report  405
4 information  368
5   reporting  345
6    consumer  331
7    accounts  300
8        debt  170
9     company  152
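
Counting every "x" in a word also drops ordinary words that happen to contain two of them. If that matters, a stricter variant is to drop only the words made up entirely of x characters. The following is a minimal base R sketch of that idea; the data frame is a hypothetical reconstruction of the one in the question, and the "xerox" row and the names is_placeholder and cleaned are made up for illustration.

```r
# Hypothetical reconstruction of the question's data frame.
# The "xerox" row is invented to show a real word containing two x characters.
df <- data.frame(
  word = c("credit", "account", "xxxxxxxx", "report", "XXXXXX", "xerox"),
  freq = c(790, 451, 430, 405, 147, 3),
  stringsAsFactors = FALSE
)

# TRUE for words made of nothing but two or more x characters, in any case
is_placeholder <- grepl("^x{2,}$", df$word, ignore.case = TRUE)

# Keep everything that is not a placeholder
cleaned <- df[!is_placeholder, ]
print(cleaned)
```

Here "xerox" survives because the pattern is anchored at both ends, whereas the count-based filter above would remove it.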

Or a base R possibility using similar logic:

df[sapply(df[, 1], 
          function(x) length(grepRaw("x", tolower(x), all = TRUE, fixed = TRUE)) <= 1, 
          USE.NAMES = FALSE), ]
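
Both answers filter the frequency table after it has been built. Another option is to strip the placeholder tokens from the raw text before the corpus is created. This is only a sketch in base R, under the assumption that the input is a plain character vector; txt and clean_txt are hypothetical names standing in for df$txt.

```r
# Hypothetical sample text standing in for df$txt
txt <- c("XXXX reported my account on XX/XX/XXXX",
         "The credit report lists xxxxxxxx as the company")

# Lower-case first, then replace whole tokens of two or more x characters
clean_txt <- gsub("\\bx{2,}\\b", " ", tolower(txt), perl = TRUE)

# Collapse the whitespace left behind by the removed tokens
clean_txt <- trimws(gsub("\\s+", " ", clean_txt))
print(clean_txt)
```

In a tm pipeline the same substitution could probably be wrapped in content_transformer() and applied with tm_map() before removePunctuation, since removing punctuation first would merge a string like xx/xx/xxxx into a single longer token. That ordering is an assumption and has not been tested against the original data.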
