
Struggling with removing words based on pattern (text analysis in R)

I'm new to text analysis. I have been struggling with a particular problem in R this past week. I am trying to figure out how to remove or replace all variations of a word in a character vector. For example, if the vector is:

test <- c("development", "develop", "developing", "developer", "apples", "kiwi")

I want the end output to be:

"apples", "kiwi"

So, basically, I'm trying to figure out how to remove or replace all words beginning with "develop" (that is, matching "^develop"). I tried str_remove_all from the stringr package with this expression:

library(stringr)
str_remove_all(test, "^dev")

But the end result was this:

"elopment", "elop", "eloping", "eloper", "apples", "kiwi"

It only removed the part of each word that matched the pattern "^dev", whereas I want to remove the entire word if it begins with "dev".

Thanks!

Filter(function(x) !any(grepl("develop", x)), test)

Use grep with invert = TRUE:

grep("^develop", test, invert = TRUE, value = TRUE)
## [1] "apples" "kiwi"  

or negate grepl:

ok <- !grepl("^develop", test)
test[ok]
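
Since the question mentions replacing as well as removing, the same logical index can be reused for that. This is only a sketch: the replacement string "DEV" is a hypothetical placeholder, not something from the question.

```r
test <- c("development", "develop", "developing", "developer", "apples", "kiwi")

# TRUE for every element that starts with "develop"
hit <- grepl("^develop", test)

# Overwrite the matching elements in a copy; "DEV" is a hypothetical placeholder
replaced <- test
replaced[hit] <- "DEV"
replaced
```

Indexing with hit selects the matches, while !hit (as in the answer above) keeps everything else.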

or remove "develop" and then keep only those elements that did not change:

test[sub("^develop", "", test) == test]
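
To see why that last one-liner works, it may help to spell out the intermediate steps. The variable names stripped and unchanged are introduced here for illustration only.

```r
test <- c("development", "develop", "developing", "developer", "apples", "kiwi")

# Step 1: strip a leading "develop"; elements that do not match are returned as-is
stripped <- sub("^develop", "", test)
stripped

# Step 2: an element is unchanged only if the pattern did not match it
unchanged <- stripped == test

# Step 3: keep the unchanged elements
result <- test[unchanged]
result
```

This also shows what went wrong with str_remove_all: it performs step 1 only, so the leftovers of the matching words stay in the vector.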

With stringr, you can do the following:

stringr::str_subset(test, "^dev", negate = TRUE)
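
As a quick sanity check, the approaches on this page can be compared against each other. The sketch below assumes the test vector from the question; the by_* names are made up for illustration, and the stringr comparison is skipped if the package is not installed.

```r
test <- c("development", "develop", "developing", "developer", "apples", "kiwi")

# Base R variants: each drops the elements that start with "develop"
by_grep   <- grep("^develop", test, invert = TRUE, value = TRUE)
by_grepl  <- test[!grepl("^develop", test)]
by_sub    <- test[sub("^develop", "", test) == test]
by_filter <- Filter(function(x) !grepl("^develop", x), test)

# stopifnot() raises an error if any variant disagrees with by_grep
stopifnot(
  identical(by_grep, by_grepl),
  identical(by_grep, by_sub),
  identical(by_grep, by_filter)
)

# Optional: compare with stringr only when it is available
if (requireNamespace("stringr", quietly = TRUE)) {
  by_stringr <- stringr::str_subset(test, "^dev", negate = TRUE)
  stopifnot(identical(by_grep, by_stringr))
}

by_grep
```

Note that "^dev" and "^develop" agree on this data, but "^dev" would also drop words such as "device", so the longer pattern is the safer choice if only forms of "develop" should go.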


 