How to remove all occurrences of a word pattern but excluding a particular pattern using str_remove in R

I want to go through a vector and look for a particular string pattern (e.g. 'an'). If a match is found, the whole word should be removed, but only if that word is not a particular string (e.g. 'orange').

So far I have come up with the following. In this example, I'm looking for the pattern 'an', and if a match is found, the whole word containing it should be removed.

library(stringr)
# Create a small example vector: all fruit names that contain "an"
my_vec <- fruit[str_detect(fruit, "an")]

# Remove the first word in each element that contains the pattern 'an'
str_remove(my_vec, "\\w*an\\w*")

The output shows that most elements become empty strings (because their only word contains the pattern 'an'), while the words "blood", "melon", and "purple" are kept, which is as expected.

Next, I want to extend the str_remove statement so that it does not remove the word 'orange'. All words that contain "an" should still be removed, unless the word is 'orange'. The expected output keeps "blood orange", "melon", "orange", and "purple".

I believe that ',' means to exclude a particular pattern, but I have not managed to get this to work. Any tips and insights are much appreciated.

You can do that in several ways:

str_remove_all(my_vec, "\\b(?!orange\\b)\\w*an\\w*" )
str_replace_all(my_vec, "\\b(orange)\\b|\\w*an\\w*", "\\1" )

See an R test:

library(stringr)
my_vec <- c("man,blood,melon,purple,orange.")
str_remove_all(my_vec, "\\b(?!orange\\b)\\w*an\\w*" )
# => [1] ",blood,melon,purple,orange."
str_replace_all(my_vec, "\\b(orange)\\b|\\w*an\\w*", "\\1" )
# => [1] ",blood,melon,purple,orange."

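Details:

  • \b - a word boundary, so the match can only start at the beginning of a word
  • (?!orange\b) - a negative lookahead that fails the match if the text immediately to the right is orange as a whole word
  • \w*an\w* - zero or more word chars, then an, then zero or more word chars, i.e. a whole word containing an.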
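To see what each piece contributes, here is a small sketch on a made-up string. The sample text and the variable names are assumptions chosen for illustration; they are not from the question.

```r
library(stringr)

# Hypothetical sample string, chosen only to exercise each part of the pattern
x <- "man oranges orange"

# Full pattern: "man" and "oranges" are removed, the whole word "orange" stays
full <- str_remove_all(x, "\\b(?!orange\\b)\\w*an\\w*")
full
# => [1] "  orange"

# No \b inside the lookahead: any word starting with "orange" is protected
loose <- str_remove_all(x, "\\b(?!orange)\\w*an\\w*")
loose
# => [1] " oranges orange"

# No leading \b: a match may start mid-word, so "range" is cut out of "orange"
unanchored <- str_remove_all(x, "(?!orange\\b)\\w*an\\w*")
unanchored
# => [1] "  o"
```

The last variant shows why the leading \b matters: the lookahead is only checked at the position where a match starts, so without the boundary the regex engine simply starts one character later.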

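In str_replace_all(my_vec, "\\b(orange)\\b|\\w*an\\w*", "\\1"), the regex first tries to match orange as a whole word and captures it into Group 1; otherwise it matches any whole word containing an. The replacement is \1, the backreference to Group 1, so orange is put back unchanged, while any other matched word is replaced with an empty string because Group 1 did not take part in the match.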
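As a rough check against data shaped like the question's, the sketch below runs both patterns on a few multi-word fruit names. The five-element vector is a hand-picked assumption rather than the full fruit subset, and the whitespace cleanup with str_squish() is an extra step the question did not ask for.

```r
library(stringr)

# Hypothetical sample: a handful of fruit names of the kind used in the question
my_vec <- c("banana", "blood orange", "canary melon", "orange", "purple mangosteen")

# Approach 1: negative lookahead protects the whole word "orange"
by_lookahead <- str_remove_all(my_vec, "\\b(?!orange\\b)\\w*an\\w*")

# Approach 2: capture "orange" and put it back via the backreference
by_capture <- str_replace_all(my_vec, "\\b(orange)\\b|\\w*an\\w*", "\\1")

# Both approaches leave the same strings behind
identical(by_lookahead, by_capture)
# => [1] TRUE

# Optional cleanup: trim leftover spaces and drop elements that became empty
cleaned <- str_squish(by_lookahead)
cleaned <- cleaned[cleaned != ""]
cleaned
# => [1] "blood orange" "melon"        "orange"       "purple"
```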
