R: How to remove one of any two consecutive duplicate words?

How can I write a function that removes one of any two consecutive identical words (in my case separated by an underscore), without specifying the words in advance?

## Some examples
c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_pid")
#> [1] "ethnicity_ethnicity_selected_choice" "child_1_child_child_pid"

## Output needed
c("ethnicity_selected_choice",
  "child_1_child_pid")
#> [1] "ethnicity_selected_choice" "child_1_child_pid"

Created on 2022-07-08 by the reprex package (v2.0.1)

You could try to find:

([^_]+)(?:_\1(?=_|$))*

Then replace each match with \1 (the first capture group). The pattern breaks down as follows:


  • ([^_]+) - A capture group matching 1+ non-underscore characters (one word);
  • (?:_\1 - Open a non-capturing group matching an underscore followed by a backreference to the 1st capture group;
    • (?=_|$) - A nested positive lookahead requiring either an underscore or the end of the string next, so that only a whole repeated word is matched;
  • )* - Close the non-capturing group and match it 0+ times.

library(stringr)
v <- c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_pid")
v <- str_replace_all(v, "([^_]+)(?:_\\1(?=_|$))*", "\\1")
v

Prints:

#> [1] "ethnicity_selected_choice" "child_1_child_pid"

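Since the question asks for a function, the replacement above can be wrapped in one. What follows is a minimal sketch that uses base R's gsub() instead of stringr; the name dedupe_words is made up for illustration, and perl = TRUE is set because R's default regex engine does not support lookaheads.

```r
# Hypothetical helper name; the pattern is the one explained above.
dedupe_words <- function(x) {
  # perl = TRUE switches to PCRE, which supports the (?=...) lookahead
  gsub("([^_]+)(?:_\\1(?=_|$))*", "\\1", x, perl = TRUE)
}

v <- c("ethnicity_ethnicity_selected_choice",
       "child_1_child_child_pid")
dedupe_words(v)
#> [1] "ethnicity_selected_choice" "child_1_child_pid"
```
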
Another possible solution:

s <- c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_child_pid", "child_1_child_childhood_pid",
  "child_child")

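# The lookbehind anchors the match at a word start: after an underscore or at the start of the string
gsub("(?<=_|^)(\\w+)(_\\1)+(?=_|$)", "\\1", s, perl = TRUE)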

#> [1] "ethnicity_selected_choice"   "child_1_child_pid"          
#> [3] "child_1_child_childhood_pid" "child"
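
For comparison, the same result can probably be reached without a regular expression. This is only a sketch of a different technique (splitting on the separator and collapsing runs with rle()), not something taken from the answers above; the function name dedupe_words_rle and the sep argument are invented for illustration. One assumption to be aware of: strsplit() drops a trailing empty piece, so a trailing underscore in the input would not survive.

```r
# Hypothetical helper: split each string into words, collapse runs of
# identical neighbours with rle(), then paste the words back together.
dedupe_words_rle <- function(x, sep = "_") {
  vapply(
    strsplit(x, sep, fixed = TRUE),
    function(words) paste(rle(words)$values, collapse = sep),
    character(1),
    USE.NAMES = FALSE
  )
}

s <- c("ethnicity_ethnicity_selected_choice",
       "child_1_child_child_child_pid", "child_1_child_childhood_pid",
       "child_child")
dedupe_words_rle(s)
#> [1] "ethnicity_selected_choice"   "child_1_child_pid"
#> [3] "child_1_child_childhood_pid" "child"
```

Because rle() compares whole elements, "child" and "childhood" are never treated as duplicates, so no lookahead is needed here.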

