How can I create a function such that one of any two consecutive words (in my case separated by an underscore) is removed without specifying the words?
## Some examples
c("ethnicity_ethnicity_selected_choice",
"child_1_child_child_pid")
#> [1] "ethnicity_ethnicity_selected_choice" "child_1_child_child_pid"
## Output needed
c("ethnicity_selected_choice",
"child_1_child_pid")
#> [1] "ethnicity_selected_choice" "child_1_child_pid"
Created on 2022-07-08 by the reprex package (v2.0.1)
You could try to find:
([^_]+)(?:_\1(?=_|$))*
and replace with \1.
- ([^_]+) - A capture group to catch 1+ non-underscore characters;
- (?:_\1 - A non-capture group matching an underscore and a backreference to the 1st capture group;
- (?=_|$) - A nested positive lookahead with either an underscore or end-line anchor;
- )* - Close non-capture group and match 0+ times.
library(stringr)
v <- c("ethnicity_ethnicity_selected_choice",
"child_1_child_child_pid")
v <- str_replace_all(v, "([^_]+)(?:_\\1(?=_|$))*", "\\1")
v
Prints:
"ethnicity_selected_choice", "child_1_child_pid"
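Since the question asks for a function, the same pattern can be wrapped in one. This is only a sketch: the name drop_repeated_words() is made up for illustration, and it uses base R's gsub() with perl = TRUE instead of stringr, on the assumption that words are always separated by a single underscore.

```r
# drop_repeated_words() is a hypothetical helper name, not part of any package.
drop_repeated_words <- function(x) {
  # ([^_]+)            one whole word (no underscores), captured as group 1
  # (?:_\\1(?=_|$))*   zero or more immediate repeats of that word, each one
  #                    followed by an underscore or the end of the string
  gsub("([^_]+)(?:_\\1(?=_|$))*", "\\1", x, perl = TRUE)
}

drop_repeated_words(c("ethnicity_ethnicity_selected_choice",
                      "child_1_child_child_pid"))
#> [1] "ethnicity_selected_choice" "child_1_child_pid"
```

Because the lookahead insists on an underscore or the end of the string after each repeat, a longer word that merely starts with the previous one (for example child_childhood) is left alone.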
Another possible solution:
s <- c("ethnicity_ethnicity_selected_choice",
"child_1_child_child_child_pid", "child_1_child_childhood_pid",
"child_child")
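# (?<![^_]) requires the match to start at the beginning of a word;
# [^_]+ keeps the captured word from spanning an underscore
gsub("(?<![^_])([^_]+)(_\\1)+(?=_|$)", "\\1", s, perl = TRUE)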
#> [1] "ethnicity_selected_choice" "child_1_child_pid"
#> [3] "child_1_child_childhood_pid" "child"
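If a regex feels too opaque, the same idea can also be sketched without one: split each string on the underscore, collapse runs of identical neighbouring words with rle(), and paste the pieces back together. The name collapse_runs() is hypothetical, and the sketch assumes non-missing inputs with no empty words (no leading, trailing, or doubled separators).

```r
# collapse_runs() is a hypothetical helper name, not part of any package.
collapse_runs <- function(x, sep = "_") {
  vapply(strsplit(x, sep, fixed = TRUE), function(words) {
    # rle() groups consecutive equal values; $values keeps one entry per run
    paste(rle(words)$values, collapse = sep)
  }, character(1))
}

collapse_runs(c("ethnicity_ethnicity_selected_choice",
                "child_1_child_child_child_pid",
                "child_1_child_childhood_pid",
                "child_child"))
#> [1] "ethnicity_selected_choice"   "child_1_child_pid"
#> [3] "child_1_child_childhood_pid" "child"
```

This compares whole words only, so child and childhood are never treated as duplicates, and non-adjacent repeats (the first child in child_1_child) are kept.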