R How to remove any two consecutive words?

Question

How can I write a function that removes one of any two consecutive identical words (in my case separated by an underscore), without specifying the words in advance?

## Some examples
c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_pid")
#> [1] "ethnicity_ethnicity_selected_choice" "child_1_child_child_pid"

## Output needed
c("ethnicity_selected_choice",
  "child_1_child_pid")
#> [1] "ethnicity_selected_choice" "child_1_child_pid"

Created on 2022-07-08 by the reprex package (v2.0.1)

Answer 1

You could try to find:

([^_]+)(?:_\1(?=_|$))*

Replace each match with \1.

- ([^_]+) - A capture group that matches 1+ non-underscore characters;
- (?:_\1 - Open a non-capturing group that matches an underscore followed by a backreference to the 1st capture group;
- (?=_|$) - A nested positive lookahead that requires either an underscore or the end of the string;
- )* - Close the non-capturing group and match it 0+ times.
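To see how this pattern splits a string into matches, the matches can be listed before any replacement is done. This is a minimal sketch that uses base R's `regmatches()` and `gregexpr()` with `perl = TRUE` rather than stringr; the variable names are illustrative only.

```r
# Pattern from the explanation above, run through base R's PCRE engine.
pattern <- "([^_]+)(?:_\\1(?=_|$))*"
x <- "child_1_child_child_pid"

# Each match is one word plus any immediate repeats of that same word.
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
matches[[1]]
#> [1] "child"       "1"           "child_child" "pid"
```

Replacing every match with \1 keeps only the first copy of each word, which is why the run "child_child" collapses to "child".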

library(stringr)
v <- c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_pid")
v <- str_replace_all(v, "([^_]+)(?:_\\1(?=_|$))*", "\\1")
v

Prints:

#> [1] "ethnicity_selected_choice" "child_1_child_pid"
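Since the question asks for a function, the same replacement could be wrapped up as below. This is a sketch only: `remove_repeated_words` is a hypothetical name, and it uses base R's `gsub()` with `perl = TRUE` in place of `str_replace_all()`, assuming the separator is always an underscore.

```r
# Hypothetical helper; the name is illustrative and not part of any package.
remove_repeated_words <- function(x) {
  # Collapse each run of identical underscore-separated words to one copy.
  gsub("([^_]+)(?:_\\1(?=_|$))*", "\\1", x, perl = TRUE)
}

remove_repeated_words(c("ethnicity_ethnicity_selected_choice",
                        "child_1_child_child_pid"))
#> [1] "ethnicity_selected_choice" "child_1_child_pid"
```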

Answer 2

Another possible solution:

s <- c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_child_pid", "child_1_child_childhood_pid",
  "child_child")

# (?<=_|^) requires the match to start at a word boundary; (_\\1)+ consumes the repeats
gsub("(?<=_|^)(\\w+)(_\\1)+(?=_|$)", "\\1", s, perl = TRUE)

#> [1] "ethnicity_selected_choice"   "child_1_child_pid"          
#> [3] "child_1_child_childhood_pid" "child"
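For comparison, the same result can be reached without a regular expression by splitting on the separator and collapsing runs with `rle()`. This is a different technique from the `gsub()` call above, shown only as a sketch; `dedupe_words` and its `sep` argument are hypothetical names, and `NA` inputs are not handled specially.

```r
# Hypothetical helper: split each string into words, keep one word per run.
dedupe_words <- function(x, sep = "_") {
  vapply(strsplit(x, sep, fixed = TRUE), function(words) {
    # rle() groups consecutive equal values; $values holds one entry per run.
    paste(rle(words)$values, collapse = sep)
  }, character(1), USE.NAMES = FALSE)
}

s <- c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_child_pid", "child_1_child_childhood_pid",
  "child_child")

dedupe_words(s)
```

Because words are compared whole, "child" and "childhood" are never treated as duplicates, so no lookahead is needed.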

2 answers

solution1
3 ACCEPTED 2022-07-13 19:07:47

solution2
1 2022-07-13 19:22:15
