How to remove a particular repeating element after the first from a character vector

I have a vector of path steps, and for one particular step I want to eliminate its consecutive repetitions.

For example,

my_vec = "A > A > X > B > X > X > X > C > C"

If 'X' repeats consecutively, I want to keep only the first X of each run and drop the rest, while preserving the order of all other elements, so that my desired outcome is:

my_vec = "A > A > X > B > X > C > C" , where the repetitive X's are eliminated from the middle.

I tried this with a for-loop and if-else combination, such that I would detect if a previous element in the vector also contains 'X', then replace the element with NA and afterwards I could remove the NA items, but this approach does not provide the desired outcome.

I tried looking here and here, but these just filter out the unique elements, while I want to perform this action on a particular element.

Here's my code:

library(stringr)  # provides str_split() and str_c()

my_vec <- unlist(str_split(my_vec, '>') )

for (i in length(my_vec)){
  if (grepl('X', my_vec[i]) & grepl('X', my_vec[i-1])) {
    steps[i] <- NA
  } else {
    next()
  }
}
my_new_vec <- str_c(steps, collapse = '>')

However, the output is exactly the same as input and nothing is changed into NA.
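A likely reason the loop changes nothing: `for (i in length(my_vec))` iterates exactly once, with `i` equal to the length of the vector, so only the last element is ever tested. The loop also assigns into `steps`, which the posted code never defines (presumably it was meant to be the split vector). The sketch below is one way to repair the loop under those assumptions. It uses base R only, and it compares against an untouched copy (`orig`, a name introduced here) so that an `NA` written in one iteration does not break the comparison in the next.

```r
my_vec <- "A > A > X > B > X > X > X > C > C"

# Split into steps and strip the surrounding spaces
steps <- trimws(strsplit(my_vec, ">", fixed = TRUE)[[1]])
orig  <- steps  # unmodified copy used for the comparisons

# seq_along(steps)[-1] visits 2, 3, ..., length(steps)
for (i in seq_along(steps)[-1]) {
  if (orig[i] == "X" && orig[i - 1] == "X") {
    steps[i] <- NA  # mark a repeated X for removal
  }
}

# Drop the marked elements and rebuild the string
my_new_vec <- paste(steps[!is.na(steps)], collapse = " > ")
print(my_new_vec)
## [1] "A > A > X > B > X > C > C"
```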

1) gsub Replace each repeated sequence of X, each X possibly followed by spaces and greater-than characters, with the last match in that sequence. This also works if the sequence is at the end of the string. If we knew that the sequence was never at the end, as in the example in the question, we could simplify the first argument to "(X > )*".

gsub("(X[> ]*)*", "\\1", my_vec)
## [1] "A > A > X > B > X > C > C"
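To illustrate the remark about a run at the end of the string, the sketch below applies both patterns to a made-up input, `tail_vec`, which is not from the question. The simplified pattern requires " > " after every X, so it should leave a trailing run untouched.

```r
tail_vec <- "A > X > X"  # hypothetical input that ends in a run of X

# General pattern from (1): the trailing run collapses to its last X
general <- gsub("(X[> ]*)*", "\\1", tail_vec)

# Simplified pattern: the final X has no " > " after it, so nothing collapses
simplified <- gsub("(X > )*", "\\1", tail_vec)

print(general)
## [1] "A > X"
print(simplified)
## [1] "A > X > X"
```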

2) strsplit/rle If you prefer to use strsplit as in the code in the question, try it in conjunction with rle. First we perform the strsplit, producing ss, and then apply rle to get r. Now for each run of " X ", change its length to 1 and invert the runs back, giving the deduped version of ss as s. Finally, convert to a string and remove leading and trailing whitespace.

ss <- strsplit(paste0(" ", my_vec, " "), ">")[[1]]
r <- rle(ss)
r$lengths[r$values == " X "] <- 1
s <- inverse.rle(r)
trimws(paste(s, collapse = ">"))
## [1] "A > A > X > B > X > C > C"
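The same rle idea can be wrapped in a small reusable function. This is only a sketch: `collapse_runs` and its arguments `step` and `sep` are names made up for illustration. It splits on the full separator " > " instead of padding with spaces, which is a slight variation on the code above.

```r
# Collapse every consecutive run of `step` to a single occurrence
collapse_runs <- function(path, step = "X", sep = " > ") {
  parts <- strsplit(path, sep, fixed = TRUE)[[1]]
  r <- rle(parts)                     # runs of identical neighbouring steps
  r$lengths[r$values == step] <- 1L   # shorten only the runs of `step`
  paste(inverse.rle(r), collapse = sep)
}

collapse_runs("A > A > X > B > X > X > X > C > C")
## [1] "A > A > X > B > X > C > C"
collapse_runs("X > X > A > X > X")
## [1] "X > A > X"
```

Runs of other steps, such as "A > A", are left alone because only the lengths whose value equals `step` are changed.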

(2a) Another approach also using strsplit is the following. The first and last lines of code here are the same as the first and last lines of code in (2).

ss <- strsplit(paste0(" ", my_vec, " "), ">")[[1]]
s <- ss[!c(FALSE, ss[-1] == ss[-length(ss)] & ss[-1] == " X ")]
trimws(paste(s, collapse = ">"))
## [1] "A > A > X > B > X > C > C"
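It may help to see the logical index from (2a) built in two steps. The sketch below uses trimmed steps for readability; `same_as_prev` and `drop` are names introduced here.

```r
ss <- strsplit("A > A > X > B > X > X > X > C > C", " > ", fixed = TRUE)[[1]]

# TRUE where a step equals the step before it (the first step has no predecessor)
same_as_prev <- c(FALSE, ss[-1] == ss[-length(ss)])

# Drop a step only if it repeats its predecessor AND is "X"
drop <- same_as_prev & ss == "X"
which(drop)
## [1] 6 7

result <- paste(ss[!drop], collapse = " > ")
print(result)
## [1] "A > A > X > B > X > C > C"
```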

UPDATE: Handle case where sequence is at the end and add (2) and (2a).

We can use gsub

gsub("(?:X > )\\K(X > )\\1*", "", my_vec, perl = TRUE)
#[1] "A > A > X > B > X > C > C"
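The pattern is given without explanation, so here is one reading of it as a commented sketch. Note that every piece of the pattern expects " > " after each X, so a run at the very end of the string (the made-up input "A > X > X" below) should be left unchanged.

```r
pattern <- "(?:X > )\\K(X > )\\1*"
# (?:X > )     the first "X > " of a run: it must be present but is kept
# \\K          resets the start of the match, so the text before it is not replaced
# (X > )\\1*   one or more further copies of "X > ": this is what gsub deletes

collapsed <- gsub(pattern, "", "A > A > X > B > X > X > X > C > C", perl = TRUE)
print(collapsed)
## [1] "A > A > X > B > X > C > C"

# Hypothetical input with the run at the end: no match, so no change
at_end <- gsub(pattern, "", "A > X > X", perl = TRUE)
print(at_end)
## [1] "A > X > X"
```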

A solution without regular expression. my_vec4 is the final output.

# Create example string
my_vec <- "A > A > X > B > X > X > X > C > C"

library(dplyr)

# Split my_vec by " > "
my_vec2 <- strsplit(my_vec, split = " > ")[[1]]

# TRUE where an element equals the previous one and is "X";
# default = "" keeps the first comparison from being NA
X_logi <- my_vec2 == dplyr::lag(my_vec2, default = "") & my_vec2 %in% "X"

# Subset my_vec2 if X_logi is false
my_vec3 <- my_vec2[!X_logi]

# Concatenate my_vec3
my_vec4 <- paste(my_vec3, collapse = " > ")

In JavaScript, a single regular-expression replace gives the same result:

let str = "A > A > X > B > X > X > X > C > C";
let result = str.replace(/(\s*X >)+/g, " X >");

console.log(result);  // A > A > X > B > X > C > C

Translated to R this would be: gsub("(\\s*X >)+", " X >", my_vec) – G. Grothendieck
