Split character vector with vector of patterns in R

I'm trying to write a function that builds a matrix by splitting a character vector repeatedly using successive elements in a vector of patterns.

Let's call the function I'm trying to write str_split_vector(). Here's an example of the output I'm looking for:

char <- c("A & P | B & C @ D",
          "E & Q | F & G @ H",
          "I & R | J & K @ L")
splits <- c(" \\| ", " & ", " @ ")

str_split_vector(char, splits)
#      [,1]     [,2] [,3] [,4]
# [1,] "A & P"  "B"  "C"  "D" 
# [2,] "E & Q"  "F"  "G"  "H" 
# [3,] "I & R"  "J"  "K"  "L" 

The char vector is split by each pattern in turn, and each pattern is applied only to the last piece left by the previous split, so "A & P" stays intact. (It might be easier to handle that last requirement with carefully chosen regex patterns.)

I've been able to accomplish this task only iteratively, with a pretty ad hoc loop:

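for (ii in seq_along(splits)) {
  if (ii == 1) {

    # First pass: split the original strings on the first pattern.
    char_mat <- matrix(char)
    char_mat <- do.call(rbind, strsplit(char_mat[ , ii], splits[ii]))

  } else {

    # Later passes: keep columns 1..(ii - 1) and split only column ii.
    # The parentheses matter: 1:ii - 1 parses as (1:ii) - 1, i.e. 0:(ii - 1).
    char_mat <- cbind(char_mat[ , 1:(ii - 1)],
                      do.call(rbind,
                              strsplit(char_mat[ , ii], splits[ii])
                              )
                      )
  }
}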

That process looks inefficient to me, since I'm "growing" char_mat with the repeated cbind() calls. Even worse, I find it almost impossible to understand what's going on without actually running the code.

Is there a simpler way to write this, potentially ignoring the requirement that "A & P" not be split?

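Maybe the following is what you want. It uses no loops: all the patterns are joined into a single alternation and the strings are split on any of them, so "A & P" is split as well.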

str_split_vector <- function(x, y){
    s <- strsplit(x, paste(y, collapse = "|"))
    do.call(rbind, s)
}

str_split_vector(char, splits)
#     [,1] [,2] [,3] [,4] [,5]
#[1,] "A"  "P"  "B"  "C"  "D" 
#[2,] "E"  "Q"  "F"  "G"  "H" 
#[3,] "I"  "R"  "J"  "K"  "L"
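If "A & P" does have to stay intact, the patterns can instead be applied one at a time to whatever remains after the previous split, filling a preallocated matrix rather than growing one with cbind(). The following is only a minimal sketch: str_split_first is a hypothetical name, and it assumes every pattern matches each remaining string at least once. It splits at the first match of each pattern only, which differs from strsplit() when a pattern occurs more than once in the remainder.

```r
# Hypothetical helper (not from the original post): apply each pattern once,
# in order, to the text left over after the previous split.
str_split_first <- function(x, patterns) {
  # One column per pattern, plus one for the final remainder.
  out <- matrix(NA_character_, nrow = length(x), ncol = length(patterns) + 1L)
  rest <- x
  for (i in seq_along(patterns)) {
    hit <- regexpr(patterns[i], rest)   # first match per string, -1 if none
    if (any(hit < 0)) stop("pattern ", i, " did not match every string")
    start <- as.integer(hit)
    len   <- attr(hit, "match.length")
    out[, i] <- substr(rest, 1L, start - 1L)        # text before the delimiter
    rest <- substr(rest, start + len, nchar(rest))  # text after the delimiter
  }
  out[, ncol(out)] <- rest
  out
}

char <- c("A & P | B & C @ D",
          "E & Q | F & G @ H",
          "I & R | J & K @ L")
splits <- c(" \\| ", " & ", " @ ")

res <- str_split_first(char, splits)
# res[1, ] is c("A & P", "B", "C", "D")
```

The loop is still there, but each pass only writes one column and shortens the remainder, which may be easier to follow than the nested cbind() calls.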

An approach that uses capture groups and does not split on the first & is the following:

do.call(rbind, strsplit(gsub("(.*) \\| (.*) & (.*) @ (.*)", "\\1_\\2_\\3_\\4", char), "_"))

It basically replaces the delimiters you wish to split on with an underscore and then splits on those underscores. Note that this assumes no underscore appears in the data itself.
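The same capture-group idea can be built from the splits vector instead of being typed out by hand, and the groups can be read directly with regexec() and regmatches(), which avoids the intermediate placeholder character. This is a rough sketch rather than a definitive implementation: str_split_groups is a hypothetical name, and it assumes the patterns contain no capture groups of their own and that every string matches the combined pattern.

```r
# Hypothetical helper (not from the original post): turn the patterns into one
# regex of the form "^(.*)<p1>(.*)<p2>...(.*)$" and extract the groups.
str_split_groups <- function(x, patterns) {
  rx <- paste0("^", paste0("(.*)", c(patterns, ""), collapse = ""), "$")
  # Each list element is c(full match, group 1, group 2, ...),
  # or character(0) when the string does not match.
  m <- regmatches(x, regexec(rx, x))
  if (any(lengths(m) == 0L)) stop("some strings did not match the pattern")
  # Drop the first column, which holds the full match.
  do.call(rbind, m)[, -1L, drop = FALSE]
}

char <- c("A & P | B & C @ D",
          "E & Q | F & G @ H",
          "I & R | J & K @ L")
splits <- c(" \\| ", " & ", " @ ")

res <- str_split_groups(char, splits)
# res[2, ] is c("E & Q", "F", "G", "H")
```

For the example data the match is unambiguous. If a delimiter can occur more than once after the previous one, which occurrence the groups settle on depends on the regex matching rules, so the result should be checked on real data.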
