
R regex - splitting between parentheses

Suppose I have a string x and I want to split it like so:

x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"

# Desired output
[1]  "(A|C|T)"  "A"  "G"  "(C|T)"  "(A|C|G|T)"  "(A|C|G|T)"  "(A|C|G|T)"  
[8]  "(A|C|G|T)"  "(A|C|G|T)"  "G"  "C"  "C"  "(C|T)"  "(A|C|G|T)"  
[15] "(A|C|G|T)"  "(A|C|G)"  

I am using this splitting function, but it does not split the letters that fall outside the parentheses. What would be the best way to approach this regex problem?

splitme <- function(x) {
  x <- unlist(strsplit(x, "(?=\\()", perl=TRUE))
  x <- unlist(strsplit(x, "(?<=\\))", perl=TRUE))
  for (i in which(x=="(")) {
    x[i+1] <- paste(x[i], x[i+1], sep="")
  }
  x[-which(x=="(")]
}

splitme(x)
 [1] "(A|C|T)"   "AG"        "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "GCC"      
[10] "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"  

Something like this should work:

> library(stringi)

> unlist(stri_extract_all_regex(x, "\\([ACGT\\|]*\\)|[ACGT]"))
 [1] "(A|C|T)"   "A"         "G"         "(C|T)"     "(A|C|G|T)" "(A|C|G|T)"
 [7] "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "G"         "C"         "C"        
[13] "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"  

\\([ACGT\\|]*\\) matches each group enclosed in parentheses, and [ACGT] matches the remaining single bases.
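If adding stringi as a dependency is not desirable, the same extract-the-tokens idea can probably be sketched in base R with gregexpr() and regmatches(). The variable names pat and tokens are only illustrative, and the pattern assumes the string contains nothing but A, C, G, T, | and parentheses.

```r
x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"

# Match either a parenthesised group or a single bare base.
# Inside a character class "|" is already literal, so it needs no escaping.
pat <- "\\([ACGT|]*\\)|[ACGT]"

# gregexpr() locates every match; regmatches() extracts the matched text.
tokens <- regmatches(x, gregexpr(pat, x))[[1]]
print(tokens)
```

Extracting matches, rather than splitting on zero-width positions, avoids the special handling that strsplit() applies to empty matches.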

It looks like you'd like to split the string after each ), and after each letter that's followed by either another letter or by a (. If that's the behavior you want, you can use this:

pat <- "(?<=\\))|(?<=[[:alpha:]])(?=[[:alpha:]\\(])"
strsplit(x, pat, perl=TRUE)[[1]]
#  [1] "(A|C|T)"   "A"         "G"         "(C|T)"     "(A|C|G|T)" "(A|C|G|T)"
#  [7] "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "G"         "C"         "C"        
# [13] "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)" 
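Because this pattern only matches zero-width positions, no characters are consumed by the split, so pasting the pieces back together should reproduce the input. A minimal sketch of that sanity check follows; the names pieces, roundtrip and bare are illustrative and not part of the answer above.

```r
x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"
pat <- "(?<=\\))|(?<=[[:alpha:]])(?=[[:alpha:]\\(])"

pieces <- strsplit(x, pat, perl = TRUE)[[1]]

# Nothing was consumed, so the pieces concatenate back to the original string.
roundtrip <- paste(pieces, collapse = "")
stopifnot(identical(roundtrip, x))

# Every piece without a parenthesis should be exactly one letter long.
bare <- pieces[!grepl("(", pieces, fixed = TRUE)]
stopifnot(all(nchar(bare) == 1))
```

strsplit() is vectorised, so passing a character vector instead of a single string returns a list with one vector of pieces per input element.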

To split single letters, you can just run strsplit(x, ""). All you have to do is make sure not to apply that to "finished" strings (i.e. the ones with parentheses).

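y = splitme(x)
# TRUE for pieces without a parenthesis, i.e. runs of bare letters such as "AG"
Indices = !grepl("\\(", y)
# Assigning a list into y turns y into a list; unlist() flattens it again
y[Indices] = strsplit(y[Indices], "")
unlist(y)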
 [1] "(A|C|T)"   "A"         "G"         "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)"
 [9] "(A|C|G|T)" "G"         "C"         "C"         "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)" 
