Suppose I have a string x that I want to split like so:
x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"
# Desired output
[1] "(A|C|T)" "A" "G" "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)"
[8] "(A|C|G|T)" "(A|C|G|T)" "G" "C" "C" "(C|T)" "(A|C|G|T)"
[15] "(A|C|G|T)" "(A|C|G)"
I am using the splitting function below, but it does not split the letters that sit outside the parentheses (for example, "AG" and "GCC" stay together). What would be the best way to approach this regex problem?
splitme <- function(x) {
  x <- unlist(strsplit(x, "(?=\\()", perl=TRUE))
  x <- unlist(strsplit(x, "(?<=\\))", perl=TRUE))
  for (i in which(x=="(")) {
    x[i+1] <- paste(x[i], x[i+1], sep="")
  }
  x[-which(x=="(")]
}
splitme(x)
[1] "(A|C|T)" "AG" "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "GCC"
[10] "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"
Something like this should work:
> library(stringi)
> unlist(stri_extract_all_regex(x, "\\([ACGT\\|]*\\)|[ACGT]"))
[1] "(A|C|T)" "A" "G" "(C|T)" "(A|C|G|T)" "(A|C|G|T)"
[7] "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "G" "C" "C"
[13] "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"
\\([ACGT\\|]*\\) will match everything enclosed in parentheses, and [ACGT] will match each of the remaining bases.
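If adding a stringi dependency is not desirable, the same "extract the tokens instead of splitting" idea can be sketched in base R with gregexpr() and regmatches(). This is only a minimal sketch under the assumption that the input contains nothing but the bases A, C, G, T, the "|" separator and parentheses; the variable name tokens is just an illustrative choice.

```r
x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"

# Same alternation as above: a parenthesized group, or a single base.
pat <- "\\([ACGT|]*\\)|[ACGT]"

# gregexpr() locates every match; regmatches() pulls out the matched text.
tokens <- regmatches(x, gregexpr(pat, x))[[1]]
print(tokens)

# Sanity check: gluing the tokens back together should reproduce the input,
# which would fail if the pattern skipped any character.
stopifnot(identical(paste(tokens, collapse = ""), x))
```

The final check is a cheap way to notice unexpected characters in the input, since anything the pattern does not match is silently dropped by an extraction approach.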
It looks like you'd like to split the string after each ")", and after each letter that's followed by either another letter or by a "(". If that's the behavior you'd like, you can use this:
pat <- "(?<=\\))|(?<=[[:alpha:]])(?=[[:alpha:]\\(])"
strsplit(x, pat, perl=TRUE)[[1]]
# [1] "(A|C|T)" "A" "G" "(C|T)" "(A|C|G|T)" "(A|C|G|T)"
# [7] "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "G" "C" "C"
# [13] "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"
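To see what each half of that pattern contributes, it may help to build it from its two alternatives and try it on a shorter string. The input s and the names after_paren and after_letter below are hypothetical, introduced only for illustration.

```r
# Hypothetical short input, chosen only to illustrate the pattern.
s <- "(A|C)AG(C|T)"

# Zero-width boundary immediately after a closing parenthesis.
after_paren <- "(?<=\\))"
# Zero-width boundary after a letter that is followed by a letter or by "(".
after_letter <- "(?<=[[:alpha:]])(?=[[:alpha:]\\(])"

pat <- paste(after_paren, after_letter, sep = "|")

pieces <- strsplit(s, pat, perl = TRUE)[[1]]
print(pieces)
# [1] "(A|C)" "A"     "G"     "(C|T)"
```

Letters inside the parentheses are left alone because they are followed by "|" or ")", which neither alternative accepts as a boundary.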
To split a string into single letters, you can just run strsplit(x, ""). You only have to make sure not to apply that to the "finished" strings (i.e. the ones with parentheses).
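y = splitme(x)
# TRUE for the pieces without parentheses, i.e. runs of plain letters such as "AG"
Indices = !grepl("\\(", y)
# Assigning a list turns y into a list, so each run can hold several letters
y[Indices] = strsplit(y[Indices], "")
unlist(y)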
[1] "(A|C|T)" "A" "G" "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)"
[9] "(A|C|G|T)" "G" "C" "C" "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"