R regex - splitting between parentheses
Suppose I have a string x that I want to split like this:
x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"
# Desired output
[1] "(A|C|T)" "A" "G" "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)"
[8] "(A|C|G|T)" "(A|C|G|T)" "G" "C" "C" "(C|T)" "(A|C|G|T)"
[15] "(A|C|G|T)" "(A|C|G)"
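I am using the splitting function below, but it does not split the parts of the string that are outside the parentheses. What is the best way to handle this with a regex?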
splitme <- function(x) {
x <- unlist(strsplit(x, "(?=\\()", perl=TRUE))
x <- unlist(strsplit(x, "(?<=\\))", perl=TRUE))
for (i in which(x=="(")) {
x[i+1] <- paste(x[i], x[i+1], sep="")
}
x[-which(x=="(")]
}
splitme(x)
[1] "(A|C|T)" "AG" "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "GCC"
[10] "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"
Something like this should work:
> library(stringi)
> unlist(stri_extract_all_regex(x, "\\([ACGT\\|]*\\)|[ACGT]"))
[1] "(A|C|T)" "A" "G" "(C|T)" "(A|C|G|T)" "(A|C|G|T)"
[7] "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "G" "C" "C"
[13] "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"
\\([ACGT\\|]*\\) matches each parenthesized group in full, and [ACGT] matches the remaining bare bases one at a time.
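If stringi is not available, the same extraction can probably be done in base R. This is only a sketch: using regmatches() with gregexpr() is my substitution, not part of the answer above, and the variable names pattern and tokens are made up for illustration.

```r
x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"

# Either a whole parenthesized group, or a single bare base.
# Inside a character class "|" is literal, so it needs no escaping.
pattern <- "\\([ACGT|]*\\)|[ACGT]"

# gregexpr() locates every match; regmatches() pulls out the matched text.
tokens <- regmatches(x, gregexpr(pattern, x))[[1]]
print(tokens)

# Extraction silently skips anything the pattern does not match,
# so check that the tokens glue back into the original string.
stopifnot(identical(paste(tokens, collapse = ""), x))
```

The final check matters because an extraction-based approach drops unexpected characters (an N, for instance) without any warning.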
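It looks like you want to split after every ), and after every letter that is followed by either another letter or a (.
If that is the behavior you want, you can use this: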
pat <- "(?<=\\))|(?<=[[:alpha:]])(?=[[:alpha:]\\(])"
strsplit(x, pat, perl=TRUE)[[1]]
# [1] "(A|C|T)" "A" "G" "(C|T)" "(A|C|G|T)" "(A|C|G|T)"
# [7] "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "G" "C" "C"
# [13] "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"
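To see which part of the pattern fires where, it may help to run it on a shorter input. This is a minimal sketch; the string small is a hypothetical example cut from the start of x, not something from the original post.

```r
pat <- "(?<=\\))|(?<=[[:alpha:]])(?=[[:alpha:]\\(])"

# Hypothetical short input: two bare letters followed by one group.
small <- "AG(C|T)"

# Between "A" and "G": second alternative (letter behind, letter ahead).
# Between "G" and "(": second alternative (letter behind, "(" ahead).
# Inside the group each letter is followed by "|" or ")", so nothing matches there.
pieces <- strsplit(small, pat, perl = TRUE)[[1]]
print(pieces)
```

Unlike an extraction pattern, this split does not depend on the alphabet being exactly A, C, G and T, since [[:alpha:]] accepts any letter.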
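To split a string into single letters, you can simply run strsplit(x, ""). All you have to do is make sure you do not apply it to the pieces that are already "done" (i.e. the ones with parentheses).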
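y <- splitme(x)
# TRUE for the pieces without a parenthesis, i.e. the bare runs "AG" and "GCC"
Indices <- !grepl("\\(", y)
# Assigning a list coerces y to a list; each bare run becomes a vector of single letters
y[Indices] <- strsplit(y[Indices], "")
unlist(y)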
[1] "(A|C|T)" "A" "G" "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)"
[9] "(A|C|G|T)" "G" "C" "C" "(C|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"