I have the following two strings:
x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"
With this regex I have no problem capturing the parts of x:
> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)")
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"
What I want is to obtain this for y:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"
Applying my current regex to y, I get this instead:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.combined" "chr1" "625000" "635000" "BB_162" "combined"
How can I generalize my regex so that it can deal with both x and y?
Update
S.Kalbar, your regex gave this:
> stringr::str_match(y,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "combined" "HMSC-ad"
> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose" NA
What I'd like to get is this for y:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"
And this for x:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"
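Regex: (\\w+):(\\d+)-(\\d+)\\.(\\w+)(?:\\.\\w+)?(?:\\.([A-Za-z-]+))
The optional non-capturing group (?:\\.\\w+)? skips a middle segment such as .combined when it is present, so the last capture group always holds the final field.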
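To check this against both inputs, here is a minimal sketch. It assumes stringr is installed; the variable names pattern and m are only for illustration.

```r
library(stringr)

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

# Same pattern as above, written as an R string (backslashes doubled).
pattern <- "(\\w+):(\\d+)-(\\d+)\\.(\\w+)(?:\\.\\w+)?(?:\\.([A-Za-z-]+))"

# str_match() is vectorised: one row per input string,
# column 1 is the full match, columns 2 to 6 are the capture groups.
m <- str_match(c(x, y), pattern)
m[, 6]  # last field: "Adipose" "HMSC-ad"
```

Because the skipped segment is matched by \\w+, this only handles a single middle segment made of word characters. Inputs with more than one extra segment would need a different pattern.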
Alternatively, you could give the regex engine a set of tokens to split on:
(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+
Broken down, this says:
(?:(?<=\\d)-(?=\\d)) # a dash between numbers
| # or
(?:\\.combined\\.) # .combined. literally
| # or
[.:]+ # one or more of . or :
In R, using str_split():
library(stringr)
x <- c("chr1:625000-635000.BB_162.Adipose", "chr1:625000-635000.BB_162.combined.HMSC-ad")
str_split(x, '(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+', simplify = TRUE)
Which yields
[,1] [,2] [,3] [,4] [,5]
[1,] "chr1" "625000" "635000" "BB_162" "Adipose"
[2,] "chr1" "625000" "635000" "BB_162" "HMSC-ad"