How to extract substrings in R using stringr::str_match

I have the following two strings:

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

With this regex I have no problem capturing the parts of x:

> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)")
     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]     
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"

What I want is to obtain this from y:

     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"

When I apply my current regex to y, I get this instead:

     [,1]                                 [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.combined" "chr1" "625000" "635000" "BB_162" "combined"

How can I generalize my regex so that it handles both x and y?

Update

S.Kalbar, your regex gave this:

> stringr::str_match(y,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]       [,7]     
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "combined" "HMSC-ad"
> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]      [,7]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose" NA 

What I'd like to get is this for y:

     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"

And this for x:

     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"

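Regex: (\\w+):(\\d+)-(\\d+)\\.(\\w+)(?:\\.\\w+)?(?:\\.([A-Za-z-]+))

The optional non-capturing group (?:\\.\\w+)? skips a middle field such as .combined when one is present, so the final field always ends up in the last capture group.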
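
As a quick check, the pattern can be applied to both strings from the question at once. This is only a sketch: the variable names `pattern` and `res` are illustrative, and passing both strings as one vector is a convenience rather than anything the question requires.

```r
library(stringr)

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

# Five capture groups, so str_match() returns six columns:
# the full match followed by one column per group.
pattern <- "(\\w+):(\\d+)-(\\d+)\\.(\\w+)(?:\\.\\w+)?(?:\\.([A-Za-z-]+))"

# One row per input string.
res <- str_match(c(x, y), pattern)

res[, 6]  # final field of each string
```

With these inputs the last column should hold "Adipose" for x and "HMSC-ad" for y, while the intermediate "combined" field is matched but not captured.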


Alternatively, you could give the regex engine some tokens to split on:

(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+

Broken down, this says:

(?:(?<=\\d)-(?=\\d))  # a dash between numbers
|                     # or
(?:\\.combined\\.)    # .combined. literally
|                     # or
[.:]+                 # one or more of . or :


In R, using str_split():

library(stringr)

x <- c("chr1:625000-635000.BB_162.Adipose", "chr1:625000-635000.BB_162.combined.HMSC-ad")
str_split(x, '(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+', simplify = TRUE)

Which yields:

     [,1]   [,2]     [,3]     [,4]     [,5]     
[1,] "chr1" "625000" "635000" "BB_162" "Adipose"
[2,] "chr1" "625000" "635000" "BB_162" "HMSC-ad"
