
Extract character string in middle of string with R

I have character strings which look something like this:

a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")

I want to extract only the middle portion, e.g. hsa-mir-521-3p and hsa-mir-947b.

I have tried the following so far:

a1 <- substr(a, 8,21) 
[1] "hsa-mir-521-3p" "hsa-mir-947b.r"  

This obviously does not work because my desired substrings have varying lengths.

a2 <- sub('miRNA__', '', a)
[1] "hsa-mir-521-3p.iso.t5:" "hsa-mir-947b.ref.t5:"  

This removes the upstream string ("miRNA__"), but I still need to remove the downstream string.

Could someone please advise what else I could try or if there is a simpler way to achieve this? I am still learning how to code with R. Thank you very much!

You haven't clearly defined the "middle portion", but based on the data shared we can extract everything between the last underscore ( "_" ) and the first dot ( "." ) after it.

sub('.*_(.*?)\\..*', '\\1', a)
#[1] "hsa-mir-521-3p" "hsa-mir-947b"  
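
If the prefix is always miRNA__, the pattern can also be anchored on it explicitly. A minimal sketch, assuming that fixed prefix and that the wanted part never contains a dot (the character class [^.]* is used here in place of the lazy .*?):

```r
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")

# ^miRNA__  : the literal prefix at the start of the string (assumed fixed)
# ([^.]*)   : capture everything up to, but not including, the first dot
# \\..*$    : the first dot and the rest of the string
res <- sub('^miRNA__([^.]*)\\..*$', '\\1', a)
print(res)
#[1] "hsa-mir-521-3p" "hsa-mir-947b"
```

Strings that do not start with miRNA__ are returned unchanged by sub, which makes unexpected input easier to spot.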

You can try the following regex:

> gsub(".*_|\\..*","",a)
[1] "hsa-mir-521-3p" "hsa-mir-947b" 

which removes the leading part ( .*_ ) and the trailing part ( \\..* ), keeping only the middle part.
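
To see what each alternative does, the same result can be reached in two separate steps. This is only an illustrative sketch; the intermediate names no_prefix and middle are made up here:

```r
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")

# Step 1: greedy .*_ removes everything up to and including the last underscore
no_prefix <- sub(".*_", "", a)
# Step 2: \\..* removes the first dot and everything after it
middle <- sub("\\..*", "", no_prefix)
print(middle)
#[1] "hsa-mir-521-3p" "hsa-mir-947b"
```

This is essentially the questioner's own sub call followed by a second one for the downstream part; gsub with the alternation just does both in one pass.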

We could also use trimws from base R (the whitespace argument requires R 3.6.0 or later):

trimws(a, whitespace = '.*_|\\..*')
#[1] "hsa-mir-521-3p" "hsa-mir-947b"  
