Extract character string in middle of string with R

Question

I have character strings which look something like this:

a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")

I want to extract only the middle portion, e.g. hsa-mir-521-3p and hsa-mir-947b.

I have tried the following so far:

a1 <- substr(a, 8, 21)
[1] "hsa-mir-521-3p" "hsa-mir-947b.r"

This does not work because my desired substrings have varying lengths.

a2 <- sub('miRNA__', '', a)
[1] "hsa-mir-521-3p.iso.t5:" "hsa-mir-947b.ref.t5:"

This removes the upstream string ("miRNA__"), but I still need to remove the downstream string.

Could someone please advise what else I could try or if there is a simpler way to achieve this? I am still learning how to code with R. Thank you very much!

Answer 1

You haven't clearly defined the "middle portion", but based on the data shared, we can extract everything between the last underscore ("_") and the first dot (".") that follows it.

sub('.*_(.*?)\\..*', '\\1', a)
#[1] "hsa-mir-521-3p" "hsa-mir-947b"
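To make the structure of such a pattern easier to see, here is a minimal sketch that builds a close variant piece by piece. It is not the pattern from the answer above: it swaps the lazy group (.*?) for a negated character class [^.]*, which should give the same result on the sample data. The names pattern and middle are illustrative only.

```r
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")

# Illustrative variant (not the answer's own pattern): [^.]* replaces (.*?)
pattern <- paste0(
  ".*_",      # greedy: everything up to and including the last underscore
  "([^.]*)",  # capture group: a run of characters that are not dots
  "\\.",      # the literal dot that ends the captured part
  ".*"        # the rest of the string
)

middle <- sub(pattern, "\\1", a)  # replace the whole match with group 1
print(middle)
#[1] "hsa-mir-521-3p" "hsa-mir-947b"
```

Because the pattern matches the whole string, replacing the match with \\1 leaves only the captured middle part. A string that has no underscore followed by a dot would not match and would be returned unchanged.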

Answer 2

You can try the following regex:

> gsub(".*_|\\..*","",a)
[1] "hsa-mir-521-3p" "hsa-mir-947b"

which removes the left-most part ( .*_ , everything up to and including the last underscore) and the right-most part ( \\..* , the first dot and everything after it), therefore keeping the middle part.
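The same removal can be sketched as two separate sub calls, one per alternative of the regex, which may be easier to follow when learning. The intermediate names no_prefix and middle are made up for this illustration.

```r
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")

# Step 1: drop everything up to and including the last underscore.
no_prefix <- sub(".*_", "", a)

# Step 2: drop the first dot and everything after it.
middle <- sub("\\..*", "", no_prefix)

print(middle)
#[1] "hsa-mir-521-3p" "hsa-mir-947b"
```

The single gsub call does both steps in one pass because the | in the pattern lets either alternative match, and gsub replaces every match rather than only the first.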

Answer 3

We could also use trimws from base R (its whitespace argument requires R 3.6.0 or later):

trimws(a, whitespace = '.*_|\\..*')
#[1] "hsa-mir-521-3p" "hsa-mir-947b"


solution1
1 2020-10-22 07:39:11

solution2
1 2020-10-22 13:37:40

solution3
1 2020-10-22 20:50:01
