How to find substring from string in R?
If my string is a DNA sequence,
x<-"TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"
I want to extract every substring that runs from ATG to TAA, TGA or TAG. I am able to extract from one point to another by using the stringi package with a regex.
My code is:
stri_extract_all(x, regex = "ATG.*?TAA")
Please help me extend this to all three endings.
I believe that you meant str_extract_all from the stringr package. That function does not have an argument called regex; you need pattern. Once you get past that, you can just use the alternation operator | to allow any of the sequence endings.
library(stringr)
str_extract_all(x, pattern="ATG.*?(TAA|TGA|TAG)")
[[1]]
[1] "ATGCAACGAGGGGCATAA" "ATGCCCAAAATCTGA" "ATGACCGGGTAG"
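If you would rather not load a package, the same pattern can be tried in base R. This is only a sketch, assuming the goal is the same as above: non-overlapping matches, each ending at the first stop codon after its ATG. The variable names hits and orfs are made up for illustration.

```r
x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

# Find every non-overlapping match that starts at ATG and stops at the
# first TAA, TGA or TAG that follows (".*?" is a lazy quantifier).
hits <- gregexpr("ATG.*?(TAA|TGA|TAG)", x, perl = TRUE)

# regmatches() returns a list with one element per input string,
# so take the first element to get a plain character vector.
orfs <- regmatches(x, hits)[[1]]
print(orfs)
```

Since the question mentions stringi, note that stri_extract_all_regex(x, "ATG.*?(TAA|TGA|TAG)") should accept the same pattern there as well.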
Here is a possibility using Biostrings:
library("Biostrings")
x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"
# Get all combinations of substrings starting with "ATG" and ending with "TAA"
library(tidyverse)
df <- expand.grid(start(matchPattern("ATG", x)), end(matchPattern("TAA", x))) %>%
    filter(Var1 < Var2)
# Build the ranges once and reuse them for the extraction
ir <- IRanges(df[, 1], df[, 2])
extractAt(BString(x), ir)
#A BStringSet instance of length 3
# width seq
#[1] 18 ATGCAACGAGGGGCATAA
#[2] 44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
#[3] 20 ATGCCCAAAATCTGATATAA
Since you're working with DNA sequence data, I recommend familiarising yourself with Biostrings from Bioconductor. There are many Bioconductor packages beyond Biostrings that will make your life a lot easier down the track when you're working with DNA/RNA sequence data.
To account for multiple stop codons, simply wrap end(matchPattern(...)) within an sapply loop over the stop codons.
df <- expand.grid(
    start(matchPattern("ATG", x)),
    unlist(sapply(c("TAA", "TGA", "TAG"), function(ss) end(matchPattern(ss, x))))) %>%
    filter(Var1 < Var2)
# Build the ranges once and reuse them for the extraction
ir <- IRanges(df[, 1], df[, 2])
extractAt(BString(x), ir)
# [1] 18 ATGCAACGAGGGGCATAA
# [2] 44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
# [3] 20 ATGCCCAAAATCTGATATAA
# [4] 39 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGA
# [5] 15 ATGCCCAAAATCTGA
# ... ... ...
# [7] 23 ATGCCCAAAATCTGATATAATGA
# [8] 4 ATGA
# [9] 55 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG
#[10] 31 ATGCCCAAAATCTGATATAATGACCGGGTAG
#[11] 12 ATGACCGGGTAG
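For comparison, the same all-combinations idea can be sketched in base R without Biostrings. This is an illustration under assumptions, not a replacement for the approach above: the helper find_starts and the names starts, ends, grid and subs are hypothetical, and the fixed-string search relies on the fact that none of these codons can overlap a copy of itself.

```r
x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

# Hypothetical helper: start positions of a fixed motif, empty if absent.
find_starts <- function(motif, s) {
  pos <- as.integer(gregexpr(motif, s, fixed = TRUE)[[1]])
  pos[pos > 0]
}

# Start of every ATG, and end (start + 2) of every stop codon.
starts <- find_starts("ATG", x)
ends <- unlist(lapply(c("TAA", "TGA", "TAG"), find_starts, s = x)) + 2L

# Pair every start with every end, keeping the same filter as above.
grid <- expand.grid(start = starts, end = ends)
grid <- grid[grid$start < grid$end, ]

# Cut each candidate substring out of the sequence.
subs <- substring(x, grid$start, grid$end)
print(data.frame(width = nchar(subs), seq = subs))
```

Like the Biostrings version, this keeps overlapping candidates such as ATGA, where the stop codon shares bases with the start codon, so filter further if only biologically sensible reading frames are wanted.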