R: How to count gaps at the beginning of a sequence alignment?
I am analyzing an alignment of amino acid sequences in R and need a reproducible way to find where each sequence starts. My alignment can be read in as a data frame. Here is a sample of three sequences:
alignment <- data.frame("Strains" = c("Strain.1", "Strain.2", "Strain.3"),
"Sequence" = c("MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSL---------------------",
"MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSLPTDFAIS--------------",
"-----------------------NIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTVEGVVIQGTNNVDRWLATILIEPNVQATNRTYNLFGQQEILLIENTSQTQWKFVDVSKTTPTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNVT-TGY-YSTTNYDTVNMT-----------------------------------------------------"))
Each dash represents a gap (a space in the alignment). What I want to do is read through my data frame and count how many gaps are at the beginning of each sequence. So far I've tried the str_count function from the stringr package. For example:
alignment$shift <- str_count(alignment$Sequence, "-")
but this fails when a sequence has gaps further downstream, because it counts every dash in the string. Really, I'm only interested in the gaps that occur at the beginning of the sequences.
I came across a regex approach in a post that almost perfectly matches my problem ( How to count the number of hyphens at the beginning of a string in javascript? ), but that is in JavaScript and I'm not sure how to translate it to R.
My questions are:
1) Is it possible to have str_count stop looking for "-" characters once it reaches a non-"-" character?
2) Is there a way to use a regex or a similar function in R that outputs the length of a character match at the beginning of a string?
You could do this:
alignment$Sequence <- as.character(alignment$Sequence) #in case they are factors (as above)
alignment$shift <- nchar(alignment$Sequence) - nchar(gsub("^-+", "", alignment$Sequence))
alignment$shift
[1] 0 0 23
It just counts the number of characters removed by telling gsub to delete a run of one or more dashes ( -+ ) anchored at the start of the string ( ^ ). Because the pattern is anchored, it can match at most once per string, so sub would work equally well. You could also use str_replace instead of gsub.
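For the second question, base R can also report the length of a match directly, without deleting anything. The following is a minimal sketch, assuming a small made-up vector `seqs` in place of the real alignment column; it relies on `regexpr()`, which stores the length of the first match in its `match.length` attribute. The pattern `^-*` is used rather than `^-+` so that sequences without a leading gap give a zero-length match instead of no match.

```r
# Toy sequences (hypothetical, much shorter than the real alignment)
seqs <- c("MASLIY-RQ--", "MAS-LIYRQL-", "---NIG-SAK-")

# "^-*" matches the (possibly empty) run of dashes at the start of each string.
# regexpr() records the length of that match in the "match.length" attribute.
leading_gaps <- attr(regexpr("^-*", seqs), "match.length")

# Cross-check against the subtraction approach described above
by_removal <- nchar(seqs) - nchar(sub("^-+", "", seqs))

leading_gaps
by_removal
```

On a data frame like the one in the question, the same expression could be assigned to a new column, for example `alignment$shift <- attr(regexpr("^-*", alignment$Sequence), "match.length")`, after converting `Sequence` to character if it is a factor.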
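Maybe this helps? It returns the start and end positions of a run of dashes, but only when the run begins at the start of the string. Note that the pattern also matches the first letter after the run, so the number of leading gaps is end - 1 (here 24 - 1 = 23). Sequences with no leading gap return an empty result.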
library(stringr)
str_locate_all(string = alignment$Sequence, pattern = "^-{1,}[A-Z]")
[[1]]
     start end

[[2]]
     start end

[[3]]
     start end
[1,]     1  24
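If a plain count per sequence is more convenient than a list of matrices, the stringr approach can be adapted. This is only a sketch on a made-up vector `seqs`, not the method from the answer above: it drops the trailing `[A-Z]` from the pattern so that `end` equals the gap count, and uses `str_locate()` (first match only), which returns `NA` where nothing matches.

```r
library(stringr)

# Toy sequences (hypothetical, much shorter than the real alignment)
seqs <- c("MASLIY-RQ--", "---NIG-SAK-")

# One row per sequence; columns "start" and "end"; NA when there is no leading dash
loc <- str_locate(seqs, "^-+")

# The match starts at position 1, so "end" is the number of leading dashes;
# sequences with no match get a shift of 0.
shift <- ifelse(is.na(loc[, "end"]), 0L, loc[, "end"])

# Equivalent alternative: extract the (possibly empty) leading run and measure it
shift_alt <- nchar(str_extract(seqs, "^-*"))

shift
shift_alt
```

As for the first question, `str_count(seqs, "^-+")` would not help here, since `str_count` counts the number of matches (0 or 1 for an anchored pattern), not the number of characters matched.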