R: How do I count gaps at the start of a sequence alignment?

Question

I am analyzing an alignment of amino acid sequences in R and need a reproducible way to find where each sequence starts. My alignment can be read in as a data frame. Here is a sample of three sequences:

alignment <- data.frame("Strains" = c("Strain.1", "Strain.2", "Strain.3"),
               "Sequence" = c("MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSL---------------------",
                 "MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSLPTDFAIS--------------",
                 "-----------------------NIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTVEGVVIQGTNNVDRWLATILIEPNVQATNRTYNLFGQQEILLIENTSQTQWKFVDVSKTTPTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNVT-TGY-YSTTNYDTVNMT-----------------------------------------------------"))

Each dash represents a gap. I want to go through my data frame and count how many gaps are at the beginning of each sequence. So far I've tried the str_count function (from stringr). For example:

alignment$shift <- str_count(alignment$Sequence, "-")

but this fails when there are gaps further downstream in the sequence, because it counts every dash. I'm only interested in the gaps at the beginning of each sequence.

I came across a regex approach in a post that almost perfectly matches my problem (How to count the number of hyphens at the beginning of a string in javascript?), but that is in JavaScript and I'm not sure how to translate it to R.

My questions are:

1) Is it possible to have str_count stop looking for "-" characters once it reaches a non-"-" character?

2) Is there a way to use regex or a similar function in R that outputs the length of a character match at the beginning of a string?

Answer 1

You could do this:

alignment$Sequence <- as.character(alignment$Sequence) #in case they are factors (as above)

alignment$shift <- nchar(alignment$Sequence) - nchar(gsub("^-+", "", alignment$Sequence))

alignment$shift
[1]  0  0 23

This counts the characters removed when gsub deletes a run of one or more dashes ( -+ ) anchored at the start of the string ( ^ ). You could use str_replace instead of gsub .
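The second question asks for the length of a match at the start of a string. A minimal base R sketch of that idea, assuming the sequences are a character vector (or a factor that can be converted): regexpr reports the length of each match in its "match.length" attribute, so no deletion step is needed. The helper name count_leading_gaps and the toy sequences are made up for illustration.

```r
# Count leading gap characters ("-") in each aligned sequence, base R only.
# count_leading_gaps is a hypothetical helper name, not part of any package.
count_leading_gaps <- function(seqs) {
  seqs <- as.character(seqs)           # guard against factor columns
  m <- regexpr("^-*", seqs)            # zero or more dashes anchored at the start
  as.integer(attr(m, "match.length"))  # match length; 0 when there is no leading gap
}

# Toy sequences, invented for this example
toy <- c("MASL-IY", "---NIG-SA", "-A--")
count_leading_gaps(toy)
# [1] 0 3 1
```

Using "^-*" rather than "^-+" means a sequence with no leading gap still matches (with length 0), so the result needs no special handling for non-matches. On the data frame from the question this would be used as alignment$shift <- count_leading_gaps(alignment$Sequence).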

Answer 2

Maybe this will help. It returns the start and end positions of the run of dashes, but only if that run begins at the start of the string. Because the pattern also matches the first letter after the dashes, the number of leading gaps is end - 1 (23 for the third sequence).

library(stringr)

str_locate_all(string = alignment$Sequence, pattern = "^-{1,}[A-Z]")
[[1]]
     start end

[[2]]
     start end

[[3]]
     start end
[1,]     1  24
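To turn the located positions into a gap count per sequence, one option is str_locate (the single-match variant) with a pattern that matches only the dashes. This is a sketch assuming stringr is installed; the helper name leading_gaps_stringr and the toy sequences are invented for illustration. Anchoring the pattern in str_count would not give this number, since str_count counts matches rather than characters and "^-+" can match at most once per string.

```r
library(stringr)

# Hypothetical helper: number of leading "-" characters per sequence.
leading_gaps_stringr <- function(seqs) {
  loc <- str_locate(as.character(seqs), "^-+")  # one row per string; NA when no match
  n <- loc[, "end"]                             # match starts at 1, so end equals the run length
  n[is.na(n)] <- 0L                             # no leading gap -> 0
  unname(n)
}

# Toy sequences, invented for this example
toy <- c("MASL-IY", "---NIG-SA", "-A--")
leading_gaps_stringr(toy)
# [1] 0 3 1
```

Because the match is anchored at position 1, the end position is the length of the leading run, so no subtraction is needed here.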

2 answers

Answer 1 (accepted): 2018-04-06 15:58:10

Answer 2: 2018-04-06 16:40:30
