
R: How to count gaps at the beginning of a sequence alignment?

I am analyzing an alignment of amino acid sequences in R and need a reproducible way to find where each sequence starts. The alignment can be read in as a data frame. Here is a sample of three sequences:

alignment <- data.frame("Strains" = c("Strain.1", "Strain.2", "Strain.3"),
               "Sequence" = c("MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSL---------------------",
                 "MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSLPTDFAIS--------------",
                 "-----------------------NIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTVEGVVIQGTNNVDRWLATILIEPNVQATNRTYNLFGQQEILLIENTSQTQWKFVDVSKTTPTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNVT-TGY-YSTTNYDTVNMT-----------------------------------------------------"))

Each dash represents a gap in the alignment. I want to go through the data frame and count how many gaps there are at the beginning of each sequence. So far I have tried the str_count function from stringr. For example:

library(stringr)  # provides str_count
alignment$shift <- str_count(alignment$Sequence, "-")

but this fails when there are gaps further downstream in the sequence, because it counts every dash. I am only interested in the gaps that occur at the beginning of each sequence.
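To make the over-count concrete, here is a minimal base R sketch. The two short sequences are made up for illustration and are not taken from the alignment above; the `gregexpr`/`regmatches` combination is used as a rough stand-in for `str_count`.

```r
# Toy sequences (hypothetical, for illustration only)
seqs <- c("MASL-IY--", "---NIG-SA")

# Count every dash in each string, wherever it occurs
total_dashes <- lengths(regmatches(seqs, gregexpr("-", seqs, fixed = TRUE)))

total_dashes
# [1] 3 4
```

The second toy sequence has only three leading gaps, but the total is four because the internal gap is counted as well.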

I came across a regex approach in a post that almost exactly matches my problem (How to count the number of hyphens at the beginning of a string in javascript?), but that is in JavaScript and I am not sure how to translate it to R.

My questions are:

1) Is it possible to have str_count stop looking for "-" characters once it reaches a non-"-" character?

2) Is there a way to use regex or a similar function in R that outputs the length of a character match at the beginning of a string?

You could do this:

alignment$Sequence <- as.character(alignment$Sequence) #in case they are factors (as above)

alignment$shift <- nchar(alignment$Sequence) - nchar(gsub("^-+", "", alignment$Sequence))

alignment$shift
[1]  0  0 23

This counts the characters removed when gsub deletes a run of one or more dashes ( -+ ) anchored at the start of the string ( ^ ). Because the pattern is anchored, it can match at most once per string, so you could use str_replace instead of gsub .
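As a possible alternative that reads the match length directly rather than comparing string lengths, base R's `regexpr` stores the length of each match in its `match.length` attribute. This is only a sketch on made-up toy sequences, not the alignment from the question:

```r
# Toy sequences standing in for alignment$Sequence (hypothetical)
seqs <- c("MASL-IY--", "---NIG-SA", "-----")

# "^-*" matches the (possibly empty) run of dashes at the start of each string;
# regexpr() records the length of that match in the "match.length" attribute
leading_gaps <- attr(regexpr("^-*", seqs), "match.length")

leading_gaps
# [1] 0 3 5
```

Using `*` rather than `+` means a sequence with no leading gaps still matches (with length 0) instead of returning -1.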

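Maybe this will help. str_locate_all returns the start and end positions of a run of dashes only when the run begins at the start of the string. Note that the pattern below also matches the first residue after the gaps, so the end index (24 for the third sequence) is one more than the number of leading gaps (23).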

library(stringr)

str_locate_all(string = alignment$Sequence, pattern = "^-{1,}[A-Z]")
[[1]]
     start end

[[2]]
     start end

[[3]]
     start end
[1,]     1  24
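The same position-based idea can be sketched in base R by locating the first character that is not a dash. The toy sequences and the handling of an all-gap sequence are assumptions added for illustration:

```r
# Toy sequences (hypothetical, for illustration only)
seqs <- c("MASL-IY--", "---NIG-SA", "-----")

# Position of the first non-dash character; regexpr() gives -1 if there is none
first_residue <- as.integer(regexpr("[^-]", seqs))

# Leading gaps = position of the first residue minus one.
# An all-gap sequence has no residue, so its whole length is counted instead.
leading_gaps <- ifelse(first_residue == -1L, nchar(seqs), first_residue - 1L)

first_residue
# [1]  1  4 -1
leading_gaps
# [1] 0 3 5
```

Here `first_residue` answers "where does each sequence start" directly, and the gap count follows from it.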
