Find all repeated 4-mers in a DNA Sequence - Perl
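Hello,
I am trying to write a program that reads a FASTA-format file containing multiple DNA sequences, identifies all repeated 4-mers in each sequence (i.e., every 4-mer that occurs more than once), and prints each repeated 4-mer along with the header of the sequence in which it was found. A k-mer is simply a sequence of k nucleotides (e.g., "aaca", "gacg", and "tttt" are 4-mers).
Here is my code: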
use strict;
use warnings;
my $count = -1;
my $file = "sequences.fa";
my $seq = '';
my @header = ();
my @sequences = ();
my $line = '';
open (READ, $file) || die "Cannot open $file: $!.\n";
while ($line = <READ>){
    chomp $line;
    if ($line =~ /^>/){
        push @header, $line;
        $count++;
        unless ($seq eq ''){
            push @sequences, $seq;
            $seq = '';
        }
    } else {
        $seq .= $line;
    }
}
push @sequences, $line;
for (my $i = 0; $i <= $#sequences+1; $i++){
    if ($sequences[$i] =~ /(....)(.)*\g{1}+/g){
        print $header[$i], "\n", $&, "\n";
    }
}
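I have two requests. First, I do not know how to design my regular expression pattern to get the desired output. Second, and less important, I am sure my code is very inefficient, so if there is a way to shorten it, please let me know.
Thanks in advance!
Here is a sample of the FASTA file (note that there is an extra blank line between the sequences, which is not the case in the original FASTA file):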
>NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTttttttCGGATATTTCTGATGAGTCGAAAAAT
CCCTTACTTGAGGATAtatataAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCT

>NC_001501.1 Enterobacteria phage phiX184 sensu lato, complete genome
AACGGCTGGTCAGTATTTAAGGTTAGTGCTGAGGTTGACTACATCTGTTTTTAGAGACCCAGACCTTTTA
TCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTA
TATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTgagagagaGGTTTTCTTCATTGCATTCAGATGGA
TCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGC
CTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTG

>NC_001622.5 Enterobacteria phage phiX199 sensu lato, complete genome
TTCGCTGAATCAGGTTATTAAAGAGTTGCCGAGATATTTATGTTGGTTTCATGCGGATTGGTCGTTTAAA
TTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATAATGACCAAATCAAAGAACTCGTGATTAT
CTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG
TTGACGCCGGATTTGAGAATCAAAAATGTGAGAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGA
GATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGAC
CAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTA
TGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCA
AACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGAC
TTAGATGAGTGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAG
I would probably tackle your problem like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
#set paragraph mode: records are separated by blank lines.
local $/ = '';
#read from STDIN or a file specified on the command line,
#e.g. cat filename_here | myscript.pl
#or myscript.pl filename_here
while ( <> ) {
    #capture the header line, and then remove it from our data block.
    #the /m flag lets $ match at the end of the header line rather than
    #only at the end of the whole record.
    my ($header) = m/\>(.*)/;
    s/>.*$//m;
    #remove linefeeds and whitespace.
    s/\s*\n\s*//g;
    #use a lookahead pattern, so the data isn't 'consumed' by the regex.
    my @sequences = m/(?=([atcg]{4}))/gi;
    #increment a count for each sequence found.
    my %count_of;
    $count_of{$_}++ for @sequences;
    #print output. (Modify according to specific needs.)
    print $header,"\n";
    print "Found sequences:\n";
    print Dumper \@sequences;
    print "Count:\n";
    print Dumper \%count_of;
    #note - ordered, but includes duplicates.
    #you could just use keys %count_of, but that would be unordered.
    foreach my $sequence ( grep { $count_of{$_} > 1 } @sequences ) {
        print $sequence, " => ", $count_of{$sequence},"\n";
    }
    print "\n";
}
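We iterate record by record, capture and remove the 'header' line, and then join the rest together. We then capture every (overlapping) sequence of 4 characters and count the occurrences of each.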
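To see why the lookahead matters, here is a small sketch (the input string and variable names are illustrative only, not taken from the question's data). A plain capturing match consumes the four characters it matches, so the next match can only start after them, whereas a zero-width lookahead lets the regex engine advance one position at a time and capture every overlapping 4-mer.

```perl
use strict;
use warnings;

# Illustrative input only: six identical bases.
my $dna = 'tttttt';

# Plain match: each match consumes 4 characters, so only one 4-mer fits.
my @consumed = $dna =~ m/([atcg]{4})/gi;

# Lookahead: zero-width, so the engine moves forward one position per match
# and every overlapping 4-mer is captured.
my @overlapping = $dna =~ m/(?=([atcg]{4}))/gi;

print scalar(@consumed),    " match(es) without lookahead\n";   # 1
print scalar(@overlapping), " match(es) with lookahead\n";      # 3
```

This is consistent with the sample output, where the run of six lowercase t's is reported as tttt => 3.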
For your sample data, this prints the following (only the first section is shown, for brevity):
NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome
Found sequences:
GAGT => 2
AGTT => 2
TTAT => 2
CATG => 2
ATGA => 3
TGAC => 2
CGCA => 2
AGTT => 2
ACTT => 2
tttt => 3
tttt => 3
tttt => 3
GGAT => 2
GATA => 2
ATAT => 2
TATT => 2
ATGA => 3
TGAG => 2
GAGT => 2
AAAA => 2
AAAA => 2
ACTT => 2
TGAG => 2
GGAT => 2
GATA => 2
tata => 2
tata => 2
TTAT => 2
TATG => 2
ATAT => 2
TATT => 2
GCCG => 2
TATG => 2
GCCG => 2
CGCA => 2
CATG => 2
ATGA => 3
TGAC => 2
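Note: because this loops over the original list of matches, the output follows the order in the data, and you will see TGAC twice because ... it is in there twice.
But you could instead do: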
foreach my $sequence ( sort { $count_of{$b} <=> $count_of{$a} }
                       grep { $count_of{$_} > 1 }
                       keys %count_of ) {
    print $sequence, " => ", $count_of{$sequence},"\n";
}
print "\n";
This discards any 4-mer that occurs fewer than 2 times, prints each remaining 4-mer once, and sorts them by frequency.
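The question mentions that the real FASTA file has no blank lines between records, in which case paragraph mode would read the whole file as a single record. One possible workaround, sketched here under that assumption, is to split records on `>` instead. The subs `read_fasta` and `repeated_kmers` are hypothetical helpers introduced for illustration, and the two records in the usage part are made up. Like the script above, this sketch counts 4-mers case-sensitively.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle, splitting on '>'
# so that blank lines between records are not required.
# Assumes '>' never appears inside a header or a sequence line.
sub read_fasta {
    my ($fh) = @_;
    local $/ = '>';                    # each read ends at the next '>'
    my @records;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;                  # remove the trailing '>' separator
        next unless $chunk =~ /\S/;    # skip the empty chunk before the first '>'
        my ( $header, @lines ) = split /\n/, $chunk;
        my $seq = join '', @lines;
        $seq =~ s/\s+//g;              # drop any stray whitespace
        push @records, [ $header, $seq ];
    }
    return @records;
}

# Hypothetical helper: return [k-mer, count] pairs for the k-mers that occur
# more than once, in order of first appearance and without duplicates.
sub repeated_kmers {
    my ( $seq, $k ) = @_;
    my @kmers = $seq =~ m/(?=([atcg]{$k}))/gi;
    my ( %count_of, %seen );
    $count_of{$_}++ for @kmers;
    return map { [ $_, $count_of{$_} ] }
           grep { $count_of{$_} > 1 && !$seen{$_}++ } @kmers;
}

# Usage with an in-memory file; the data below is invented for the example.
my $fasta = ">seq1 example\nACGTAC\nGTAA\n>seq2 example\nGGGGG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
for my $record ( read_fasta($fh) ) {
    my ( $header, $seq ) = @$record;
    print "$header\n";
    print "$_->[0] => $_->[1]\n" for repeated_kmers( $seq, 4 );
}
close $fh;
```

For the made-up input this prints ACGT => 2 and CGTA => 2 under the first header and GGGG => 2 under the second. To read a real file, the in-memory handle could be replaced with one opened on the file name.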