How can I make a histogram of occurrences of specific patterns in a FASTA file?

I have written a Perl script for the following bioinformatics question, but unfortunately there is a problem with the output.

Question

1) From a file of 40,000 unique sequences (unique meaning that each has a distinct sequence ID number), extract the following pattern:

 $gpat = '[G]{3,5}'; $npat = '[A-Z]{1,25}';
 $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;

2) For each sequence, find whether $pattern occurs within each of the following position ranges:

  • 0-100
  • 100-200
  • 200-300
  • ...
  • 900-1000
  • >1000

If a sequence is shorter than 1000 characters, the same divisions must still be maintained, i.e. 0-100, 100-200, etc.
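As a rough illustration of the divisions described above, here is a minimal sketch, not part of my actual script. The helper names `bin_index` and `count_bins` and the demo sequence are made up for the example, and it assumes half-open bins, so that an offset of exactly 1000 lands in the final ">1000" bin.

```perl
use strict;
use warnings;

# Pattern from the question: four runs of 3-5 G's separated by
# 1-25 letters, matched case-insensitively.
my $gpat  = '[G]{3,5}';
my $npat  = '[A-Z]{1,25}';
my $regex = qr/$gpat$npat$gpat$npat$gpat$npat$gpat/i;

# Hypothetical helper: map a 0-based match offset to a bin index.
# Bins 0..9 cover offsets 0-99, 100-199, ..., 900-999 (assumed
# half-open); bin 10 collects every offset of 1000 or more.
sub bin_index {
    my ($offset) = @_;
    my $i = int($offset / 100);
    return $i > 10 ? 10 : $i;
}

# Hypothetical helper: count the matches per bin for one sequence.
# The bins are fixed, so the sequence length does not matter.
sub count_bins {
    my ($sequence) = @_;
    my @counts = (0) x 11;
    while ($sequence =~ /$regex/g) {
        $counts[ bin_index($-[0]) ]++;    # $-[0] is the match start
    }
    return @counts;
}

# Made-up sequence: 150 a's, one match starting at offset 150,
# then 50 more a's.
my $demo = ('a' x 150) . 'gggtgggtgggtggg' . ('a' x 50);
print join(' ', count_bins($demo)), "\n";
```

Because the bin boundaries do not depend on the sequence, a sequence shorter than 1000 characters simply leaves the later bins at zero.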

The Issue

The main issue I am having is with counting the number of times $pattern occurs in each subdivision of a sequence and then adding up those counts across all the sequences.

For example, say $pattern occurs 5 times at positions >1000 in sequence 1 and 3 times at positions >1000 in sequence 2. Then the total count should be 5 + 3 = 8.

Instead, my result comes out as (5+4+3+2+1) + (3+2+1) = 21, i.e. a cumulative total.

I am facing the same issue with the counts for the first 10 subdivisions of 100 characters each.

I would be grateful if correct code could be provided for this calculation.

The code I have written is below. It is heavily derived from Borodin's answer to one of my previous questions here: Perl: Search a pattern across array elements

His answer is here: https://stackoverflow.com/a/11206399/1468737

The Code:

use strict;
use warnings;

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat; 
my $regex = qr/$pattern/i;

open my $fh, '<', 'small.fa' or die $!;

my ($id, $seq); 
my @totals = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0); # initialize @totals: one slot for
# each of the 10 divisions up to 1000 bp
my @thousandcounts =(0); #counting total occurrences of $pattern at >1000 length

while (<$fh>) {
  chomp;

  if (/^>(\w+)/) {
    process_seq($seq) if $id;
    $id = $1;
    $seq = '';
    print "$id\n";
  }
  elsif ($id) {
    $seq .= $_;
    process_seq($seq) if eof;
  }
}

print "Totals : @totals\n";
print "Thousand Counts total : @thousandcounts\n";

##**SUBROUTINE**    

sub process_seq {

  my $sequence = shift @_;   
  my $subseq = substr $sequence,0,1000;
  my $length = length $subseq;
  print $length,"\n";

  if ($length eq 1000) {

  my @offsets = map {sprintf '%.0f', $length * $_/ 10} 1..10;
  print "Offsets of 10 divisions: @offsets\n";

  my @counts = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
  my @count = (0); 

     while ($sequence =~ /$regex/g) {
     my $place = $-[0];
     print $place,"\n\n"; 

        if ($place <=1000){
        for my $i (0..9) { 
        next if $place >= $offsets[$i];                   
        $counts[$i]++;                                    
        last;
        }       

     }
      print "Counts : @counts\n\n";

      $totals[$_] += $counts[$_] for 0..9; 



        if ($place >1000){

        for my $i(0){
        $count[$i]++;
        last;
        }




        } print "Count greater than 1000 : @count\n\n"; 

         $thousandcounts[$_] += $count[$_] for 0;


  } 

} 

   #This region of code is for those sequences whose total length is less than 1000
   #It is working great ! No issues here
   elsif ($length != 1000) {

    my $substr = join ' ', unpack '(A100)*', $sequence;

    my @offsets = map {sprintf '%.0f', $length * $_/ ($length/100)} 1..10;
    print "Offsets of 10 divisions: @offsets\n";

    my @counts = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0,);

       while ($sequence =~ /$regex/g) {
       my $place = $-[0];
       print "Place : $place","\n\n"; 

         for my $i (0..9) { 
         next if $place >= $offsets[$i];                   
         $counts[$i]++;
         last;
        }
      }
       print "Counts : @counts\n\n";

       $totals[$_] += $counts[$_] for 0..9;

  }


}#subroutine ends

I am also attaching a small segment of the file I am working with. It is named small.fa, and I have been experimenting only with this file before moving on to the bigger file containing >40,000 sequences.

>NR_037701 1
aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
gctccctcttttaaagattttccttccctctttccaactccctgggtcct
ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
cacagactcaaaccctctctcacacacatacacatatacattgttattcc
acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
agggttgggacttcaacacagctttttgggggatcataattcaacccatg
acagccactgagattattatatctccagagaataaatgtgtggagttaaa
aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
>NR_002714 1
gttatacatctctaccattacctagcctgaaaagccacctcagattcagc
caacaagtaagtgggcattacaggagaagggtacctttcacaagggctgt
aatctaaaatcttggggaagatacagcgtcatctgtccaagaggtgtcag
cagtaacgaagcctcagtagaagccaaagttattttggattactgagcct
gtatagtttccagattctcaagagaaatatatgggaatgtagatatctca
gaggaccttcctgctgtcaggaattcagaggaggaaataaggaaggtaat
aggtgctctgctctcattctctcaaaccctcttccctgtgttttcctata
gagattgctgatttgctccttaagcaagagattcactgctgctcagcatg
gctcagaccaactcatgcttcatgctgatctcctgcctgatgttcctgtc
tctgagccaaggtgagattgttttccccacacatacctcccacaacccca
gccctgaagccctcactctatcctcatgcatatgagttcacttgagaaaa
agcagagtcaagttcaggggttgttttgtgttgttcagtgatatttattg
ctgatctcatcccattcaaaaacatcctgacctccctaaggagttagaga
tggaacttagcataaccctttatcagtgaccactgcagttggcattggtt
tgtcatattaacactactcatgatgggggtgttgaggatgtctgtttgta
gacagtcattagtggaatggggaactgaggggagctttgtgtgtagagaa
actggacaggcttgagaaagaagcctcagtccttcaaggaagaaaaagcc
ataagtaaaagggacaatggggacacttttcatgagcctattcattgtgt
gctcttgtcttgagcaaagacatcttgagagcctataggtaagatgcaga
agggcagaagtgaccaatcgcttcgtgacctataggatccttctattcct
ataaagaatcctcagaagctcctacctcatattttagcctttaccttgcc
ctgagggtctttcttaattgtctctcttttcccaggacaggaggcccatg
ctgagttgcccaaggcccagatcagctgcccagaaggcaccagtgcctaa
ggctcccactgctactactttaatgaagagcatgagacctgggtttatgc
agatgtgagtgaggagagcagtgtgggaagggaggctcacgaagggaggg
gaagctgccactctccagtgtgttcagtggctgatatgagatgagactaa
tcccctccctatccaatcatcagcccaaaactttccaatctactttatcc
catcattcagcacagagatgctggtggtcagtgacagcatcatcagggac
atttctgtgctgtcctttttctgttacatcctctgggagggctcaatatg
tctcccacactttcctccttcactgagtgctccattttcttctccaacag
ctctactgccagaacatgaattcaggtaacctggtgtctgtgctcaccca
ggctgagggtgcctttgtggcttcgctgattaaagagagtggcaccaagg
atagcaatgtctggattggcctccatgacccccaccggatcagtctgctg
catcttctacctcctgattatcaggttccagagggtctgatgtctggcac
ctcaagcatcagtttttactatattatgataaaagcaacctctctataaa
tcatataatgtaaaggatatcaaggttctccataggttcttcgagataag
cttaaagctgaatttcctgtgtgtttcaggcattcacagataaactcatt
ctctgtacttctagggtagcatctttatgtatctattatgtacctcttat
ctattgtgttatcatctctgttatagaagagccttctgtagaccatatag
aaaaagattatagaggaggagaatctactgctggcaattgggaaccgcaa
ggtatactaaataatatatcaacaactaatggccatctaatgctatgctg
gatatgaacttttggggcctcaggaaagaaaaaccaggaactagtttcaa
taatgaggtgtcatggttccctgtggcaaatttagaacgcttatcgtttg
gcaggacacagagaggtaggtgaacattccaggaaagaagcagcttagag
aaaatgtggaggaaataatatgacacttagagaaaaaggaaggtttattc
ttgtcttatgtcttgacctgtttctgagtgcgaacacaaaccaggtgttt
ctgtctctttctgagtcacgtctgcccctgttctggcccttccccatcta
gaactgccattatcagtggagtagtgggtccctggtctcctacaaatcct
gggacattggatccccaagctgtgccaatactgcctactgtgctagcctg
acttcaagctcaggtgaggggcacagaatccacacacttattgccatcct
ctcctatttatctctgaggatcgaccggggactgggatagaggaagggtg
agctcctcattcaggaaatagaggagtgtttcctctttatttttgctgag
tcctgcagccaggagggtaatacactctgatcccctcagtctgaatcttc
tcattgtcttataggattcaagaaatggaaggatgattcttgtaaggaga
agttctcctttgtttgcaagttcaaatactggaggcaattgtaaaatgga
cgtctagaattggtctaccagttactatggagtaaaagaattaaactgga
ccatctctctccatatcaatctggaccatctctcctctgctaaatttgca
tgactgatctttagtatctttacctacctcaatttctggagccctaaaca
ataaaaataaacatgtttcccccat
>NR_003569 1
ctgggacccacgacgacagaaggcgccgatggccgcgcctgctgagccct
gcgcggggcagggggtctggaaccagacagagcctgaacctgccgccacc
agcctgctgagcctgtgcttcctgagaacagcaggggtctgggtaccccc
catgtacctctgggtccttggtcccatctacctcctcttcatccaccacc
atggccggggctacctccggatgttccccactcttcaaagccaagatggt
gcttggattcgccctcatagtcctgtgtacctccagcgtggctgtcgctc
tttggaaaatccaacagggaacgcctgaggccccagaattcctcattcat
cctactgtgtggctcaccacgatgagcttcgcagtgttcctgattcacac
caagaggaaaaagggagtccagtcatctggagtgctgtttggttactggc
ttctctgctttgtcttgccagctaccaacgctgcccagcaggcctccgga
gcgggcttccagagcgaccctgtccgccacctgtccacctacctatgcct
gtctctggtggtggcacagtttgtgctgtcctgcctggcggatcaacccc
ccttcttccctgaagacccccagcagtctaacccctgtccagagactggg
gcagccttcccctccaaagccacgttctggtgggtttctggcctggtctg
gaggggatacaggaggccactgagaccaaaagacctctggtcgcttggga
gagaaaactcctcagaagaacttgtttcccggcttgaaaaggagtggatg
aggaaccgcagtgcagcccgggggcacaacaaggcaatagcatttaaaag
gaaaggcggcagtggcatggaggctccagagactgagcccttcctacggc
aagaagggagccagtggcgcccactgctgaaggccatctggcaggtgttc
cattctaccttcctcctggggaccctcagcctcgtcatcagtgatgtctt
caggttcactgtccccaagctgctcagccttttcctggagtttattggtg
atcccaagcctccagcctggaagggctacctcctcgccgtgctgatgttc
ctctcggcctgcctgcaaacgctgtttgagcagcagaacatgtacaggct
caaggtgctgtagatgaggctgcggtcggccatcactggcctggtgtaca
gaaaggcatccacagcatatctgaagaaatattcagaagttaactaatct
cagatgatttcagcaggagtaaagaagagaaacagactcagaaatgccat
tacaacagttaattatgtcaaatttatcaccctgattgatcacgcagcat
taacctcaagaacgccaagccaagtttttttgacaaatgtgagccaaggt
ttccgaaaaactagcagatatgactgtgacttacaaaatggaaaaagtaa
acgagaaacacaatttgatatgatttaataaaagatttgtttccaccact
tctcctgggaacctcagcacattttctttccactgacagttattatctct
acctttattgaacaaagacacccggaacacagctgctgaggatcagtaaa
gaaaatcattcttttattaataagactgttattagcaggaaaaaaaaatc
catgtttgggagtttgcactgaagttacaggccattttgaagaaatatgg
ctgactagtgccaacattatttcaggcaatttcatgatcaaatgtcttat
taggttgtttaaaatttttatagagattgtaaatcagaactattttctat
ttgccctaaatatttagatgctacagggaaagcagatcaaattaaagggt
actgtgcacatttttttactgggaactcccagggatataaatcatttcgc
ctgcagcatggaattcttcagtacacatgcttgtggaaacattccacgct
ccgccagcacgctcattaaagtgatgatttgggttgcaacaacagtgcca
agtacttcctgtgttcaactggggaccatgtggcaagacccaaagcttcc
ccagagatcctatgggaataagttttttgagccaccatattccattattt
cagcctaaaataacaccatgggacaagaatcagaagacagaggagcagac
aaatgtgtgtagacatgctggaaggaatctttctttttagaaacagggtc
aatatctattaaactttaagatgtgtatctcttgacctggcagtttctgt
atttgagttttaacctactgatatacccatgcatgtgaataaagtatctt
cctgcatgtaacaggatatttaatgtaaccttgattatagttgcaaatgc
tgggaaacgatccaaatgtctttcaatatggcactgattaaataaattat
ggcacagtctcacaatgaaaaacaaatgtagccattaaacagaatgaaat
gggtctagctaaattgaaataggactacctctaagatatgttgttaaaaa
gaaaaaaaagaaagtgcagaggaacaagtatgataccattttgtattttt
taacatatgcaagcgtgattgtgcccacacagaatacctttgaaaataaa
ctcagtatttgcctcagtggataaaaacaagaaccagccttattttcact
gttatatcttttggtgccactttttgaactttttaccatatgtgcatatg
taactttctaaataaattttgtaaaaaaaaaaaaaaaaaa
>NR_002817 2
aactcggtctccactgcactgctggccagacgagggatgttattttgggc
agtgcatctggacttggttcaagtggcaccagccaaatccctgccttact
gacctctcccctggaggagcaggagcagtgctcaaggccgccctgggagg
gctgagaggcaggctctggactggggacacagggatagctgagccccagc
tgggggtggaagctgagccagggacagtcacagaggaacaagatcaagat
gcgctttaactgagaagcccccaaggcagaggctgagaatcagaagacat
ttcagcagacatctacaaatctgaaggacaaaacatggttcaagcatctg
ggcacaggcggtccacccgtggctccaaaatggtctcctggtccgtgata
gcaaagatccaggaaatatggtgcgaggaagatgagaggaagatggcgcg
agagttcctggccgagttcatgagcacatatgtcatgatggagtggctga
ccgggatgctccagctgtgtctcttcgccatcgtggaccaggagaacaac
ccagcactgccaggaacacacgcactggtgataggcatcctcgtggtcat
catcagggtgtaccatggcatgaacacaggatatgccatcaatccgtccc
gggacctgccccccccccccgcatcttcaccttcattgctggttggggca
aactggtcttcaggtactgcccctgcccaggcccattcctttgagatttt
ctgtggggcccctgtgtgttgaggtgtggggggtgatgtgaggggcagca
caggagggtcctgcagagcccccaggtggcctggggagcaggagtgagtc
ccaacatttccccaggccagtagagatacagatcctgcacctgcactgag
tgtcaaccctgtccctgagtcgggctgaggctgaccagggccccgggttg
ggggtgtttcctgggttagcctgaggatgactcctctgctcaaccagtct
tggcccgaggtggatgagggtgctgtcctgggcatcagccccctcagccg
gcctctgcctcttgcctgcagcgatggggagaacttgtggtgggtgccag
tggtggcaccacttctgggtgcctctctaggtggcatcatctacctggtc
ttcattggctccaccatcccacgggagcccctgaaattggaggactctgt
ggcatatgaagaccacgggataaccgtattgcccaagatgggatctcatg
aacccatgatctctccccttaccctcatctccgtgagccctgccaacaga
tcttcagtccaccctgccccacccttacatgaatccatggccctagagca
cttctaagcagagattatttgtgatcccatcccttccccaataaagagaa
gcttgtcccacagcagtacccccacttcctgggggcctcctgtggttggg
cttccctcctgggttcttccaggagctctagggctatgtcttagcccaag
gtgtagaggtgaggcacctcaagtctttcatgccctgggaactggggtgc
cccagggggagaatggggaagagctgacctgcgccctcagtaggaacaag
gtaagatgaaagaatgacagaaacagaatgagggattttcaggcaagggg
gaaggaagggcagttttggtgaaaggactgtagctgactggtggggggct
ggctttggaaatactttgaggggatcctgagactggactctagactctcc
cctggttgttcccttccccgagttctggccggttcttggaccagacaagg
catggcccaagaaggtagatcagaattttttagcctttttttcattagtg
ccttccctagtataattccagattttttttcttaatcacatgaaatttta
ataccacagatatactatacatctgtttatgttctgtatatgttctgtgc
tttatacgtaaaaaagagtaagattttttttcacctccccttttaagaat
cagttttaattcccttgagaatgcttgttatagattgaaggctggtaagg
ggttgggctcctctttcttcttcctggtgccagagtgctcccacatgaag
gaataggaaaggaagatgcaaagagggaaatccttcgaacacatgaagac
acaggaagaggcctcttagggctccaagggctccagggaagcagctgcag
aggttgggtggggtgaggggccaggatccactgaccctggggccaggcag
gaatcactctgttgcctggggctcagaaggcagtatcacccatggttcct
gtcattgctcatgtattttgcctttcaacaattattgtgcacctactgtg
tgcaggccctgcctggacactggggatgcgcagtggatgcactgggctct
gcctttgagggttgcagtttaatgggtgacaggtaattataaggaagaag
gtgagtgcagagtgggaggcttggaggctgtggggcttggggtgggggag
ctcacatccagcctctgggccaaggccaggaggcttcccagagcaggaga
cagagcagggtattgtggtggggggtgtcctttttggggctgggatctgc
actttacagtttgaggggatgggcagaggaggctgggcttcattctggag
gtggggacatggtgaggtgaggtttagaaagcacacctgagccgcagtgt
gtaggatgctggaaatggtggagatgggcctgcgaagagagtgctgggaa
gtgatgacccaggagcagcagccgggcacctaacaatgggtcagcaccgt
gggcgtggagacaaaggccgggattgatcaatacccgagaagtacaatgt
acaggacttgggctccatttggatggagtgggtgagggaggagtcagaaa
tggcttccgatttccagcttgggcctggggattggagatgtccccactga
gagtagggcacaagtgaggaaatggtttggagaggaagatgataagttac
atcatggatgtgctgagtctgagttgcctatgggacttggaatggggggt
ggcaaaaggtgtgtgatcttgagcaagatattcaactcttctgggccttg
gtcttctcatttgtaaaacggtgataagaatattacttcccatttgtgtt
gctgtgaatattaaatgcgctaccacatgt

Thank you for taking the time to go through my problem.

Any help and input would be deeply appreciated.

The following seems to work. While playing around, I put the data you posted in the __DATA__ section at the end of the script. To use it with a real data file, you'll need to open it and pass the file handle to run.

#!/usr/bin/env perl

use strict; use warnings;
use Data::Dumper;
use List::MoreUtils qw( first_index );

if (@ARGV) {
    my ($input_file) = @ARGV;
    open my $input, '<', $input_file
        or die "Cannot open '$input_file': $!";
    run($input);
    close $input
        or die "Cannot close '$input_file': $!";
}
else {
    run(\*DATA);
}

sub run {
    my ($fh) = @_;

    # These are your patterns. I changed $npat because I don't
    # think, e.g., q is a valid character in your input.
    my $gpat = '[g]{3,5}';
    my $npat = '[acgt]{1,25}';
    my $wanted = qr/$gpat$npat$gpat$npat$gpat$npat$gpat/;

    # These just tell us where a sequence begins and ends.
    my $start = qr/\A>([A-Za-z_0-9]+)/;
    my $stop = qr/[^acgt]/;

    # Set up the bins and labels for the histogram.
    my @bins = map 100 * $_, 1 .. 10;
    my @labels = map sprintf('%d - %d', $_ - 100, $_), @bins;

    # Initialize the histogram with all zero counts.
    my %hist = map { $_ => 0 } @labels;

    my $id;
    while (my $line = <$fh>) {
        # Whenever you see a new sequence, read it completely
        # and pass it to build_histogram.
        if (($id) = ($line =~ $start)) {
            print "Start sequence: '$id':\n";
            my $seq_ref;
            ($line, $seq_ref) = read_sequence($fh, $stop);

            my $hist = build_histogram(
                $seq_ref,
                $wanted,
                \@bins,
                \@labels,
            );

            # Add the counts from this sequence to the overall
            # histogram.

            for my $key ( keys %$hist ) {
                $hist{ $key } += $hist->{$key};
            }

            # exit loop if read_sequence stopped because of EOF.
            last unless defined $line;

            # else see if the line that stopped input is the start
            # of a new sequence.
            redo;
        }
    }

    print Dumper \%hist;
}

sub build_histogram {
    my ($seq_ref, $wanted, $bins, $labels) = @_;

    my %hist;

    while ($$seq_ref =~ /$wanted/g) {
        # Whenever we find segment which matches what we want,
        # store the position,
        my $pos = $-[0];

        # and find the bin where it fits.
        my $idx = first_index { $_ > $pos } @$bins;

        # if you do not have List::MoreUtils, you should install it
        # however, the grep can be used instead of first_index
        # my ($idx) = grep { $bins->[$_] > $pos } 0 .. $#$bins;
        # $idx = -1 unless defined $idx;

        # if it did not fit in the bins, then the position must
        # be greater than the upper limit of the last bin, put
        # it in "> than upper limit of last bin".
        my $key = ($idx == -1 ? "> $bins->[-1]" : $labels->[$idx]);
        $hist{ $key } += 1;
    }

    # we're done matching, return the histogram for this sequence
    return \%hist;
}

sub read_sequence {
    my ($fh, $stop) = @_;

    my ($line, $seq) = (undef, '');    # start with an empty sequence, not undef

    while ($line = <$fh>) {
        $line =~ s/\s+\z//;
        last if $line =~ $stop;
        $seq .= $line;
    }

    return ($line, \$seq);
}

__DATA__

-- Either paste your data here, or pass the name
-- of your input file on the command line

Output:

Start sequence: 'NR_037701':
Start sequence: 'NR_002714':
Start sequence: 'NR_003569':
Start sequence: 'NR_002817':
$VAR1 = {
          '700 - 800' => 0,
          '> 1000' => 10,
          '200 - 300' => 1,
          '900 - 1000' => 1,
          '800 - 900' => 1,
          '500 - 600' => 0,
          '0 - 100' => 0,
          '100 - 200' => 1,
          '300 - 400' => 0,
          '400 - 500' => 0,
          '600 - 700' => 0
        };

Also, you should take Chris Charley's advice and use Bio::SeqIO to read sequences rather than my homebrewed read_sequence function. I was just too lazy to install BioPerl for the sole purpose of answering this question.

This is pretty much the same as your previous problem, except that the intervals are independent of the length of the sequence and so can be defined just once instead of being changed for every sequence.

This program is a modification of my previous solution. As I described, it starts with a fixed set of values in @offsets from 100 to 1000 in steps of 100, and the final range > 1000 is terminated at 2E9, or 2 billion. This is close to the maximum positive 32-bit integer and serves to catch all offsets above 1000. I assume you won't be dealing with sequences any bigger than this?

The @totals and @counts arrays are initialised to zeroes with the same number of elements as the @offsets array.

Otherwise the functionality is much as before.

use strict;
use warnings;

use List::MoreUtils 'firstval';

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;

open my $fh, '<', 'small.fa' or die $!;

my @offsets = map $_*100, 1 .. 10;
push @offsets, 2E9;
my @totals = (0) x @offsets;

my ($id, $seq);

while (<$fh>) {

  chomp;

  if (/^>(\w+)/) {
    process_seq($seq) if $id;
    $id = $1;
    $seq = '';
    print "$id\n";
  }
  elsif ($id) {
    $seq .= $_;
    process_seq($seq) if eof;
  }
}

print "Total: @totals\n";



sub process_seq {

  my $sequence = shift;

  my @counts = (0) x @offsets;

  while ($sequence =~ /$regex/g) {
    my $place = $-[0];
    my $i = firstval { $place < $offsets[$_] } keys @offsets;
    $counts[$i]++;
  }

  print "Counts: @counts\n\n";
  $totals[$_] += $counts[$_] for keys @totals;
}

Output

Running this program against your new data file small.fa produces

Total: 1 1 0 0 0 0 0 1 0 1 10

But using the data from the previous question, sample.fa, the result is much more interesting

Total: 5 4 1 0 0 2 2 1 0 0 1
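To turn a Total line like these into something that looks like a histogram, a minimal sketch along the following lines could be used. The helpers `bin_labels` and `histogram_rows` are hypothetical names introduced for the example; the totals are the small.fa values shown above, with the last element being the catch-all bin.

```perl
use strict;
use warnings;

# Totals copied from the small.fa output above: ten bins of 100
# offsets each, plus a final bin for everything beyond 1000.
my @totals = (1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 10);

# Hypothetical helper: build one label per bin, e.g. '0-100',
# '100-200', ..., with '>1000' for the final catch-all bin.
sub bin_labels {
    my ($n_bins, $width) = @_;
    my @labels = map { sprintf '%d-%d', $_ * $width, ($_ + 1) * $width }
                 0 .. $n_bins - 2;
    push @labels, '>' . ($n_bins - 1) * $width;
    return @labels;
}

# Hypothetical helper: one text row per bin, one '#' per occurrence,
# followed by the numeric count in parentheses.
sub histogram_rows {
    my ($totals, $labels) = @_;
    return map {
        sprintf '%-9s %s (%d)', $labels->[$_], '#' x $totals->[$_], $totals->[$_]
    } 0 .. $#$totals;
}

my @labels = bin_labels(scalar @totals, 100);
print "$_\n" for histogram_rows(\@totals, \@labels);
```

For large counts the bars would need scaling, for example one '#' per ten occurrences; that is left out here to keep the sketch short.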

Generally, in Perl you can count the occurrences of a pattern as follows (note that this removes each match from $_ as it counts):

 $_ = $input;
 my $c = 0;
 $c++ while s/pattern//s;
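The substitution loop above consumes the string as it counts. If the input needs to stay intact, a non-destructive variant is to iterate with m//g instead. This is only a sketch; `count_matches` is a hypothetical helper name, and it counts non-overlapping matches.

```perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping matches of $regex in
# $string without modifying the string.
sub count_matches {
    my ($string, $regex) = @_;
    my $count = 0;
    # In scalar context, m//g resumes after the previous match on
    # each call and returns false once no further match exists.
    $count++ while $string =~ /$regex/g;
    return $count;
}

my $input = 'gggAgggAggg';
print count_matches($input, qr/g{3}/), "\n";    # three runs of 'ggg'
print "$input\n";                               # input is unchanged
```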

I was finally able to figure out where I was going wrong with my code. It turned out to be a looping problem. The following code works perfectly. I have marked in comments the places where I made the modifications.

#!/usr/bin/perl -w

use strict;
use warnings;

my $gpat    = '[G]{3,5}';
my $npat    = '[A-Z]{1,25}';
my $pattern = $gpat . $npat . $gpat . $npat . $gpat . $npat . $gpat;
my $regex   = qr/$pattern/i;

open OUT, ">Quadindividual.refMrna.fa" or die;
open my $fh, '<', 'refMrna.fa' or die $!;

my ( $id, $seq );    # can be written as my $id; my $seq;
my @totals = ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, );    # initialize the @totals array
my @thousandcounts = (0);

while (<$fh>) {

  chomp;

  if (/^>(\w+)/) {
    process_seq($seq) if $id;
    $id  = $1;
    $seq = '';
    print "$id\n";
    print OUT "$id\n";
  }
  elsif ($id) {
    $seq .= $_;
    process_seq($seq) if eof;
  }
}

print "Totals : @totals\n";
print OUT "Totals : @totals \n";

print "Thousand Counts total : @thousandcounts\n";
print OUT "Thousand Counts total : @thousandcounts\n";

sub process_seq {

  my $sequence = shift @_;

  my $subseq = substr $sequence, 0, 1000;
  my $length = length $subseq;
  print $length, "\n";

  my @offsets = map { sprintf '%.0f', $length * $_ / 10 } 1 .. 10;
  print "Offsets of 10 divisions: @offsets\n";

  my @counts = ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, );
  my @count = (0);

  # *MODIFICATION*
  # This if block previously started above, at the @offsets declaration
  if ( $length == 1000 ) {
    while ( $sequence =~ /$regex/g ) {
      my $place = $-[0];
      print $place, "\n\n";

      if ( $place <= 1000 ) {
        for my $i ( 0 .. 9 ) {
          next if $place >= $offsets[$i];
          $counts[$i]++;
          last;
        }

      }

      if ( $place >= 1000 ) {    # >= so that an offset of exactly 1000 is not dropped

        for my $i (0) {
          $count[$i]++;
          last;
        }
      }

    }    #*MODIFICATION*
         #The following commands were also subsequently shifted to ..
         #...properly compute the total

    print "Counts : @counts\n\n";

    $totals[$_] += $counts[$_] for 0 .. 9;

    print "Count : @count\n\n";

    $thousandcounts[$_] += $count[$_] for 0;
  }

  elsif ( $length != 1000 ) {

    my $substr = join ' ', unpack '(A100)*', $sequence;

    my @offsets =
        map { sprintf '%.0f', $length * $_ / ( $length / 100 ) } 1 .. 10;
    print "Offsets of 10 divisions: @offsets\n";

    my @counts = ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, );

    while ( $sequence =~ /$regex/g ) {
      my $place = $-[0];
      print "Place : $place", "\n\n";
      for my $i ( 0 .. 9 ) {
        next if $place >= $offsets[$i];
        $counts[$i]++;
        last;
      }
    }
    print "Counts : @counts\n\n";

    $totals[$_] += $counts[$_] for 0 .. 9;

  }

}    #subroutine ends
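The looping problem can be reproduced in isolation. The sketch below is a stripped-down illustration rather than the script itself: each 'x' in a made-up string stands in for one match of $pattern, and the two subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Buggy shape: the running count is added to the total inside the
# match loop, so the k-th match adds k instead of 1.
sub total_inside_loop {
    my ($sequence) = @_;
    my ($count, $total) = (0, 0);
    while ($sequence =~ /x/g) {
        $count++;
        $total += $count;    # runs once per match
    }
    return $total;
}

# Fixed shape: finish counting first, then add once per sequence.
sub total_after_loop {
    my ($sequence) = @_;
    my ($count, $total) = (0, 0);
    $count++ while $sequence =~ /x/g;
    $total += $count;        # runs once per sequence
    return $total;
}

# Two stand-in sequences with 5 and 3 matches respectively.
my $grand_buggy = total_inside_loop('xxxxx') + total_inside_loop('xxx');
my $grand_fixed = total_after_loop('xxxxx')  + total_after_loop('xxx');
print "inside loop: $grand_buggy\n";    # 15 + 6 = 21
print "after loop:  $grand_fixed\n";    # 5 + 3 = 8
```

This reproduces the 21 versus 8 from the question: moving the `+=` out of the while loop is the whole fix.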
