简体   繁体   English

Perl:将字符推送到数组的哈希值

[英]Perl: push character to hash of array

I am calculating log-odds scores of sequences and returning the motif (small section of sequence) that gives the maximum score. 我正在计算序列的log-odds分数并返回给出最大分数的motif(序列的小部分)。 I have code that calculates the maximum score for each sequence in my file, and I am having trouble storing the motif that gives that score. 我有代码计算我的文件中每个序列的最高分数,我在存储提供该分数的主题时遇到问题。 See my other post(s) for file format, general calculation of log-odds scores, etc Perl: Creating and manipulating hash of arrays for log-odds scores of DNA sequences . 请参阅我的其他文章,了解文件格式,日志记录分数的一般计算等.Perl:为DNA序列的对数分数创建和操作数组的散列 My code is as follows: 我的代码如下:

use strict;
use warnings;
use List::Util 'max';
use Data::Dumper; 

#USER SPECIFICATIONS
#User specifies motif width
my $width = 3;

#User enters the filename that contains the sequence data
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;

#Remove newline from file
chomp $filename1;

#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my %maxscore = ();
my %maxmot = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
    if($_ =~ /^>(.+)/)
    {
        $id = $1; #Stores 'Sequence 1' as the first $id, for example
    }
    else
    {
        $HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
        $id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number
        $maxmot{$id} = (); #Creates empty hash to push motifs to
        foreach $id (keys %HoA)
        {
            for my $len (0..(length($HoA{$id})-$width-1))
            {
                push @{ $loscore{$id} }, 0;
            }
        }
        push @{ $maxscore{$id} }, -30; #Creates a HoA with each id number to have a maxscore (initial score -30)
    }
}
close FILE;

foreach $id (keys %id2seq)
{
    print "$id2seq{$id}\n\n";
}
print "\n\n";



#EXPECTATION
#Create log-odds table of motifs
my %logodds;
$logodds{'A'}[0] = 0.1;
$logodds{'A'}[1] = 0.2;
$logodds{'A'}[2] = 0.3;
$logodds{'C'}[0] = 0.2;
$logodds{'C'}[1] = 0.5;
$logodds{'C'}[2] = 0.2;
$logodds{'G'}[0] = 0.3;
$logodds{'G'}[1] = 0.2;
$logodds{'G'}[2] = 0.4;
$logodds{'T'}[0] = 0.4;
$logodds{'T'}[1] = 0.1;
$logodds{'T'}[2] = 0.1;

#MAXIMIZATION
#Determine location for each sequence that maximally
#aligns to the motif pattern

foreach $id (keys %HoA)
{   
    for my $pos1 (0..length($HoA{$id})-$width-1)    #Look through all positions the motif can start at
    {
        for my $pos2 ($pos1..$pos1+($width-1))  #Define the positions within the motif (0 to width-1)
        {           
            for my $base (qw( A C G T))
            {
                if ($HoA{$id}[$pos2] eq $base)  #If the character matches a base:
                {
                    for my $pos3 (0..$width-1)  #Used for position reference in %logodds
                    {
                        #Calculate the log-odds score at each location
                        $loscore{$id}[$pos2] += $logodds{$base}[$pos3];

                        #Calculate the maximum log-odds score for each sequence

                        #Find the motif that gives the maximum score for each sequence
                        $maxscore{$id} = max( @{ $loscore{$id} });
                        if ($loscore{$id}[$pos2] == $maxscore{$id})
                        {
                            push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; #NOT SURE THAT THIS IS THE CORRECT WAY TO PUSH WHAT I WANT
                        }
                    }
                }
            }
        }
    }
}

print "\n\n";
print Dumper(\%maxmot);

The expected output for the %maxmot should be something like this: %maxmot的预期输出应该是这样的:

'Sequence 11' => [ 'C', 'A', 'T'], 'Sequence 14' => ['C', 'T', 'G'], etc.

There should be three bases in the expected output because the $width = 3 . 预期输出应该有三个基数,因为$width = 3 The output I get gives me multiples of each base, not in any noticeable order (or at least, I cannot notice a pattern): 我得到的输出给了我每个基数的倍数,而不是任何明显的顺序(或至少,我不能注意到一个模式):

'Sequence 11' => [ 'C', 'C', 'C', 'A', 'A', 'A', 'A', 'T', 'A', 'A', 'T', 'T', 'T'], 'Sequence 14' => ['C', 'C', 'T', 'T', 'C', 'C', 'T', 'T', 'T', 'T', 'T'], etc. I believe the issue is rooted in the push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; 'Sequence 11' => [ 'C', 'C', 'C', 'A', 'A', 'A', 'A', 'T', 'A', 'A', 'T', 'T', 'T'], 'Sequence 14' => ['C', 'C', 'T', 'T', 'C', 'C', 'T', 'T', 'T', 'T', 'T'], etc.我认为这个问题植根于push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; step, but I'm not quite sure how to fix it. 一步,但我不太确定如何解决它。 I have tried pushing $HoA{$id}[$pos2] instead, but that does not seem to fix my problem either. 我试过推动$HoA{$id}[$pos2] ,但这似乎也无法解决我的问题。 As always, any and all help is appreciated! 一如既往,任何和所有的帮助表示赞赏! I can clarify if needed, I know my question is a little convoluted. 如果需要,我可以澄清一下,我知道我的问题有点令人费解。 Thank you in advance. 先感谢您。

The push() is not the problem. push()不是问题。 From running your code it becomes obvious that the conditional $loscore{$id}[$pos2] == $maxscore{$id} is true more often than you expect it. 从运行的代码,可以很明显的是,条件$loscore{$id}[$pos2] == $maxscore{$id}true往往比你期望它。

Here are some questions I would ask in a code review: 以下是我在代码审查中会提出的一些问题:

  • why do you use length() on an array? 你为什么在数组上使用length() It does not return the length of the array. 它不返回数组的长度。
  • Isn't for my $base (qw( ACGT)) { if ($HoA{$id}[$pos2] eq $base) {... just an inefficient way of the equivalent my $base = $HoA{$id}[$pos2]; 不是for my $base (qw( ACGT)) { if ($HoA{$id}[$pos2] eq $base) {...只是一个低效的方式相当于my $base = $HoA{$id}[$pos2]; ?
  • the calculation for each $pos2 is executed $pos2 + 1 times, ie once for 0 , twice for 1 , ... ie later positions get a higher score. 每个$pos2的计算执行$pos2 + 1次,即一次为0 ,两次为1 ,...即后来的位置获得更高的分数。
  • one calculation for $loscore{$id}[$pos2] is the sum of @{ $logodds{$base} } , ie the base at position $pos2 + $pos3 is ignored for the calculation $loscore{$id}[$pos2]一个计算是@{ $logodds{$base} }的总和,即计算位置$pos2 + $pos3
  • you are re-calculating $maxscore{$id} while running over the sequences and then use the changing value in the conditional 在重复运行序列时重新计算$maxscore{$id} 然后在条件中使用更改值
  • (my guess) a motif is supposed to be $width bases long, but your code only stores single bases into %maxmot (我的猜测)一个主题应该是$width %maxmot long,但是你的代码只将单个基数存储到%maxmot

I'm making an educated guess and propose that the following is the correct algorithm. 我正在做一个有根据的猜测,并建议以下是正确的算法。 I'm using the 3 sequences you have given in your previous question. 我正在使用您在上一个问题中给出的3个序列。 I'm also dumping the other 2 hashes, so that the calculation results become visible. 我还倾倒了其他2个哈希值,以便计算结果可见。

I took the liberty to rewrite your code to be more concise and clear. 我冒昧地重写你的代码,使其更加简洁明了。 But you should be able to trace back the lines in the new code to your original code. 但是您应该能够将新代码中的行追溯到原始代码。

#!/usr/bin/perl
use strict;
use warnings;

use List::Util 'max';
use Data::Dumper;

my $width = 3;

my %HoA;
my %maxpos;
my %loscore;
my $id = '';
while (<DATA>) {
    if (/^>(.+)/) {
        $id = $1;
    } else {
        $HoA{$id}     = [ split(//) ];
        $maxpos{$id}  = @{ $HoA{$id} } - $width - 1;
        $loscore{$id} = [ (0) x ($maxpos{$id} + 1) ];
    }
}

my %logodds = (
    A => [0.1, 0.2, 0.3],
    C => [0.2, 0.5, 0.2],
    G => [0.3, 0.2, 0.4],
    T => [0.4, 0.1, 0.1],
);

#MAXIMIZATION
my %maxscore;
my %maxmot;

# Calculate the log-odds score at each location
foreach $id (keys %HoA) {
    for my $index (0..$maxpos{$id}) {
        for my $offset (0..$width-1) {
            # look at base in sequence $id at $offset after $index
            my $base = $HoA{$id}[$index + $offset];
            $loscore{$id}[$index] += $logodds{$base}[$offset];
        }
    }
}

# Calculate the maximum log-odds score for each sequence
foreach $id (keys %HoA) {
    $maxscore{$id} = max( @{ $loscore{$id} });
}

# Find the motif that gives the maximum score for each sequence
foreach $id (keys %HoA) {
    for my $index (0..$maxpos{$id}) {
        if ($loscore{$id}[$index] == $maxscore{$id}) {
            # motif of length $width
            my $motif = join('', @{ $HoA{$id} }[$index..$index + $width - 1]);
            $maxmot{$id}->{$motif}++;
        }
    }
}

print Data::Dumper->Dump([\%loscore, \%maxscore, \%maxmot],
                         [qw(*loscore *maxscore *maxmot)]);

exit 0;

__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

Test run: 测试运行:

$ perl dummy.pl
%loscore = (
             'Sequence_1' => [
                               '1.2',
                               '0.8',
                               '0.6',
                               '0.8',
                               '0.5',
                               '0.8',
                               '1',
                               '0.8',
                               '0.4',
                               '0.5',
                               '0.8',
                               '0.7',
                               '0.5',
                               '0.9',
                               '0.6',
                               '0.4',
                               '0.3',
                               '0.6',
                               '0.8',
                               '0.7',
                               '0.4',
                               '1.2',
                               '0.5',
                               '0.3',
                               '0.6',
                               '0.7',
                               '1.1',
                               '0.8',
                               '0.4',
                               '0.7',
                               '1',
                               '0.5',
                               '1.1',
                               '1',
                               '0.6',
                               '0.7',
                               '0.5',
                               '1.1',
                               '0.8'
                             ],
             'Sequence_2' => [
                               '0.9',
                               '1',
                               '0.6',
                               '1',
                               '0.6',
                               '1.1',
                               '0.8',
                               '0.5',
                               '1',
                               '1.1',
                               '0.6',
                               '1',
                               '0.9',
                               '0.8',
                               '0.5',
                               '1.1',
                               '0.8',
                               '0.5',
                               '1.1',
                               '0.9',
                               '0.9',
                               '1.1',
                               '0.8',
                               '0.6',
                               '0.6',
                               '1.2',
                               '0.6',
                               '0.7',
                               '0.7',
                               '0.9',
                               '0.7',
                               '0.7',
                               '0.7',
                               '1',
                               '0.6',
                               '0.6',
                               '1.1',
                               '0.8',
                               '0.7'
                             ],
             'Sequence_3' => [
                               '1.3',
                               '0.7',
                               '0.7',
                               '0.8',
                               '0.9',
                               '0.8',
                               '0.5',
                               '1',
                               '0.7',
                               '1',
                               '0.8',
                               '0.8',
                               '0.5',
                               '0.8',
                               '0.8',
                               '0.6',
                               '0.7',
                               '0.4',
                               '1.2',
                               '0.8',
                               '0.7',
                               '0.9',
                               '0.8',
                               '0.7',
                               '0.8',
                               '1',
                               '0.6',
                               '0.9',
                               '0.8',
                               '0.4',
                               '0.6',
                               '1.2',
                               '0.8',
                               '0.5',
                               '1',
                               '1',
                               '0.8',
                               '0.7',
                               '0.7',
                               '1.1',
                               '0.7',
                               '0.7'
                             ]
           );
%maxscore = (
              'Sequence_1' => '1.2',
              'Sequence_3' => '1.3',
              'Sequence_2' => '1.2'
            );
%maxmot = (
            'Sequence_3' => {
                              'TCG' => 1
                            },
            'Sequence_2' => {
                              'TCA' => 1
                            },
            'Sequence_1' => {
                              'TCA' => 2
                            }
          );

This looks much closer to your expected output. 这看起来更接近您的预期输出。 But of course I could be completely off with my guesses... 但当然,我的猜测可能会完全消失......


If I understand the logscore calculation correctly, then the score per motif is a constant and hence can be pre-calculated. 如果我正确理解了对数分数计算,那么每个主题的得分是一个常数,因此可以预先计算。 Which would lead to the following more straightforward approach: 这将导致以下更直接的方法:

#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;

my $width = 3;

my %logodds = (
    A => [0.1, 0.2, 0.3],
    C => [0.2, 0.5, 0.2],
    G => [0.3, 0.2, 0.4],
    T => [0.4, 0.1, 0.1],
);

# calculate log score for each motif combination
my $motif_score = {'' => 0}; # start with a 0-length motif
foreach my $offset (0..$width - 1) {
    my %scores;

    # for all motifs...
    foreach my $prefix (keys %{ $motif_score }) {
        my $base_score = $motif_score->{$prefix};

        # ... add another base to the motif
        for my $base (qw(A G C T)) {
            $scores{"${prefix}${base}"} = $base_score + $logodds{$base}[$offset];
        }
    }

    # store the scores for the new sequences
    $motif_score = \%scores;
}

#print Data::Dumper->Dump([$motif_score], [qw(motif_score)]);

my $id;
my %maxmot;
while (<DATA>) {
    if (/^>(.+)/) {
        $id = $1;
    } else {
        chomp(my $sequence = $_);
        my $max = -1;

        # run a window of length $width over the sequence
        foreach my $index (0..length($sequence) - $width - 1) {

            # extract the motif at $index from sequence
            my $motif = substr($sequence, $index, $width);
            my $score = $motif_score->{$motif};

            # update max score if the motif has a higher score
            if ($score > $max) {
                $max         = $score;
                $maxmot{$id} = $motif;
            }
        }
    }
}

print Data::Dumper->Dump([\%maxmot], [qw(*maxmot)]);

exit 0;

__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

Test run: 测试运行:

$ perl dummy.pl
%maxmot = (
            'Sequence_2' => 'TCA',
            'Sequence_3' => 'TCG',
            'Sequence_1' => 'TCA'
          );

As the logscore per motif is a constant, the motif list sorted by logscore order will also be a constant. 由于每个基序的logscore是常数,按logscore顺序排序的motif列表也将是常量。 Given that list we will only have to find the first motif that matches on a given sequence. 鉴于该列表,我们只需要找到与给定序列匹配的第一个主题。 Hence we can apply the highly optimized regular expression engine on the problem. 因此,我们可以在问题上应用高度优化的正则表达式引擎。 Depending on your actual problem size this will probably be the more efficient solution: 根据您的实际问题大小,这可能是更有效的解决方案:

#!/usr/bin/perl
use warnings;
use strict;

use List::Util qw(first pairs);
use Data::Dumper;

my $width = 3;

my %logodds = (
    A => [0.1, 0.2, 0.3],
    C => [0.2, 0.5, 0.2],
    G => [0.3, 0.2, 0.4],
    T => [0.4, 0.1, 0.1],
);
my @bases = keys %logodds;

# calculate log score for each motif combination
my $motif_logscore = {'' => 0}; # start with a 0-length motif
foreach my $offset (0..$width - 1) {
    my %score;

    # for all motifs...
    foreach my $prefix (keys %{ $motif_logscore }) {
        my $base_score = $motif_logscore->{$prefix};

        # ... add another base to the motif
        for my $base (@bases) {
            $score{"${prefix}${base}"} = $base_score + $logodds{$base}[$offset];
        }
    }

    # update hash ref to new motif scores
    $motif_logscore = \%score;
}

#print Data::Dumper->Dump([$motif_logscore], [qw(motif_logscore)]);

my @motifs_sorted =
    # list of [<motif>, <regular expression>] array refs
    map    { [$_->[0], qr/$_->[0]/] }
    # sort in descending numerical score order
    sort   { $b->[1] cmp $a->[1] }
    # list of [<motif>, <score>] array refs
    pairs %{ $motif_logscore };

#print Data::Dumper->Dump([\@motifs_sorted], [qw(*motifs_sorted)]);

my $id;
my %maxmot;
while (<DATA>) {
    if (/^>(.+)/) {
        $id = $1;
    } else {
        my $sequence = $_;
        # find the first pair where the regex matches -> store motif
        $maxmot{$id} = (
            first { ($sequence =~ $_->[1])[0] } @motifs_sorted
        )->[0];
    }
}

print Data::Dumper->Dump([\%maxmot], [qw(*maxmot)]);

exit 0;

__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

Maybe you don't need an array but a hash? 也许你不需要一个数组但是哈希?

Change the push to 将推送更改为

undef $maxmot{$id}{ $HoA{$id}[$pos3] };

It creates a hash (with undefined values, so only the keys are important). 它创建一个哈希值(具有未定义的值,因此只有键很重要)。 In the output, I don't see more than 3 keys for each sequence in the input from your linked question. 在输出中,我看不到链接问题输入中每个序列的3个以上的键。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM