How can I quickly find the common items in two arrays?

Question

I am trying to find the lines that two tab-delimited files have in common, based on a single field. A line from the first file:

1       52854   s64199.1        A       .       .       .       PR      GT      0/0

A line from the second file:

chr1    52854     .       C       T       215.302 .       AB=0.692308;ABP=7.18621;AC=1;AF=0.5;AN=2;AO=9;CIGAR=1X;DP=13;DPB=13;DPRA=0;EPP=3.25157;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=17.5429;PAIRED=0;PAIREDR=0.25;PAO=0;PQA=0;PQR=0;PRO=0;QA=318;QR=138;RO=4;RPP=3.25157;RPPR=5.18177;RUN=1;SAF=0;SAP=22.5536;SAR=9;SRF=1;SRP=5.18177;SRR=3;TYPE=snp;technology.illumina=1;BVAR  GT:DP:RO:QR:AO:QA:GL    0/1:13:4:138:9:318:-5,0,-5

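In this example the two lines match on the second field (52854). Below is the code I use to find the common lines, but my files are very large and it takes a long time to run. Is there any way to speed this up? Thank you very much in advance.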

#!/app/languages/perl/5.14.2/bin/perl
use strict;
use warnings;
my $map_file = $ARGV[0];
my $vcf_file = $ARGV[1];
open my $map_info, $map_file or die "Could not open $map_file: $!";

my @map_array = ();
my @vcf_array = ();
while( my $mline = <$map_info>)  {
    chomp $mline;
    my @data1 = split('\t', $mline);
    my $pos1 = $data1[1];
    push (@map_array, $pos1);
}
open my $vcf_info, $vcf_file or die "Could not open $vcf_file: $!";
while( my $line = <$vcf_info>)  {
    if ($line !~ m/^#/) {
            push (@vcf_array, $line);
    }
}
foreach my $a (@map_array) {
    chomp $a;
    foreach my $b (@vcf_array) {
        chomp $b;
        my @data = split('\t', $b);
        my $pos2 = $data[1];
        my $ref2 = $data[3];
        my $allele = $data[4];
        my $genotype = $data[9];
        if ($a == $pos2) {
            print $pos2 . "\t" . $ref2. "\t".$allele."\t".$genotype. "\n";
            #print "$b\n";
        }
    }
}

Answer 1

Below is a minimal modification of your script that uses a hash-based lookup.

use strict;
use warnings;
my $map_file = $ARGV[0];
my $vcf_file = $ARGV[1];

my %vcf_hash;
open( my $vcf_info, '<', $vcf_file) or die "Could not open $vcf_file: $!";
while( my $line = <$vcf_info>)  {
    next if $line =~ m/^#/; # Skip comment lines
    chomp $line;
    my (@data) = split(/\t/, $line);
    die "Line $. of $vcf_file has fewer than 10 fields\n" unless @data >= 10; # Check number of fields in the input line
    my ($pos) = $data[1];
    # $. - line number in the file
    $vcf_hash{$pos}{$.} = \@data;
}

open( my $map_info, '<', $map_file) or die "Could not open $map_file: $!";
while( my $mline = <$map_info>)  {
    chomp $mline;
    my (@data) = split(/\t/, $mline);
    die "Line $. of $map_file has fewer than 2 fields\n" unless @data >= 2; # Check number of fields in the input line
    my ($pos) = $data[1];
    if( exists $vcf_hash{$pos}) {
      my $hash_ref = $vcf_hash{$pos};
      for my $n (sort{$a<=>$b} keys %$hash_ref) {
        my $array_ref = $hash_ref->{$n};
        my $pos2     = $array_ref->[1];
        my $ref2     = $array_ref->[3];
        my $allele   = $array_ref->[4];
        my $genotype = $array_ref->[9];
        print $pos2 . "\t" . $ref2. "\t".$allele."\t".$genotype. "\n";
      }
    }
}

If you are working with very large data files, the script can be improved further to reduce its memory usage.
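One possible way to cut memory, sketched here under assumptions rather than taken from the answer: keep only the four output fields per VCF line instead of the whole `@data` array, and use an array per position so the file order is preserved without storing `$.`. The helper name `index_vcf` and the sample row are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): index VCF lines by
# position, keeping only the output fields instead of every column.
sub index_vcf {
    my ($fh) = @_;
    my %by_pos;
    while ( my $line = <$fh> ) {
        next if $line =~ /^#/;            # skip header/comment lines
        chomp $line;
        my ( $pos, $ref, $allele, $genotype ) =
            ( split /\t/, $line )[ 1, 3, 4, 9 ];
        next unless defined $genotype;    # line has fewer than 10 fields
        # One array per position keeps duplicates in file order.
        push @{ $by_pos{$pos} }, join( "\t", $pos, $ref, $allele, $genotype );
    }
    return \%by_pos;
}

# Tiny in-memory demonstration with made-up data.
my $vcf_text = join '',
    "#CHROM\tPOS\n",
    join( "\t", 'chr1', 52854, '.', 'C', 'T', '215.302', '.', 'INFO', 'GT', '0/1' ) . "\n";
open my $vcf_fh, '<', \$vcf_text or die "Cannot open in-memory file: $!";
my $index = index_vcf($vcf_fh);
print "$_\n" for @{ $index->{52854} || [] };
```

The map file can then be read line by line as before, printing every stored string for a position that exists in the index. This still holds one short string per VCF line in memory, so it reduces the footprint but does not remove it.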

Answer 2

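Here is a version that should run much faster than your own.

It reads the map file and stores each pos field as a key in the hash %wanted. It then reads the second file and checks whether each record's position is among the wanted values. If it is, it splits the record and prints the fields you need.

Please note that, beyond making sure it compiles, I have not been able to test it.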

use strict;
use warnings;
use 5.010;
use autodie;

my ( $map_file, $vcf_file ) = @ARGV;

my %wanted;

{
    open my $map_fh, '<', $map_file;

    while ( <$map_fh> ) {
        chomp;
        my $pos = ( split /\t/, $_, 3 )[1];
        ++$wanted{$pos};
    }
}

{
    open my $vcf_fh, '<', $vcf_file;

    while ( <$vcf_fh> ) {

        next if /^#/;

        chomp;
        my $pos = ( split /\t/, $_, 3 )[1];
        next unless $wanted{$pos};

        my ( $ref, $allele, $genotype ) = ( split /\t/ )[3, 4, 9];
        print join("\t", $pos, $ref, $allele, $genotype), "\n";

    }
}
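Since the script above is stated to be untested, the same two-pass logic can be tried on a couple of rows without touching the real files. The following is only a sketch: `common_records` is a hypothetical wrapper, the rows are shortened versions of the sample lines in the question, and in-memory filehandles stand in for the two files.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the same two-pass logic, returning the output
# lines instead of printing them so the result is easy to inspect.
sub common_records {
    my ( $map_fh, $vcf_fh ) = @_;

    # Pass 1: remember every position seen in the map file.
    my %wanted;
    while ( my $line = <$map_fh> ) {
        chomp $line;
        my $pos = ( split /\t/, $line, 3 )[1];
        $wanted{$pos}++ if defined $pos;
    }

    # Pass 2: keep only VCF records whose position was seen.
    my @out;
    while ( my $line = <$vcf_fh> ) {
        next if $line =~ /^#/;
        chomp $line;
        my @f = split /\t/, $line;
        next unless defined $f[1] && $wanted{ $f[1] };
        push @out, join "\t", map { $_ // '' } @f[ 1, 3, 4, 9 ];
    }
    return @out;
}

# Shortened sample rows; the second VCF row should not match.
my $map_text = join( "\t", 1, 52854, 's64199.1', 'A', '.', '.', '.', 'PR', 'GT', '0/0' ) . "\n";
my $vcf_text = join '', map { join( "\t", @$_ ) . "\n" } (
    [ 'chr1', 52854, '.', 'C', 'T', '215.302', '.', 'TYPE=snp', 'GT:DP', '0/1:13' ],
    [ 'chr1', 60000, '.', 'G', 'A', '99.0',    '.', 'TYPE=snp', 'GT:DP', '0/0:20' ],
);
open my $map_fh, '<', \$map_text or die "Cannot open in-memory map: $!";
open my $vcf_fh, '<', \$vcf_text or die "Cannot open in-memory VCF: $!";
print "$_\n" for common_records( $map_fh, $vcf_fh );
```

Each file is read once, so the work grows with the sum of the two file sizes rather than with their product as in the nested loops of the question.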

Answer 3

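There is no need to keep the whole map_file in memory, only its keys. It is best to store them as hash keys and use those for the existence check. You do not need to keep the vcf_file in memory either: you can decide whether to output each line as you read it.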

#!/app/languages/perl/5.14.2/bin/perl
use strict;
use warnings;
use autodie;

use constant KEY => 1;
use constant FIELDS => ( 1, 3, 4, 9 );

my ( $map_file, $vcf_file ) = @ARGV;

my %map;
{
    my $fh;
    open $fh, '<', $map_file;

    while (<$fh>) {
        chomp;    # otherwise a two-field line would leave "\n" in the key
        $map{ ( split /\t/, $_, KEY + 2 )[KEY] } = undef;
    }
}

{
    my $fh;
    open $fh, '<', $vcf_file;
    while (<$fh>) {
        next if /^#/;
        chomp;
        my @data = split /\t/;
        # The line was chomped, so the newline has to be added back on output
        print join( "\t", @data[FIELDS] ), "\n" if exists $map{ $data[KEY] };
    }
}
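A short illustration of why this approach has to use `exists`: the keys are stored with `undef` as their value, so a plain truth test on the value would never succeed. The hash names below are made up for the example.

```perl
use strict;
use warnings;

my %seen_undef = ( 52854 => undef );    # key only, value left undefined
my %seen_count = ( 52854 => 1 );        # key with a count as its value

print "exists finds the key\n"         if exists $seen_undef{52854};
print "truth test misses the key\n"    unless $seen_undef{52854};
print "truth test works with counts\n" if $seen_count{52854};
```

Storing `undef` keeps the hash values as small as possible, while storing a count allows the shorter `next unless $wanted{$pos}` style of test. Either choice works as long as the test matches what was stored.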
