
Perl: Compare hash keys across multiple hashes

I don't know if I'm approaching this problem well. I have a file with identifications (IDs), and then 10 files that each contain some of those IDs together with a database name (the same for every ID within a file, but different between files). What I'm trying to do is match all IDs of those 10 files against the file with only IDs, except when an ID has already been matched.

Those 10 files are something like this:

File 1:
Id   Data Data Data Database_name 
Id1  ...  ...  ...    GenBank
...
Id20 ...  ...   ...   GenBank

File 2: 
Id   Data  Data Data Database_name
Id2  ...   ...  ...     IMG
Id30 ...   ...  ...     IMG
...

For each file I put these two values (Id and Database_name) into a double-keyed hash, using this code:

if ( -e "result_GenBank" ){
    print "Yes, it exists!!!!\n";
    open FILE, '<', "result_GenBank" or die "Error Importing GenBank";
    while (my $line = <FILE>){
        chomp ($line);
        my ($ClustId, $M5, $Identity, $Evalue, $Bit_score, $Id, $Protein, $Specie, $DB) = split /\t/, $line;

        $GenBank{$ClustId}{$DB} = 1;
    }
    close FILE;
}

if ( -e "result_KEEG" ){
    print "Yes, it exists!!!!\n";
    open FILE, '<', "result_KEEG" or die "Error Importing KEEG";
    while (my $line = <FILE>){
        chomp ($line);
        my ($ClustId, $M5, $Identity, $Evalue, $Bit_score, $Id, $Protein, $Specie, $DB) = split /\t/, $line;

        $KEEG{$ClustId}{$DB} = 1;
    }
    close FILE;
}

For the file with only the Ids, I also put it in a hash:

open FILE, '<', "Ids" or die "No Input";
while (my $line = <FILE>){
    chomp ($line);
    my $key = $line;
    $total_ID{$key} = 1;
}
close FILE;

Now I need a loop to compare each double-keyed hash (Id and DB_name) with the hash that has only one key (Id). If the Id matches, print the Id and DB_name, except if the Id has been matched previously, in order to avoid ending up with the same Id under two different DB_names.

First, you state that you want to deduplicate the ID–DB pairs, so that each ID is associated with only one DB. Therefore we can take a shortcut and do

$GenBank{$ClustId} = $DB;

while building the hashes.

Secondly, the %GenBank and %KEEG hashes are essentially parts of the same data structure. The naming of these variables suggests that you actually wanted them to be entries in a larger hash. That also lets us remove the awful code duplication:

use feature 'say';
use autodie;

my @files = qw/GenBank KEEG/; # the physical files have a "result_" prefix

my %tables;
for my $file (grep { -e "result_$_" } @files ) {
    say STDERR "The $file file was found";
    open my $fh, '<', "result_$file";

    while (<$fh>){
        chomp;
        my($ClustId, $M5, $Identity, $Evalue, $Bit_score, $Id, $Protein, $Specie, $DB ) = split /\t/; 
        $tables{$file}{$ClustId} = $DB;
    }
}

But wait: if we want to unify the IDs later on, we can just save them in the same hash! Also, the current code lets the last DB entry for a given ID win; we want to change that so that the first entry is remembered. This is easy with the defined-or operator //, available since Perl v5.10.

my %DB_by_ID;
for my $file (grep { -e "result_$_" } qw/GenBank KEEG/ ) {
    ...;
    while (<$fh>){
        ...;
        $DB_by_ID{$ClustId} //= $DB;
    }
}
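If it helps to see the first-entry-wins behaviour of //= in isolation, here is a minimal, self-contained sketch; the IDs and database names are invented for illustration:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical rows standing in for lines from two result files:
# the same ID appears with two different databases.
my @rows = (
    [ 'Id1', 'GenBank' ],
    [ 'Id2', 'GenBank' ],
    [ 'Id1', 'KEEG'    ],   # duplicate ID; must NOT overwrite GenBank
);

my %DB_by_ID;
for my $row (@rows) {
    my ($id, $db) = @$row;
    $DB_by_ID{$id} //= $db;   # assign only if no value is stored yet
}

say "$_ => $DB_by_ID{$_}" for sort keys %DB_by_ID;
# Id1 => GenBank   (the first entry was kept)
# Id2 => GenBank
```

With plain assignment (`$DB_by_ID{$id} = $db`) the last row would win and Id1 would end up as KEEG instead.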

My third point is that your ID file represents an array, not a hash. If you want to deduplicate the entries in the Ids file, it is generally best to use uniq from List::MoreUtils:

use List::MoreUtils 'uniq';

my @IDs;

open my $fh, "<", "Ids"; # no error handling necessary with autodie
while (<$fh>) {
  chomp;
  push @IDs, $_;
}

@IDs = uniq @IDs;

I must admit the above code looks terribly silly. This is why we'll use File::Slurp:

use List::MoreUtils 'uniq';
use File::Slurp;

my @IDs = uniq read_file('Ids', chomp => 1);
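As a quick illustration of what uniq gives you (List::MoreUtils and File::Slurp are CPAN distributions, not core modules, so they may need to be installed first): it keeps the first occurrence of each element and preserves the original order.

```perl
use strict;
use warnings;
use List::MoreUtils 'uniq';

# uniq keeps the first occurrence of each element, in order
my @ids  = qw/Id1 Id2 Id1 Id3 Id2/;
my @uniq = uniq @ids;
print "@uniq\n";   # prints "Id1 Id2 Id3"
```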

Now all that is left to do is iterate over the %DB_by_ID table with the IDs given in @IDs, and print out the result. This would look something like

for my $id (@IDs) {
  if (not exists $DB_by_ID{$id}) {
    warn "no entry for ID=$id";
    next;
  }
  say join "\t", $id, $DB_by_ID{$id};
}
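Putting it all together, here is a self-contained sketch of the whole pipeline. To keep it runnable without your actual data, it reads from in-memory strings via string filehandles instead of the real result_* files, and all rows and IDs are made up for illustration:

```perl
use strict;
use warnings;
use feature 'say';

# In-memory stand-ins for result_GenBank and result_KEEG
# (nine tab-separated columns, like the real files).
my %files = (
    GenBank => "Id1\tm5a\t90\t1e-50\t200\tgi1\tprotA\tspecX\tGenBank\n"
             . "Id3\tm5b\t80\t1e-40\t150\tgi3\tprotC\tspecY\tGenBank\n",
    KEEG    => "Id1\tm5a\t85\t1e-45\t180\tk1\tprotA\tspecX\tKEEG\n"
             . "Id2\tm5c\t70\t1e-30\t120\tk2\tprotB\tspecZ\tKEEG\n",
);

my %DB_by_ID;
for my $file (qw/GenBank KEEG/) {
    open my $fh, '<', \$files{$file} or die $!;   # filehandle on a string
    while (<$fh>) {
        chomp;
        my ($ClustId, $M5, $Identity, $Evalue, $Bit_score,
            $Id, $Protein, $Specie, $DB) = split /\t/;
        $DB_by_ID{$ClustId} //= $DB;   # first database wins
    }
}

my @IDs = qw/Id1 Id2 Id4/;   # pretend this came from the Ids file
for my $id (@IDs) {
    if (not exists $DB_by_ID{$id}) {
        warn "no entry for ID=$id\n";
        next;
    }
    say join "\t", $id, $DB_by_ID{$id};
}
# prints "Id1<TAB>GenBank" and "Id2<TAB>KEEG",
# and warns about the unmatched Id4
```

Note that Id1 reports GenBank even though it also appears in the KEEG data, because the GenBank file was processed first.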
