
How to make a difference between two data files more efficient (run time)

I have code that compares values between two files on some specific terms. The main time-consuming part of the code is as follows:

use List::Util qw(max);   # provides max() used below

my @ENTIRE_FILE;
my %NETS;
my %COORDINATES;
my %MATCHED_RESISTORS;
my %IR_VALUES;
my $INT = 1;
################################# READING
foreach my $IR_REPORT_FILE_1(@IR_REPORT_FILES){
   {
      open (FHIN, "<", $IR_REPORT_FILE_1) or die("Could not open $! for reading\n");
      # chomp(my @ENTIRE_FILE = <FHIN>);                            # READS THE ENTIRE FILE
      local $/;                                                 # slurp mode: read the whole file at once
      @ENTIRE_FILE = split(/\n(.*NET.*)/,<FHIN>);
      close (FHIN);
   }
   ############################### BUILDING HASH
   for my $i(1..$#ENTIRE_FILE/2){
     if($ENTIRE_FILE[$i*2-1]=~ /^----.*\s+"(\w+)"\s+/){
       my $net =$1;
       my @ir_values_of_net = split(/\n/,$ENTIRE_FILE[$i*2]);
       for my $val (@ir_values_of_net){
         # group the alternation so "^r.*\s+" and the trailing "\s+" apply to every layer name
         push ( @{ $NETS{$INT}{$net} }, $val ) if ($val =~ /^r.*\s+(?:m1|v1_viadg|v1_viabar|m2|ay_viabar|ay_viadg|c1)\s+/); # NETS{1}{VDD}=array of values, NETS{1}{VSS}, NETS{1}{AVDD}
       }
     }
   }
   $INT++;                                          # For the next file:  NETS{2}{VDD}, NETS{2}{VSS}, NETS{2}{AVDD}
}
############################### COMPARISON
my $loop_count=0;
foreach my $net(keys %{ $NETS{1} }){
   print "net is $net\n";
   foreach my $file_1_net( @{ $NETS{1}{$net} }){
     my @sub_str_1 = split (' ', $file_1_net);
     foreach my $file_2_net ( @{ $NETS{2}{$net} } ){
       $loop_count++;
#        my @sub_str_1 = split (' ', $file_1_net);
       my @sub_str_2 = split (' ', $file_2_net);
       if(($sub_str_1[2] eq $sub_str_2[2])&&(($sub_str_1[3].$sub_str_1[4].$sub_str_1[5].$sub_str_1[6] eq $sub_str_2[3].$sub_str_2[4].$sub_str_2[5].$sub_str_2[6]) || ($sub_str_1[3].$sub_str_1[4].$sub_str_1[5].$sub_str_1[6] eq $sub_str_2[5].$sub_str_2[6].$sub_str_2[3].$sub_str_2[4]))){
         push (@{ $COORDINATES{$net}{X} },$sub_str_1[3],$sub_str_1[5]) if ($sub_str_1[3] && $sub_str_1[5]);
         push (@{ $COORDINATES{$net}{Y} },$sub_str_1[4],$sub_str_1[6]) if ($sub_str_1[4] && $sub_str_1[6]);
         my $difference=$sub_str_1[1]-$sub_str_2[1];
         if($sub_str_1[3]=~/^-/){
           push (@{ $MATCHED_RESISTORS{$net}{$sub_str_1[2].$sub_str_1[3].$sub_str_1[4].$sub_str_1[5].$sub_str_1[6]} }, $file_1_net,$file_2_net,$difference);
         }else{
           push (@{ $MATCHED_RESISTORS{$net}{$sub_str_1[2]."-".$sub_str_1[3].$sub_str_1[4].$sub_str_1[5].$sub_str_1[6]} }, $file_1_net,$file_2_net,$difference);
         }
         push (@{ $IR_VALUES{$net} }, $sub_str_2[1]);
         last;
       }
     }
   } 
   print max @{ $IR_VALUES{$net} };
   print "\nloop count is $loop_count\n";
   $loop_count = 0;
#    <>;  
}

I ran a profiler on the code. Below is its output for the part shown above:

[Snapshot of the profiler's HTML report]

Some statistics:

  1. For my testcase, the outer-most foreach has 3 elements. The number of matched elements per iteration is: element_1: 14, element_2: 316, element_3: 8.
  2. The file sizes are 8.3 MB and 518.3 KB.
  3. The run time for the entire code is 220 s.
  4. My main concern: when each file is 8.3 MB and there are more matches between the two files, the run time is enormous, e.g. 3 hours.

My question is really simple: how do I make my code run faster?

Sample Data File_1:

r6_2389         1.29029e-05     ay_viabar       23.076   57.755   22.628   57.755   4.5      0        0        3.68449e-06      -5.99170336965613
r6_2397         1.29029e-05     ay_viabar       22.948   57.755   22.628   57.755   4.5      0        0        3.68449e-06      -5.99170336965613
r6_2400         1.29029e-05     ay_viabar       22.82    57.755   22.628   57.755   4.5      0        0        3.68449e-06      -5.99170336965613
r6_2403         1.29029e-05     ay_viabar       22.692   57.755   22.628   57.755   4.5      0        0        3.68449e-06      -5.99170336965613
r6_971          1.3279e-05      c1              9.492    45.742   -0.011   46.779   0.001    9.5589   10       0.0508653

Sample Data File_2:

r6_9261         0.00206167      ay_viabar       23.076   57.755   22.628   57.755   4.5      0        0        0.0207546    
r6_9258         0.00206167      ay_viabar       22.948   57.755   22.628   57.755   4.5      0        0        0.0161057    
r6_9399         0.00206167      ay_viabar       22.82    57.755   22.628   57.755   4.5      0        0        0.0127128    
r6_9486         0.00206167      ay_viabar       22.692   57.755   22.628   57.755   4.5      0        0        0.0103186    
r6_1061         1.3279e-05      cb_pc_viadg     -6.696   44.157   -0.159   44.847   0.001    0        0        0   

Sample Output:

r6_9261         0.00206167      ay_viabar       23.076   57.755   22.628   57.755   4.5      0        0        0.0207546
r6_9258         0.00206167      ay_viabar       22.948   57.755   22.628   57.755   4.5      0        0        0.0161057
r6_9399         0.00206167      ay_viabar       22.82    57.755   22.628   57.755   4.5      0        0        0.0127128
r6_9486         0.00206167      ay_viabar       22.692   57.755   22.628   57.755   4.5      0        0        0.0103186

The sample output is pushed into another hash, which is then processed further. But building up this hash consumes about 90% of the total run time according to the profiler.

OK, so my first thought is: you have a three-deep loop, and that will always be inefficient. We can probably trade memory for a lot of speed there.

Assuming the 'bigger' file is 'sample_1'; otherwise swap them.

In this example, sample_2 consumes memory proportional to its number of rows, so we ideally want that to be the smaller file. You may need to swap the match/test around, depending on whether file1 columns 5,6,3,4 match file2 or vice versa.
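As a small helper for that swap, you can let the on-disk size decide which file gets loaded into the hash. A minimal sketch (the `order_by_size` name is my own; `-s` is the standard file-size test, in bytes):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the two filenames ordered (smaller, larger) by on-disk size,
# so the smaller file is the one read into the lookup hash.
sub order_by_size {
    my ( $file_a, $file_b ) = @_;
    return ( -s $file_a ) <= ( -s $file_b )
        ? ( $file_a, $file_b )
        : ( $file_b, $file_a );
}
```

Used as, e.g., `my ( $hash_file, $scan_file ) = order_by_size( 'sample1.txt', 'sample2.txt' );`, so the code below never has to care which input happens to be bigger.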

But hopefully this illustrates a useful concept for solving your problem, if not entirely solving it.

Something like this will do the trick:

#!/usr/bin/env perl

use strict;
use warnings;

my %is_match; 

open ( my $sample_1, '<', 'sample1.txt' ) or die $!;
open ( my $sample_2, '<', 'sample2.txt' ) or die $!;

# First of all, columns 2, 3, 4, 5, 6 should match between the two files,
# and then both matching lines from the two files are printed out.
# Columns 3,4,5,6 from one file may also match columns 5,6,3,4 of the other.

while ( <$sample_2> ) { 
   my @row = split; 
   #insert into hash
   #this would be much clearer if the fields were named rather than numbered
   #I think. 
   $is_match{$row[3]}{$row[4]}{$row[5]}{$row[6]}++; 
   $is_match{$row[5]}{$row[6]}{$row[3]}{$row[4]}++; 
}

while ( <$sample_1> ) {
   my @row = split;
   #print the current line if it matches from the hash above.
   print if $is_match{$row[3]}{$row[4]}{$row[5]}{$row[6]};
}

Because this iterates over each file only once, it should be a lot faster. And because one of your files is small, that is the one you should read into memory first.

With your sample data as provided, this gives the desired output.

The first loop reads through the file, selects your fields of interest, and inserts them into a hash based on your 4 keys.

And then it does so again for the other valid ordering of the matching keys.

The second loop reads the other file, selects the keys, and just checks whether either combination exists in the hash, printing the current line if it does.
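If you also need the matching line from the other file, e.g. to compute the column-1 difference the original code stores per match, the same single-pass idea still works: store the line itself under the joined key instead of just a counter. A minimal, self-contained sketch, with a few rows from the question's sample files inlined as strings so it runs standalone (in the real script they would come from the two report files), and with column 2 folded into the key since the question requires it to match too:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Rows trimmed from the question's sample files (stand-ins for the real files).
my $file_2_text = <<'END';
r6_9261  0.00206167  ay_viabar    23.076  57.755  22.628  57.755  4.5    0       0   0.0207546
r6_1061  1.3279e-05  cb_pc_viadg  -6.696  44.157  -0.159  44.847  0.001  0       0   0
END

my $file_1_text = <<'END';
r6_2389  1.29029e-05 ay_viabar    23.076  57.755  22.628  57.755  4.5    0       0   3.68449e-06
r6_971   1.3279e-05  c1           9.492   45.742  -0.011  46.779  0.001  9.5589  10  0.0508653
END

my %line_for;   # lookup: column 2 + coordinates => matching line of file 2

open my $fh2, '<', \$file_2_text or die $!;
while ( my $line = <$fh2> ) {
    my @row = split ' ', $line;
    # Key on column 2 plus both accepted orderings of columns 3..6,
    # mirroring the match condition in the question's inner loop.
    $line_for{ join "\t", @row[ 2, 3, 4, 5, 6 ] } = $line;
    $line_for{ join "\t", @row[ 2, 5, 6, 3, 4 ] } = $line;
}

my @matches;
open my $fh1, '<', \$file_1_text or die $!;
while ( my $line = <$fh1> ) {
    my @row = split ' ', $line;
    my $other_line = $line_for{ join "\t", @row[ 2, 3, 4, 5, 6 ] };
    next unless defined $other_line;
    my @other = split ' ', $other_line;
    # The same column-1 difference the original code computes per match.
    push @matches, [ $row[0], $other[0], $row[1] - $other[1] ];
}

printf "%s matches %s (difference %.6g)\n", @$_ for @matches;
```

Joining the key fields with "\t" keeps adjacent columns from running together. Note this keeps only the last file-2 line per key, which roughly mirrors the `last` in the original inner loop; if several file-2 rows can legitimately share a key, store an array ref of lines instead.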
