How to make a comparison between two data files more efficient (run time)
I have code which compares values, on some specific terms, between two files. The main time-consuming part of the code is as follows:
my @ENTIRE_FILE;
my %NETS;
my %COORDINATES;
my %MATCHED_RESISTORS;
my %IR_VALUES;
my $INT = 1;
################################# READING
foreach my $IR_REPORT_FILE_1 (@IR_REPORT_FILES) {
    {
        open( FHIN, "<", $IR_REPORT_FILE_1 )
            or die("Could not open $IR_REPORT_FILE_1 for reading: $!\n");
        local $/;    # slurp mode: reads the entire file at once
        @ENTIRE_FILE = split( /\n(.*NET.*)/, <FHIN> );
        close(FHIN);
    }
    ############################### BUILDING HASH
    for my $i ( 1 .. $#ENTIRE_FILE / 2 ) {
        if ( $ENTIRE_FILE[ $i * 2 - 1 ] =~ /^----.*\s+"(\w+)"\s+/ ) {
            my $net = $1;
            my @ir_values_of_net = split( /\n/, $ENTIRE_FILE[ $i * 2 ] );
            for my $val (@ir_values_of_net) {
                # NETS{1}{VDD} = array of values, NETS{1}{VSS}, NETS{1}{AVDD}
                # (?:...) groups the alternation; without it, |m2| etc.
                # would match anywhere in the line, not just after ^r.*\s+
                push( @{ $NETS{$INT}{$net} }, $val )
                    if ( $val =~ /^r.*\s+(?:m1|v1_viadg|v1_viabar|m2|ay_viabar|ay_viadg|c1)\s+/ );
            }
        }
    }
    $INT++;    # for the next file: NETS{2}{VDD}, NETS{2}{VSS}, NETS{2}{AVDD}
}
############################### COMPARISON
my $loop_count = 0;
foreach my $net ( keys %{ $NETS{1} } ) {
    print "net is $net\n";
    foreach my $file_1_net ( @{ $NETS{1}{$net} } ) {
        my @sub_str_1 = split ' ', $file_1_net;
        foreach my $file_2_net ( @{ $NETS{2}{$net} } ) {
            $loop_count++;
            my @sub_str_2 = split ' ', $file_2_net;
            # columns 3,4,5,6 may match either in order or swapped (5,6,3,4)
            if (    $sub_str_1[2] eq $sub_str_2[2]
                && (   $sub_str_1[3] . $sub_str_1[4] . $sub_str_1[5] . $sub_str_1[6] eq $sub_str_2[3] . $sub_str_2[4] . $sub_str_2[5] . $sub_str_2[6]
                    || $sub_str_1[3] . $sub_str_1[4] . $sub_str_1[5] . $sub_str_1[6] eq $sub_str_2[5] . $sub_str_2[6] . $sub_str_2[3] . $sub_str_2[4] ) )
            {
                push( @{ $COORDINATES{$net}{X} }, $sub_str_1[3], $sub_str_1[5] ) if ( $sub_str_1[3] && $sub_str_1[5] );
                push( @{ $COORDINATES{$net}{Y} }, $sub_str_1[4], $sub_str_1[6] ) if ( $sub_str_1[4] && $sub_str_1[6] );
                my $difference = $sub_str_1[1] - $sub_str_2[1];
                if ( $sub_str_1[3] =~ /^-/ ) {
                    push( @{ $MATCHED_RESISTORS{$net}{ $sub_str_1[2] . $sub_str_1[3] . $sub_str_1[4] . $sub_str_1[5] . $sub_str_1[6] } }, $file_1_net, $file_2_net, $difference );
                }
                else {
                    push( @{ $MATCHED_RESISTORS{$net}{ $sub_str_1[2] . "-" . $sub_str_1[3] . $sub_str_1[4] . $sub_str_1[5] . $sub_str_1[6] } }, $file_1_net, $file_2_net, $difference );
                }
                push( @{ $IR_VALUES{$net} }, $sub_str_2[1] );
                last;
            }
        }
    }
    print max @{ $IR_VALUES{$net} };    # max() needs: use List::Util qw(max);
    print "\nloop count is $loop_count\n";
    $loop_count = 0;
}
I ran a profiler on the code. Below is its output for the part of the code above:
Some statistics:
element_1: 14 element_1: 316 element_1: 8
My question is really simple: how do I make my code run faster?
Sample Data File_1:
r6_2389 1.29029e-05 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0 3.68449e-06 -5.99170336965613
r6_2397 1.29029e-05 ay_viabar 22.948 57.755 22.628 57.755 4.5 0 0 3.68449e-06 -5.99170336965613
r6_2400 1.29029e-05 ay_viabar 22.82 57.755 22.628 57.755 4.5 0 0 3.68449e-06 -5.99170336965613
r6_2403 1.29029e-05 ay_viabar 22.692 57.755 22.628 57.755 4.5 0 0 3.68449e-06 -5.99170336965613
r6_971 1.3279e-05 c1 9.492 45.742 -0.011 46.779 0.001 9.5589 10 0.0508653
Sample Data File_2:
r6_9261 0.00206167 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0 0.0207546
r6_9258 0.00206167 ay_viabar 22.948 57.755 22.628 57.755 4.5 0 0 0.0161057
r6_9399 0.00206167 ay_viabar 22.82 57.755 22.628 57.755 4.5 0 0 0.0127128
r6_9486 0.00206167 ay_viabar 22.692 57.755 22.628 57.755 4.5 0 0 0.0103186
r6_1061 1.3279e-05 cb_pc_viadg -6.696 44.157 -0.159 44.847 0.001 0 0 0
Sample Output:
r6_9261 0.00206167 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0 0.0207546
r6_9258 0.00206167 ay_viabar 22.948 57.755 22.628 57.755 4.5 0 0 0.0161057
r6_9399 0.00206167 ay_viabar 22.82 57.755 22.628 57.755 4.5 0 0 0.0127128
r6_9486 0.00206167 ay_viabar 22.692 57.755 22.628 57.755 4.5 0 0 0.0103186
The sample output is basically pushed into another hash which is processed further. But according to the profiler, building this hash consumes about 90% of the total run time.
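One way to attack that 90% is to index each net's file-2 rows in a hash keyed on the compared columns, so the inner scan over `$NETS{2}{$net}` becomes a single lookup. A minimal sketch, assuming the whitespace-separated column layout of the sample files, with two hard-coded rows standing in for the parsed `%NETS` contents:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hard-coded stand-ins for the question's %NETS structure (one net, one
# row per file); the real code would populate these from the files.
my %NETS = (
    1 => { VDD => ['r6_2389 1.29029e-05 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0'] },
    2 => { VDD => ['r6_9261 0.00206167 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0'] },
);

my %difference;    # resistor name -> difference of column-1 values
for my $net ( keys %{ $NETS{1} } ) {
    # Index file 2 once per net, keyed on columns 2..6 in both accepted orders.
    my %lookup;
    for my $line ( @{ $NETS{2}{$net} } ) {
        my @f = split ' ', $line;
        $lookup{ join $;, @f[ 2, 3, 4, 5, 6 ] } = \@f;
        $lookup{ join $;, @f[ 2, 5, 6, 3, 4 ] } = \@f;
    }
    # Each file-1 row now costs one hash lookup instead of a scan of file 2.
    for my $line ( @{ $NETS{1}{$net} } ) {
        my @f = split ' ', $line;
        my $match = $lookup{ join $;, @f[ 2, 3, 4, 5, 6 ] } or next;
        $difference{ $f[0] } = $f[1] - $match->[1];
    }
}
printf "%s: %g\n", $_, $difference{$_} for sort keys %difference;
```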
OK, so my first thought is: you've got a three-deep loop, and that will always be inefficient. We can probably trade memory for a lot of speed there.
Assuming the 'bigger' file is 'sample_1'; otherwise swap them. In this example, sample_2 will consume memory proportional to its number of rows, so we ideally want that to be the smaller file. You may need to swap the match/test around, depending on whether file1 columns 5,6,3,4 match file2 or vice versa.
But hopefully this illustrates a useful concept for solving your problem, if not entirely solving it. Something like this will do the trick:
#!/usr/bin/env perl
use strict;
use warnings;

my %is_match;
open( my $sample_1, '<', 'sample1.txt' ) or die $!;
open( my $sample_2, '<', 'sample2.txt' ) or die $!;

# First of all, columns 2, 3, 4, 5, 6 should match between the two files,
# and then both matching lines are of interest. Columns 3,4,5,6 from one
# file may also match columns 5,6,3,4 of the other.
while (<$sample_2>) {
    my @row = split;
    # insert both valid key orders into the hash -- this would be much
    # clearer if the fields were named rather than numbered, I think
    $is_match{ $row[2] }{ $row[3] }{ $row[4] }{ $row[5] }{ $row[6] }++;
    $is_match{ $row[2] }{ $row[5] }{ $row[6] }{ $row[3] }{ $row[4] }++;
}

while (<$sample_1>) {
    my @row = split;
    # print the current line if its keys were seen in the hash above
    print if $is_match{ $row[2] }{ $row[3] }{ $row[4] }{ $row[5] }{ $row[6] };
}
Because this iterates each file only once, it should be a lot faster. And because one of your files is small, that's the one you should read into memory first.
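To put a rough number on that claim, here is a sketch using the core Benchmark module on synthetic rows in the sample files' layout (the row counts and values are made up purely for timing purposes):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic rows in the sample files' column layout; every row in @file1
# has exactly one coordinate match in @file2.
my ( @file1, @file2 );
for my $i ( 1 .. 200 ) {
    push @file1, "r6_$i 1e-05 ay_viabar $i.5 57.755 22.628 57.755";
    push @file2, "r6_$i 2e-03 ay_viabar $i.5 57.755 22.628 57.755";
}

sub nested {    # the original O(n*m) scan
    my $hits = 0;
    for my $row1 (@file1) {
        my @x = split ' ', $row1;
        for my $row2 (@file2) {
            my @y = split ' ', $row2;
            if ( "@x[3..6]" eq "@y[3..6]" ) { $hits++; last }
        }
    }
    return $hits;
}

sub hashed {    # the O(n+m) hash lookup
    my %seen;
    for my $row2 (@file2) {
        my @y = split ' ', $row2;
        $seen{"@y[3..6]"}++;
    }
    my $hits = 0;
    for my $row1 (@file1) {
        my @x = split ' ', $row1;
        $hits++ if $seen{"@x[3..6]"};
    }
    return $hits;
}

cmpthese( 10, { nested => \&nested, hashed => \&hashed } );
```

Both subs find the same matches; `cmpthese` prints how many times faster the single-pass version runs on this input.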
With your sample data as provided, this gives you the desired output.
The first loop reads through the file, selects the fields you are interested in, and inserts them into a hash based on your four keys. Then it does so again for the other set of valid matching keys.
The second loop reads the other file, selects the keys, and simply checks whether either combination exists in the hash, printing the current line if it does.
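If you eventually also need the column-1 difference (the `$difference` from your original code), the counter in the hash can be replaced by the stored row, still in a single pass per file. A sketch along the same lines, with in-memory stand-ins for the two files (the first row of each sample):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# In-memory stand-ins for the two sample files (first row of each).
my $data_2 = "r6_9261 0.00206167 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0 0.0207546\n";
my $data_1 = "r6_2389 1.29029e-05 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0 3.68449e-06\n";
open my $sample_2, '<', \$data_2 or die $!;
open my $sample_1, '<', \$data_1 or die $!;

my %row_for;    # "col2 + coordinates" key -> stored file-2 row
while (<$sample_2>) {
    my @row = split;
    # key both accepted coordinate orders (with column 2) to the same row
    $row_for{"@row[2..6]"}        = \@row;
    $row_for{"@row[2,5,6,3,4]"}   = \@row;
}

my %difference;    # resistor name -> column-1 difference
while (<$sample_1>) {
    my @row = split;
    my $other = $row_for{"@row[2..6]"} or next;
    $difference{ $row[0] } = $row[1] - $other->[1];
}
printf "%s: %g\n", $_, $difference{$_} for keys %difference;
```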