
How to speed up the grep in this Perl script

Currently I have a script that needs to extract logs. Below is the Perl code snippet: the script traverses every server folder and greps for the necessary information. The problem is that when the number of logs is huge, the script can take a very long time to finish. The bottleneck is this line:

@leaf_lines = qx($grep -l "stagename = $current_stage" $grep_path| xargs $grep "Keywords")

I am wondering if there is any way to speed up this operation. The script is running on a server with 8 cores per CPU and 8 GB of memory; is there any way to use these resources?

my $grep = ($leaflog_zipped) ? "zgrep" : "grep" ;
my %leaf_info;
my @stage = ("STAGE1", "STAGE2", "STAGE3");
foreach my $leaf_dir (@leaf_dir_list){
    my $grep_path = $log_root_dir . "/$leaf_dir/*" ;          
    foreach my $current_stage (@stage){
        my @leaf_lines;
        @leaf_lines = qx($grep -l "stagename = $current_stage" $grep_path| xargs $grep "Keywords"); ## how to improve the grep speed?  
        foreach (@leaf_lines){
            if(...){
                $leaf_info{$current_stage}{xxx} = xxxx;
            }
        }    
    }
}

For starters - I'd say don't 'shell out' to grep - Perl has perfectly good built-in pattern matching and regular expressions, and includes the ability to precompile a regular expression.

http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators
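As a minimal sketch of what that looks like, here is the question's two-pattern match done in pure Perl with `qr//`-precompiled regexes; the sample lines and stage name are invented to mirror the question's log format:

```perl
use strict;
use warnings;

# Precompile the two patterns once with qr// instead of shelling out
# to grep for each of them; \Q...\E quotes any regex metacharacters
# in the interpolated stage name.
my $current_stage = 'STAGE1';
my $stage_re      = qr/stagename = \Q$current_stage\E/;
my $keyword_re    = qr/Keywords/;

# Stand-in log lines (invented for this demo).
my @lines = (
    'stagename = STAGE1 ... Keywords ...',
    'stagename = STAGE3 ... other ...',
);

# Keep only lines matching both patterns, as the grep|xargs grep
# pipeline in the question does at file granularity.
my @hits = grep { /$stage_re/ && /$keyword_re/ } @lines;
print scalar(@hits), "\n";    # prints 1
```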

Also - you can run Perl in parallel fairly easily using threading or forks, which makes better use of your CPU resources.
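A hedged sketch of the fork approach, using only core `fork`/`waitpid` (the directory names are placeholders): each child would scan one leaf directory on its own core, and real results would have to come back through files or a pipe, since a child cannot update the parent's `%leaf_info` directly.

```perl
use strict;
use warnings;

# Placeholder directory list; in the question this is @leaf_dir_list.
my @leaf_dir_list = ('dirA', 'dirB', 'dirC');

my @pids;
for my $leaf_dir (@leaf_dir_list) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child process: scan/grep $leaf_dir here (omitted), then exit.
        exit 0;
    }
    push @pids, $pid;    # parent: remember the worker
}

# Reap all workers before using their results.
waitpid($_, 0) for @pids;
print "all workers done\n";
```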

However I will point out - things like grep aren't generally CPU-bound problems. CPUs are pretty fast these days, whereas filesystems are generally a lot slower. You will probably spend more of your time reading data from disk than processing it, by quite a large margin.

So the thing that is likely giving you a lot of grief is that you grep multiple times.

my $grep_path = $log_root_dir . "/$leaf_dir/*" ;          
foreach my $current_stage (@stage)

Each element of @stage triggers another grep, and it does so for every file in that directory. And then you're grepping again.

That's a poor algorithm, because you'll be reading every file multiple times. Why not instead do something like:

#could do this with map - I haven't for clarity. 
my %stages;
$stages{'STAGE1'}++;
$stages{'STAGE2'}++;
$stages{'STAGE3'}++;

foreach my $file ( glob $grep_path ) {
    open( my $input_fh, "<", $file ) or die $!;
    while (<$input_fh>) {
        # match the "stagename = XXX" line and capture the stage name
        if ( my ($file_stage) = m/stagename = (\w+)/ ) {
            if ( $stages{$file_stage} ) {
                # do something here
            }
        }
    }
}

That way - whilst you do have to read every file - you only do so once.

Yes, definitely. Simply replace xargs with GNU Parallel or another similar program (there are multiple programs named parallel on some Linux systems, so be mindful of which one you have; GNU Parallel is probably the best).
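For example, on throw-away files under /tmp standing in for the logs. The GNU Parallel form is shown in a comment; the runnable variant below uses GNU xargs's -P flag, which also runs jobs in parallel and is more likely to already be installed:

```shell
# Stand-in log files (invented for this demo):
printf 'stagename = STAGE1\nKeywords here\n' > /tmp/demo_a.log
printf 'stagename = STAGE3\n'                > /tmp/demo_b.log

# Original shape:      grep -l ... | xargs grep "Keywords"
# GNU Parallel shape:  grep -l ... | parallel grep -l "Keywords"
# GNU xargs can parallelise too, via -P <number of jobs>:
grep -l "stagename = STAGE1" /tmp/demo_a.log /tmp/demo_b.log \
  | xargs -P 4 grep -l "Keywords"
```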
