
Perl: check if a line starts with a word from an array and return the matched word in a variable

I checked the topic "Perl check a line contains at list one word of an array", but I am still confused about how to make it more efficient for my case.

I use the example from the topic above.

I have an array, called @exampleWords:

my @exampleWords = ("balloon", "space", "monkey", "fruit" );

and I have a line containing a sentence, for example:

my $line = "monkey space always unlimited";

How can I check whether $line starts with a word from the array, and return the matched word in a variable?

In the example above, the matched word is "monkey".

The current solution I have in mind is to loop over each word in the array and check whether $line starts with $word:

my $matchWord = "";
foreach my $word(@exampleWords) {
  if ($line =~ /^$word/) {
    $matchWord = $word;
    last;
  }
}

I am still looking for a more efficient solution.

Thank you.

In principle, you have to iterate over the possible words to match. However, you can also build an alternation regex pattern from them, so that the regex engine is started only once, unlike in the loop, where it is started on every iteration. The iteration over the words then also happens inside highly optimized C code.

How do these compare? Let's benchmark them, using the core module Benchmark.
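Applied directly to the data in the question, the alternation approach might look like the following minimal sketch. The helper name first_word_match is made up for illustration, and compiling the pattern once with qr// is a choice made here, not something the benchmark below does.

```perl
use strict;
use warnings;

my @exampleWords = ("balloon", "space", "monkey", "fruit");
my $line = "monkey space always unlimited";

# Build the alternation once; quotemeta escapes any regex
# metacharacters that might appear in the words.
my $alternation = join '|', map { quotemeta } @exampleWords;
my $re = qr/^($alternation)/;

# Hypothetical helper: return the word the text starts with, or undef.
sub first_word_match {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? $1 : undef;
}

my $matchWord = first_word_match($line, $re);
print defined $matchWord ? "matched: $matchWord\n" : "no match\n";
```

With the example line this reports monkey as the matched word.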

For a tiny array, matching around its middle (your example)

use warnings;
use strict;

use Benchmark qw( cmpthese );

my @ary = ("balloon", "space", "monkey", "fruit");
my $line = "monkey space always unlimited";

sub regex {
    my ($line, @ary) = @_;
    my $match; 
    my $re = join '|', map { quotemeta } @ary;
    if ($line =~ /^($re)/) {
        $match = $1;
    }   
    return $match;
}   

sub loop {
    my ($line, @ary) = @_;
    my $match; 
    foreach my $word (@ary) {
        if ($line =~ /^$word/) {  # see note at end
            $match = $word;
            last;
        }   
    }   
    return $match;
}   

cmpthese(-10, {
    regex => sub { regex ($line, @ary) },
    loop  => sub { loop  ($line, @ary) },
}); 

This produces, on both a very good machine with v5.16 and an older one with v5.10:

          Rate  loop regex
loop  222791/s    --  -70%
regex 742962/s  233%    --

Thus the regex is far more efficient here.

For a 40 times larger array, matching around the middle

I build this array by @ary = qw(...) x 20, then add a word ('AHA'), then repeat the list 20 more times. I prepend that very word to the string, so that is what gets matched. I make the string much larger, too, even though this shouldn't matter for matching.
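The setup described above could be reconstructed roughly as follows. This is only a sketch: it assumes the elided qw(...) list is the original four words, and the factor used to lengthen the string is an arbitrary pick, since the answer does not state it.

```perl
use strict;
use warnings;

# Assumption: the repeated list is the four words from the question.
my @base = ("balloon", "space", "monkey", "fruit");

# 20 copies of the list, then the marker word, then 20 more copies.
my @ary = ((@base) x 20, 'AHA', (@base) x 20);

# Prepend the marker word to a much longer string (the factor 50 is arbitrary).
my $line = 'AHA ' . join ' ', ("monkey space always unlimited") x 50;

printf "%d words, marker at index %d\n", scalar @ary, 80;
```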

In this case the regex is even more convincing:

         Rate  loop regex
loop   9300/s    --  -82%
regex 50873/s  447%    --

and yet more so with v5.10 on the older machine, with 574%.

On v5.27.2 the regex is faster by 1188%, a clean order of magnitude. However, that is because the rate of the loop drops to only 6723/s, against the 9300/s above. So this only shows that the regex "startup" is more expensive in newer Perls, which makes the loop fall further behind.

For the same large array, with the match word near its beginning

I move the match word AHA in the array to right past the original 4-word list:

         Rate  loop regex
loop  36710/s    --   -3%
regex 37666/s    3%    --

So the match needs to happen very, very early for the loop to catch up with the regex. While this can happen often in specific use cases, it cannot be expected in general.

Note that the regex had far less work to do as well. Thus it's clear that the loop's problem is that it starts the regex engine anew for every iteration. Here it only had to do it a few times and the regex's advantage all but evaporated, even though it also matched much sooner.


As for programmer's efficiency, take your pick. There are yet other ways, using higher-level libraries, so that you don't have to write the loop. For instance, using the core List::Util:

use List::Util qw(first);

my $match = first { $line =~ /^$_/ } @ary;

When added to the benchmark, this comes out somewhere between equal to and around 10% slower than your loop.


A note on the regex used in the question.

If the first word in $line is puppy, the regex /^$word/ will match it with the word pup. This may or may not be intended (think also of flu matching fluent), but if it isn't, it can be fixed by adding the word boundary anchor \b:

$line =~ /^$word\b/

The same can be used with the alternation pattern, which was written to mimic the code in the question. So add the word boundary anchor there as well, for /^($re)\b/.
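To make the difference concrete, here is a small sketch comparing the alternation pattern with and without the \b anchor. The word list and the helper leading_word are invented for this illustration.

```perl
use strict;
use warnings;

# Illustrative word list, not the one from the question.
my @ary = ("pup", "flu", "monkey");
my $re  = join '|', map { quotemeta } @ary;

# Hypothetical helper: return the leading word that matched, or undef.
# With a true second argument the match must end at a word boundary.
sub leading_word {
    my ($line, $whole_word) = @_;
    my ($match) = $whole_word ? $line =~ /^($re)\b/ : $line =~ /^($re)/;
    return $match;
}

for my $line ("puppy runs", "pup runs", "fluent speaker") {
    printf "%-15s prefix: %-5s whole word: %s\n",
        $line,
        leading_word($line, 0) // 'none',
        leading_word($line, 1) // 'none';
}
```

Without the anchor, puppy and fluent are reported as matching pup and flu. With it, only the line that really starts with the word pup matches.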

Another way is to sort the list by word length, longest first, with sort { length $b <=> length $a } @ary, per Borodin's comment. This may affect the problem in more complex ways, so please consider it carefully.
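The effect of ordering shows up when one word in the list is a prefix of another, since the alternation takes the first alternative that matches. A minimal sketch, with an invented two-word list and a made-up helper build_re:

```perl
use strict;
use warnings;

# Hypothetical helper: anchored alternation built from the words in the order given.
sub build_re {
    my @words = @_;
    my $alt = join '|', map { quotemeta } @words;
    return qr/^($alt)/;
}

# Illustrative data: one word is a prefix of the other.
my @ary  = ("pup", "puppy");
my $line = "puppy runs fast";

my $unsorted = build_re(@ary);
my $sorted   = build_re(sort { length($b) <=> length($a) } @ary);

my ($m1) = $line =~ $unsorted;   # shorter word comes first and wins
my ($m2) = $line =~ $sorted;     # longest word is tried first

print "unsorted: $m1\n";
print "sorted:   $m2\n";
```

Here the unsorted pattern captures pup, while the length-sorted one captures puppy. Note that sorting changes which word wins but does not by itself stop pup from matching a line that starts with puppet, so it is not a drop-in replacement for \b.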
