Perl Regex match lines that contain multiple words

I am trying to develop a rather fast full text search. It will read the index, and should ideally run the matching in just one regex.

Therefore, I need a regex that matches lines only if certain words are contained.

E.g., for

my $txt="one two three four five\n".
        "two three four\n".
        "this is just a one two three test\n";

Only line one and three should be matched, since line two does not contain the word "one".

Now I could go through each line in a while() or use multiple regexes, but I need my solution to be fast.

The example from here: http://www.regular-expressions.info/completelines.html ("Finding Lines Containing or Not Containing Certain Words")

is what I need. However, I can't get it to work in Perl. I tried a lot, but it just doesn't come up with any result.

my $txt="one two three four five\ntwo three four\nthis is just a one two three test\n";
my @matches=($txt=~/^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$/gi);
print join("\n",@matches);

Gives no output.

In summary: I need a regex to match lines containing multiple words, and returning these whole lines.

Thanks in advance for your help! I tried so much, but just don't get it to work.

By default, the ^ and $ metacharacters match only at the start and end of the whole input string. To make them match at the start and end of each line, enable the m (multi-line) flag:

my $txt="one two three four five\ntwo three four\nthis is just a one two three test\n";
my @matches=($txt=~/^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$/gim);
print join("\n",@matches);

produces:

one two three four five
this is just a one two three test
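If the word list is not fixed, the same pattern can be assembled at run time. The following is only a sketch: the helper name lines_with_all_words is made up for illustration, and it assumes the words should be matched literally (hence quotemeta) and case-insensitively, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return every line of
# $text that contains all of @words as whole words.
sub lines_with_all_words {
    my ($text, @words) = @_;

    # One lookahead per word; quotemeta escapes any regex metacharacters.
    my $lookaheads = join '', map { '(?=.*?\b' . quotemeta($_) . '\b)' } @words;

    # /m makes ^ and $ match at line boundaries, /i ignores case.
    my $re = qr/^$lookaheads.*$/mi;

    # In list context with /g and no capture groups, each whole match is returned.
    my @lines = $text =~ /$re/g;
    return @lines;
}

my $txt = "one two three four five\ntwo three four\nthis is just a one two three test\n";
my @matches = lines_with_all_words($txt, qw(one two three));
print "$_\n" for @matches;
```

With the sample text from the question this should print the same two lines as above.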

But if you really want a fast search, I don't think a single regex with many look-aheads is the way to go.
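One alternative without look-aheads is to split the text into lines and test each line against one small precompiled regex per word. This is a minimal sketch under the same assumptions as the question (whole words, case-insensitive); the variable names are illustrative, and whether it is actually faster depends on the data, so it is worth benchmarking.

```perl
use strict;
use warnings;

my $txt   = "one two three four five\ntwo three four\nthis is just a one two three test\n";
my @words = qw(one two three);

# Precompile one small regex per required word; \Q...\E quotes metacharacters.
my @word_res = map { qr/\b\Q$_\E\b/i } @words;

my @matches;
LINE: for my $line (split /\n/, $txt) {
    for my $re (@word_res) {
        # Skip this line as soon as one required word is missing.
        next LINE unless $line =~ $re;
    }
    push @matches, $line;
}

print "$_\n" for @matches;
```

The early exit means a line is dropped at the first missing word, so putting the rarest word first in @words may help.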

Code:

use 5.012;
use Benchmark qw(cmpthese);
use Data::Dump;
use once;

our $str = <<STR;
one thing
another two
three to go
no war
alone in the dark
war never changes
STR

our @words = qw(one war two);

cmpthese(100000, {
    'regexp with o'             => sub {
        my @m;
        my $words = join '|', @words;
        @m = $str =~ /(?!.*?\b(?:$words)\b)^(.*)$/omg;
        ONCE { say 'regexp with o:'; dd @m }
    },
    'regexp'                    => sub {
        my @m;
        @m = $str =~ /(?!.*?\b(?:@{ [ join '|', @words ] })\b)^(.*)$/mg;
        ONCE { say 'regexp:'; dd @m }
    },
    'while'                     => sub {
        my @m;
        @m = grep $_ !~ /\b(?:@{ [ join '|',@words ] })\b/,(split /\n/,$str);
        ONCE { say 'while:'; dd @m }
    },
    'while with o'              => sub {
        my @m;
        my $words = join '|',@words;
        @m = grep $_ !~ /\b(?:$words)\b/o,(split /\n/,$str);
        ONCE { say 'while with o:'; dd @m }
    }
})

Resulting:

regexp:
("three to go", "alone in the dark")
regexp with o:
("three to go", "alone in the dark")
while:
("three to go", "alone in the dark")
while with o:
("three to go", "alone in the dark")
                 Rate        regexp regexp with o         while  while with o
regexp        19736/s            --           -2%          -40%          -60%
regexp with o 20133/s            2%            --          -38%          -59%
while         32733/s           66%           63%            --          -33%
while with o  48948/s          148%          143%           50%            --

Conclusion

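So the variants labeled "while" (which split the text into lines and filter them with grep) are faster than the single-regexp variants: roughly 1.6 to 2.5 times as fast in this benchmark.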
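The /o variants are faster because the interpolated pattern is compiled only once. The same effect can be had with qr//, which perlop recommends over /o. A sketch of the grep variant written that way, reusing the benchmark's sample data (the quotemeta call is an addition, assuming the words are meant literally):

```perl
use strict;
use warnings;

my $str = <<'STR';
one thing
another two
three to go
no war
alone in the dark
war never changes
STR

my @words = qw(one war two);

# Compile the alternation once with qr// instead of relying on /o.
my $alt = join '|', map { quotemeta } @words;
my $re  = qr/\b(?:$alt)\b/;

# Keep only the lines that contain none of the words.
my @m = grep { $_ !~ $re } split /\n/, $str;

print "$_\n" for @m;
```

This has not been timed against the variants above, so treat any speed expectation as an assumption until benchmarked.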
