Perl regex: match something, but make sure the matched string does not contain a given string

I have files with sequences of conversations where speakers are tagged. The format of my files is:

<SPEAKER>John</SPEAKER>
I am John
<SPEAKER>Lisa</SPEAKER>
And I am Lisa

I want to identify the first sequence in each document in which John speaks and Lisa speaks right afterwards. I then want to retain the entire part of the document from that sequence onward, including the sequence itself.

I built this regex:

^.*?(<SPEAKER>John<\/SPEAKER>.*?<SPEAKER>Lisa<\/SPEAKER>.*)

but of course it also captures the case where the sequence of speakers is John-Michael-Lisa, i.e. where someone else speaks between John and Lisa.

How can I get the right match?

Here is a regex you can use to match what you describe:

(<SPEAKER>John<\/SPEAKER>(?:(?!<SPEAKER>).)*<SPEAKER>Lisa<\/SPEAKER>.*)

And a small demo showing that it works: https://regex101.com/r/iW8vS5/1
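To try the pattern in Perl itself, here is a minimal sketch. It assumes the whole document is already in one string, and it adds the /s modifier so that . also matches newlines, which the pattern needs because each tag sits on its own line. The sample text and the variable names $text and $kept are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: a John-Michael-Lisa run first, then the wanted John-Lisa pair.
my $text = <<'END';
<SPEAKER>John</SPEAKER>
Hi
<SPEAKER>Michael</SPEAKER>
Hello
<SPEAKER>Lisa</SPEAKER>
Hey
<SPEAKER>John</SPEAKER>
I am John
<SPEAKER>Lisa</SPEAKER>
And I am Lisa
<SPEAKER>Michael</SPEAKER>
Bye
END

my $kept = '';
# (?:(?!<SPEAKER>).)* consumes one character at a time, but only while no new
# <SPEAKER> tag starts at that position, so no third speaker can slip in
# between John and Lisa. The /s modifier lets . cross line boundaries.
if ($text =~ m{(<SPEAKER>John</SPEAKER>(?:(?!<SPEAKER>).)*<SPEAKER>Lisa</SPEAKER>.*)}s) {
    $kept = $1;
}

print $kept;    # starts at the second John tag and runs to the end of $text
```

Using m{...} as the delimiter avoids having to escape the slash in the closing tags.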

However, as both kchinger and owler mentioned, regex probably isn't the best way to do this. A regex solution would likely be significantly slower than a small snippet of code for any long document.

This isn't a pure regex solution (maybe someone else can provide one). Instead, I wrote a small loop that checks each line: once it finds the sequence you want, it keeps the rest of the document. If the input isn't a full document, you would need to feed it the correct sequence of lines. A single regex for this would be relatively complicated and might be slow, but if speed is important you would have to benchmark the loop against a pure regex solution (if someone comes up with one).

Edit to note: (?!Lisa) is a negative lookahead, in case you haven't seen one. Combining several negative lookaheads might be what you need to do this in one regex, but good luck reading it later.
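As a quick aside before the loop itself, this small sketch shows what the negative lookahead does on individual tag lines. The @tags list and the @not_lisa name are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical tag lines; the speaker names are only for illustration.
my @tags = (
    '<SPEAKER>John</SPEAKER>',
    '<SPEAKER>Lisa</SPEAKER>',
    '<SPEAKER>Michael</SPEAKER>',
);

# (?!Lisa) succeeds only where the next characters are NOT "Lisa".
# It consumes nothing, so the match itself ends right after "<SPEAKER>".
my @not_lisa = grep { /<SPEAKER>(?!Lisa)/ } @tags;

print "$_\n" for @not_lisa;    # the John and Michael lines
```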

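use strict;
use warnings;

open(my $input, '<', 'input2.txt') or die "can't open the file: $!";

my $output = "";
my $wanted = 0;    # 0 = searching, 1 = John is speaking, 2 = John-Lisa found
while (<$input>)
{
    if ($wanted < 2)    # once the pair is found, keep every remaining line
    {
        if (/<SPEAKER>John<\/SPEAKER>/)
        {
            $wanted = 1;
            $output = "";    # (re)start the capture at this John
        }
        elsif (/<SPEAKER>Lisa<\/SPEAKER>/ && $wanted == 1)
        {
            $wanted = 2;
        }
        elsif (/<SPEAKER>(?!Lisa<)/ && /<SPEAKER>(?!John<)/)
        {
            # someone else spoke: drop the candidate and keep searching
            $wanted = 0;
            $output = "";
        }
    }
    $output .= $_ if $wanted;
}
close($input);

print $output if $wanted == 2;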
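Since speed came up, here is one way the comparison could be set up with the core Benchmark module. This is a sketch under assumptions: the document is synthetic, the sub names by_regex and by_loop are hypothetical, and the loop is a compact in-memory variant of the line scan rather than the exact file-reading code above. Timings depend on the input and the Perl version, so no numbers are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic document: many John-Michael turns, then one John-Lisa pair at the end.
my $doc = "<SPEAKER>John</SPEAKER>\nHi\n<SPEAKER>Michael</SPEAKER>\nHello\n" x 500
        . "<SPEAKER>John</SPEAKER>\nI am John\n<SPEAKER>Lisa</SPEAKER>\nAnd I am Lisa\n";

# Approach 1: the single regex, applied to the whole document as one string.
sub by_regex {
    my ($text) = @_;
    return $text =~ m{(<SPEAKER>John</SPEAKER>(?:(?!<SPEAKER>).)*<SPEAKER>Lisa</SPEAKER>.*)}s
        ? $1 : '';
}

# Approach 2: a line-by-line scan with a small state variable.
sub by_loop {
    my ($text) = @_;
    my ($out, $state) = ('', 0);    # 0 = searching, 1 = after John, 2 = pair found
    for my $line (split /^/, $text) {    # split /^/ keeps each line's newline
        if ($state < 2 && $line =~ /<SPEAKER>/) {
            if    ($line =~ /<SPEAKER>John<\/SPEAKER>/)                { ($state, $out) = (1, '') }
            elsif ($state == 1 && $line =~ /<SPEAKER>Lisa<\/SPEAKER>/) { $state = 2 }
            else                                                       { ($state, $out) = (0, '') }
        }
        $out .= $line if $state;
    }
    return $state == 2 ? $out : '';
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex => sub { by_regex($doc) },
    loop  => sub { by_loop($doc) },
});
```

Checking that both subs return the same string before trusting the timings is a sensible first step, since a faster but wrong variant is of no use.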
