
Perl regular expression / substitution for nested phrases

I have a Perl script that processes a text file line by line and converts phrases within those lines to links (specifically in MediaWiki markup, but I suspect any markup would have the same issue). Where I get stuck is when one phrase is contained in another. In these cases too many links are created.

For example, if "General Committee" and "Annual General Committee Meeting" are two of the phrases:

The General Committee meeting shall meet once a month.

is converted correctly to:

The [[#GC|General Committee]] meeting shall meet once a month.

However,

The Annual General Committee Meeting shall be held in May.

is incorrectly converted to:

The [[#AGCM|Annual [[#GC|General Committee]] Meeting]] shall be held in May.

That is, my script is finding the phrase "General Committee" within "Annual General Committee Meeting" and inserting a link where I don't want it. There should only be a link to the AGCM in this example.

The relevant Perl code is:

my($line) = $_;
foreach $phrase (keys(%phrases))  # the phrases to replace mapped to their links
{
    my($link) = $phrases{$phrase};
    if ($line =~ m/$phrase/)
    {
        $line =~ s/$phrase/[[#$link|$phrase]]/g;
    }
}

Any suggestions on how to avoid matching / substituting when one phrase can be found within another?

UPDATE: Clarification based on some of the questions: Each phrase stands alone; there is no priority of one over another. Taking the longest over the shortest is sufficient to get what I need.

You should build a regular expression that matches any of the hash keys in one comparison.

This program shows the idea. The keys are sorted by decreasing length so that the longest match is found first, and then concatenated with the | alternation character as a separator.

Then it is simply a matter of finding all occurrences of the built pattern and replacing it with the corresponding hash element value. This can be done in a single substitution instead of needing a loop.

Note that you may want to interpose a map that replaces the whitespace in each phrase with \s+, and perhaps put \b before and after the phrases to ensure that a match isn't part of a longer word. The /i regex modifier may also be relevant if you want case-insensitive matching.

use strict;
use warnings;

my %phrases = (
  'General Committee' => '[[#GC|General Committee]]',
  'Annual General Committee Meeting' => '[[#AGCM|Annual General Committee Meeting]]',
);

my $text = <<END;
The General Committee meeting shall meet once a month.
The Annual General Committee Meeting shall be held in May.
END

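# Longest phrases first, so that when one phrase is a prefix of another
# the longer one wins.
my $regex = join '|', sort { length $b <=> length $a } keys %phrases;

# One pass: each match is replaced by its ready-made link from the hash.
$text =~ s/($regex)/$phrases{$1}/g;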

print $text, "\n";

Output:

The [[#GC|General Committee]] meeting shall meet once a month.
The [[#AGCM|Annual General Committee Meeting]] shall be held in May.
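
The refinements mentioned in the note above (\s+ for whitespace, \b around the phrases, and /i) could be combined roughly as follows. This is only a sketch under a few assumptions: the hash maps each phrase to its anchor id, as in the question's code, rather than to a ready-made link; the helper names normalise, make_link and link_phrases are made up for illustration; and quotemeta is an extra precaution, not part of the note, in case a phrase contains regex metacharacters. Because /i and \s+ let the matched text differ from the hash key in case and spacing, the lookup goes through a normalised copy of each key.

```perl
use strict;
use warnings;

# Phrase => anchor id, the shape used in the question's %phrases hash.
my %phrases = (
    'General Committee'                => 'GC',
    'Annual General Committee Meeting' => 'AGCM',
);

# Hypothetical helper: lower-case the text and collapse whitespace runs,
# so that matched text can be compared with the hash keys.
sub normalise {
    my ($s) = @_;
    $s =~ s/\s+/ /g;
    return lc $s;
}

# Lookup table keyed by the normalised form of each phrase.
my %link_for = map { ( normalise($_) => $phrases{$_} ) } keys %phrases;

# Longest phrases first. Each word is escaped with quotemeta and the words
# are rejoined with \s+ so that any run of whitespace matches.
my $alternation = join '|',
    map  { join '\s+', map { quotemeta($_) } split ' ', $_ }
    sort { length $b <=> length $a } keys %phrases;

# \b on both sides keeps a match from being part of a longer word.
my $regex = qr/\b($alternation)\b/i;

# Hypothetical helper: build the link, keeping the text as it was written.
sub make_link {
    my ($matched) = @_;
    my $id = $link_for{ normalise($matched) };
    return "[[#$id|$matched]]";
}

sub link_phrases {
    my ($text) = @_;
    $text =~ s/$regex/make_link($1)/ge;
    return $text;
}

print link_phrases("The annual general  committee meeting is in May.\n");
print link_phrases("The General Committees page is unaffected.\n");
```

The first line is linked to AGCM even though its case and spacing differ from the hash key, while the second is left alone because \b does not match between "Committee" and the trailing "s".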
