
Multiple substitutions with a single regular expression, taking position into account, in Perl

I want to read a file whose lines have the form "string1 string2 string3" and replace several character sequences in it, with each character being substituted at most once. Example rules: tsch => tch, ch => h, ki => ky (but only if ki is at the end of a word). So "tschaiki" should become "tchaiky" and not "thaiky", which is what happens with a for loop or several separate substitution commands.

I know this question has been asked before and was solved by building a hash in Perl:

my $line = <>;
my %replace =(j=> "y", ss=> "s", u=> "ou", tsch=> "ch"); #short versions of the rules
my $regex = join "|", keys %replace;    

$regex = qr/$regex/;
$line=~s/($regex)/$replace{$1}/g;

This works for me so far, but I would like some sequences to be substituted only at the end of the string, and that causes problems. I have extended the code above with a second regex and hash just for the endings:

 my %replace_end =(ia=> "iya", ki=> "ky",ei=> "ey" );
 my $regex_end = join "|", keys %replace_end;
 $regex_end = qr/$regex_end/; 
 $line=~s/($regex_end)$/$replace_end{$1}/g;  # saying just to substitute at the end 

My whole code is as follows, but either it throws an exception or the endings are ignored (I think the code without the file handling and while loop did actually work):

#!/usr/bin/perl
use strict;
use warnings;

open(INP, "<:utf8", "dt_namen.txt") or die "Cannot open dt_namen.txt: $!";
open(OUT, ">:utf8", "dt_zu_engl.txt") or die "Cannot open dt_zu_engl.txt: $!";

my %replace =(j=> "y", ss=> "s", tsch=> "ch", sch => "sh", c => "k", J=> "Y", Ss=>"s"); 
 my $regex = join "|", keys %replace;  
 $regex = qr/$regex/;

 my %replace_end =(ki=> "ky",ei=> "ey" );
 my $regex_end = join "|", keys %replace_end;
 $regex_end = qr/$regex_end/; 

while(my $line= <INP>){
 $line=~s/($regex)/$replace{$1}/g;
 $line=~s/($regex_end)$/$replace_end{$1}/g;  # saying just to substitute at the end 
 print $line;
 print OUT "$line";
}
close INP;
close OUT;

Your code has a potential problem: the order of the alternatives is undefined. If two patterns can match at the same position (for example, when one key is a prefix of another), there is no telling which one will win. Perl tries alternatives from left to right, and hashes have no defined order, so right now the behavior is not guaranteed.

Fix this by performing a sort when you construct the regex:

my $regex = join "|", sort {length($b) <=> length($a)} keys %replace;

This will sort the terms in descending order of length, so you will be sure to always match the longest term first.
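As a rough sketch of why the ordering matters, the snippet below builds the alternation longest-first from the question's rules. The rule s => "z" is a hypothetical addition (it is not in the question), included only so that one key is a prefix of others; transliterate is likewise just an illustrative helper name, and quotemeta is an extra precaution in case a key ever contains regex metacharacters.

```perl
use strict;
use warnings;

# Rules from the question, plus one hypothetical rule (s => "z") that is
# NOT in the question: it makes "s" a prefix of "ss" and "sch".
my %replace = (j => "y", ss => "s", tsch => "ch", sch => "sh", c => "k", s => "z");

# Longest keys first, so "sch" and "ss" are tried before the shorter "s".
# The "|| $a cmp $b" tie-breaker only makes the generated regex deterministic.
my $alternation = join "|",
    map  { quotemeta }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %replace;
my $regex = qr/($alternation)/;

# Illustrative helper: apply every rule in a single left-to-right pass.
sub transliterate {
    my ($text) = @_;
    $text =~ s/$regex/$replace{$1}/g;
    return $text;
}

print transliterate("tschaiki schule"), "\n";   # prints "chaiki shule"
print transliterate("jessica"), "\n";           # prints "yesika"
```

Because s///g continues scanning after each replacement rather than rescanning the replaced text, the "ch" produced from "tsch" is not touched again by the c rule. Without the sort, "schule" could come out as "zkhule" whenever "s" happened to precede "sch" in the alternation.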

Update: to replace only at the end of the string, try this:

my $regex_end = join "|", map { qr/$_$/ } keys %replace_end;

It puts a $ at the end of each term. $ matches at the end of the string, or just before a trailing newline, so lines read from a file do not need to be chomped first.

Or, if you mean to replace only at the end of a word, do this:

my $regex_end = join "|", map { qr/$_\b/ } keys %replace_end;
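One possible way to combine both rule sets into a single pass, so that no text is substituted twice, is sketched below. This is an assumption about what the asker wants rather than part of the answer above: the helper names alternation and transliterate are made up for the illustration, and the /e modifier is used to pick the right hash depending on which capture group matched.

```perl
use strict;
use warnings;

my %replace     = (j => "y", ss => "s", tsch => "ch", sch => "sh", c => "k");
my %replace_end = (ki => "ky", ei => "ey");

# Illustrative helper: longest-first alternation of a rule hash's keys.
sub alternation {
    my ($rules) = @_;
    return join "|",
        map  { quotemeta }
        sort { length($b) <=> length($a) || $a cmp $b }
        keys %$rules;
}

my $end_alt = alternation(\%replace_end);
my $any_alt = alternation(\%replace);

# Group 1: ending rules, matched only directly before a word boundary.
# Group 2: general rules, matched anywhere.
my $regex = qr/($end_alt)\b|($any_alt)/;

# Illustrative helper: one pass over the text, choosing the hash by group.
sub transliterate {
    my ($text) = @_;
    $text =~ s/$regex/defined $1 ? $replace_end{$1} : $replace{$2}/ge;
    return $text;
}

print transliterate("tschaiki kirsch reis"), "\n";   # prints "chaiky kirsh reis"
```

Here the "ki" in "kirsch" and the "ei" in "reis" are left alone because no word boundary follows them, while the "ki" at the end of "tschaiki" becomes "ky".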
