Perl regex substitution using a hash

Question

Is there an efficient way to substitute a bunch of strings using values from a Perl hash?

For example,

$regex{foo} = "bar";
$regex{hello} = "world";
$regex{python} = "perl";

open(F, "myfile.txt");
while (<F>) {
      foreach $key (keys %regex) {
            s/$key/$regex{$key}/g;
      }
}
close(F);

Is there a way to accomplish the above in Perl?

Answer 1

First question: are you sure that what you have is inefficient?

Second, the most obvious next step would be to pull everything into a single regex:

my $check = join '|', keys %regex;

And then you can do the substitution as:

s/($check)/$regex{$1}/g;

This can still be "slow" when the keys overlap heavily, because the regex engine has to recheck the same letters repeatedly. You could use something like Regexp::Optimizer to eliminate the overlap. But the cost of optimising may exceed the cost of just doing everything, depending on how many changes (key/value pairs in your hash) and how many lines you're modifying. Beware premature optimisation!

Note that your example code doesn't do anything with the text after the substitution. It won't modify the file in place, so I'm assuming you're handling that separately.

Answer 2

To prove the point of eval and also out of curiosity, I ran some tests with the OP's code vs. the $regex{$1} approach vs. the eval approach.

First off, there seems to be little value in cramming every possible token into a (token|token|...) match expression. Perl then needs to check against all tokens at once; it is debatable how much more efficient this is than simply checking one token at a time and doing the replacement with a hardcoded value.

Secondly, doing $regex{$1} means a hash lookup is performed on every match.

Anyway, here are some numbers (run on Strawberry Perl 5.12, with a 4MB file of 100K lines):

The $regex{$1} approach takes 6 seconds (5 seconds with /go instead of /g)
The tie approach takes 10 seconds
The OP approach takes a bit under 1 second (with /go instead of /g)
The eval approach takes less than 1 second (faster than the OP code)

This is the eval approach:

$regex{foo} = "bar";
$regex{hello} = "world";
$regex{python} = "perl";
$regex{bartender} = "barista";

$s = <<HEADER;
\$start = time;
open(F, "myfile.txt");
while (<F>) {
HEADER

foreach $key (keys %regex) {
   $s .= "s/$key/$regex{$key}\/go;\n"
}

$s .= <<FOOTER;
print \$_;
}
close(F);
print STDERR "Elapsed time (eval.pl): " . (time - \$start) . "\r\n";
FOOTER

eval $s;
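The same idea can be sketched without putting the file loop inside the generated code: build the source of an anonymous sub with one hardcoded substitution per key, compile it once with eval, and call it per line. This is only an illustration, not the code that was benchmarked above. The name build_replacer is made up for the example, and it assumes the keys are literal strings (hence the quotemeta on both sides).

```perl
use strict;
use warnings;

my %regex = (foo => 'bar', hello => 'world', python => 'perl');

# Hypothetical helper: returns a compiled sub that applies every replacement.
sub build_replacer {
    my (%map) = @_;
    my $src = "sub {\n    my (\$text) = \@_;\n";
    # Sorted only so the generated code is the same on every run.
    for my $key (sort keys %map) {
        # quotemeta keeps both sides literal inside the generated s///.
        my ($from, $to) = (quotemeta $key, quotemeta $map{$key});
        $src .= "    \$text =~ s/$from/$to/g;\n";
    }
    $src .= "    return \$text;\n}\n";
    my $sub = eval $src or die "generated code failed to compile: $@";
    return $sub;
}

my $replace = build_replacer(%regex);
print $replace->("hello from python, foo"), "\n";   # prints "world from perl, bar"
```

Because the substitutions still run one after another, this variant keeps the ordering behaviour of the original loop, including any double substitutions.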

Answer 3

Define a regexp that matches any of the keys.

$regex = join("|", map {quotemeta} keys %regex);

Replace any match of $regex by $regex{$1}.

s/($regex)/$regex{$1}/go;

Omit the o modifier if $regex changes during the execution of the program.

Note that if some keys are a prefix of another key (e.g. f and foo), whichever comes first in the joined regexp will be seen as a match (e.g. f|foo matches f, but foo|f matches foo, in foobar). If that can happen, you may need to sort keys %regex according to which match you want to win. (Thanks to ysth for pointing this out.)
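One common way to settle the prefix question is to put longer keys first, so that foo is always tried before f. The sketch below assumes that "longest key wins" is the behaviour you want; the sub name replace_all and the sample keys are invented for the example.

```perl
use strict;
use warnings;

my %regex = (f => 'X', foo => 'bar', hello => 'world');

# Longest keys first, so "foo" is tried before its prefix "f".
# The secondary string comparison only makes the order deterministic.
my $alternation = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %regex;
my $re = qr/($alternation)/;

sub replace_all {
    my ($text) = @_;
    $text =~ s/$re/$regex{$1}/g;
    return $text;
}

print replace_all("foobar f hello"), "\n";   # prints "barbar X world"
```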

Answer 4

perl -e '
          my %replace =  (foo => "bar", hello => "world", python => "perl");
          my $find    =  join "|", sort keys %replace;
          my $str     =  "foo,hello,python";
          $str        =~ s/($find)/$replace{$1}/g;
          print "$str\n\n";
        '

Something you may want to consider is not going line-by-line through the file, but instead processing the whole file at once and using the /s modifier on your regex for single-line mode.

Answer 5

What you have works as is, so it's not clear what your request is.

One catch: the code you posted may have problems with double substitutions, depending on the contents of %regex and/or $_. For example,

my %regex = (
   foo => 'bar',
   bar => 'foo',
);

The solution is to move the foreach into the pattern, so to speak.

my $pat =
   join '|',
    map quotemeta,  # Convert text to regex patterns.
     keys %regex;

my $re = qr/$pat/;  # Precompile for efficiency.

my $qfn = 'myfile.txt';
open(my $fh, '<', $qfn) or die "open: $qfn: $!";
while (<$fh>) {
   s/($re)/$regex{$1}/g;
   # ... do something with $_ ...
}
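To make the double-substitution catch concrete, here is a small sketch that runs both strategies on the foo/bar swap from above. The sub names are invented for the example, and the keys are sorted only so that the per-key loop runs in a repeatable order.

```perl
use strict;
use warnings;

my %regex = (foo => 'bar', bar => 'foo');

# One pass per key: the output of an earlier pass can be rewritten by a later one.
sub replace_sequential {
    my ($text) = @_;
    for my $key (sort keys %regex) {
        $text =~ s/\Q$key\E/$regex{$key}/g;
    }
    return $text;
}

# Single pass: each position in the input is matched at most once.
my $pat = join '|', map { quotemeta } sort keys %regex;
my $re  = qr/($pat)/;

sub replace_single_pass {
    my ($text) = @_;
    $text =~ s/$re/$regex{$1}/g;
    return $text;
}

print replace_sequential("foo bar"), "\n";    # prints "bar bar"
print replace_single_pass("foo bar"), "\n";   # prints "bar foo"
```

With the per-key loop, bar is first turned into foo and then every foo is turned back into bar, so the swap is lost. The single-pass version never rescans text it has already replaced.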

Answer 6

The beginning:

#!/usr/bin/perl
use strict;
use Tie::File;

my %tr=(   'foo' => 'bar',
            #(...)
        );
my $r =join("|", map {quotemeta} keys %tr);
$r=qr|$r|;

With big files, use:

tie my @array, "Tie::File", $ARGV[0] or die "tie: $ARGV[0]: $!";
for (@array) { 
    s/($r)/$tr{$1}/g;
}
untie @array;

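With small files, use:

open my $fh, '<', $ARGV[0] or die "open: $ARGV[0]: $!";
local $/ = undef;    # slurp mode: read the whole file at once
my $t = <$fh>;
close $fh;
$t =~ s/($r)/$tr{$1}/g;
open $fh, '>', $ARGV[0] or die "open: $ARGV[0]: $!";
print $fh $t;
close $fh or die "close: $ARGV[0]: $!";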

Answer 7

This is an old question, so I'm surprised no one has yet suggested the obvious: pre-compile each of the regexps (i.e. the hash keys).

$regex{qr/foo/} = 'bar';
$regex{qr/hello/} = 'world';
$regex{qr/python/} = 'perl';

open(F, "myfile.txt");
while (<F>) {
      foreach $key (keys %regex) {
            s/$key/$regex{$key}/g;
      }
}
close(F);

or, for (IMO) greater readability:

%regex = (
    qr/foo/    => 'bar',
    qr/hello/  => 'world',
    qr/python/ => 'perl',
);

If you know that there can be only one possible match per input line, then skipping the remaining regexps with last after a successful match will also help if there are a lot of keys, e.g. inside the for loop:

s/$key/$regex{$key}/g && last;
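One caveat worth checking before relying on this: Perl hash keys are always plain strings, so a qr// object used as a key is stringified and the compiled form is not kept. The sketch below shows that, and then one possible alternative that does keep the compiled objects, an array of [pattern, replacement] pairs. The names @rules and apply_rules are invented for the example, and no timing claim is made for it.

```perl
use strict;
use warnings;

# A qr// object used as a hash key comes back as an ordinary string.
my %by_key = (qr/foo/ => 'bar');
my ($key) = keys %by_key;
print ref($key) ? "key is a Regexp object\n" : "key is a plain string\n";

# An array of [pattern, replacement] pairs keeps the compiled objects
# and also fixes the order in which the rules are applied.
my @rules = (
    [ qr/foo/    => 'bar'   ],
    [ qr/hello/  => 'world' ],
    [ qr/python/ => 'perl'  ],
);

sub apply_rules {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($re, $to) = @$rule;
        $text =~ s/$re/$to/g;
    }
    return $text;
}

print apply_rules("hello python"), "\n";   # prints "world perl"
```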

Answer 1
5 2011-06-21 19:40:45

Answer 2
4 2011-06-22 06:44:31

Answer 3
3 2011-06-21 19:41:33

Answer 4
1 2011-06-21 20:20:12

Answer 5
1 2011-06-21 20:44:42

Answer 6
1 2011-06-21 21:05:02

Answer 7
0 2015-10-29 02:02:51
