How can I efficiently handle multiple Perl search/replace operations on the same string?

So my Perl script basically takes a string and then tries to clean it up by doing multiple search-and-replace operations on it, like so:

$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\(\{\[]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The 
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...

As you can see, I'm dealing with nasty HTML and have to beat it into submission.

I'm hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.

I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:

%rxcheck = (
    'time of day' => '\d+:\d+',
    'starts with capital letters then a capital word' => '^([A-Z]+\s)+[A-Z][a-z]',
    'ends with a single capital letter' => '\b[A-Z]\.',
);

And this is how I use it:

foreach my $key (keys %rxcheck) {
    # No /g here: in a boolean test it would leave pos($snippet) set between iterations
    if ($snippet =~ /$rxcheck{$key}/) {
        # blah blah
    }
}

The problem comes up when I try my hand at a hash where the key is the expression and the value is what I want to replace it with... and there is a $1 or $2 in it.

%rxcheck2 = (
    '(\w) \"' => '$1\"',
);

The above is meant to do this:

$snippet =~ s/(\w) \"/$1\"/g;

But I can't seem to pass the "$1" part into the regex literally (I think that's the right word... it seems the $1 is being interpreted even though I used ' marks). So this results in:

if($snippet =~ /$key/$rxcheck2{ $key }/g){  }

And that doesn't work.

So, two questions:

Easy: How do I handle a large number of regexes in an easily editable way, so I can change and add them without just cutting and pasting the line before?

Harder: How do I handle them using a hash (or an array, if I have multiple pieces I want to include, such as 1) the part to search for, 2) the replacement, 3) a comment, 4) global/case-insensitive modifiers), if that is in fact the easiest way to do this?

Thanks for your help -

Problem #1

As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:

$text =~ s/foo/bar/g;

You can just say:

s/foo/bar/g;

A common idiom for doing this is to use a degenerate for() loop as a topicalizer:

for($text)
{
  s/foo/bar/g;
  s/qux/meh/g;
  ...
}

The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
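As a small illustration of the point about $_, here is a sketch with made-up sample data (the strings and the extra trim rule are assumptions for the demo, not part of the question). It sets $_ beforehand, runs a few substitutions inside the for() topicalizer, and then prints both variables.

```perl
use strict;
use warnings;

$_ = "outer value";                     # pre-existing value of $_
my $text = "<b>foo</b>   and  qux";     # hypothetical sample input

for ($text) {                           # aliases $_ to $text inside the block
    s/<[^>]+>/ /g;                      # replace tags with a space
    s/foo/bar/g;
    s/qux/meh/g;
    s/\s+/ /g;                          # collapse runs of whitespace
    s/^\s+|\s+$//g;                     # trim both ends
}

print "$text\n";                        # the cleaned-up string
print "$_\n";                           # $_ is restored after the loop
```

Because $_ is an alias inside the loop, the substitutions modify $text itself; no copy back is needed.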

At this point, you've eliminated almost every non-boilerplate character. How much shorter can it get, even in theory?

Unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, or count all the regexes.

Problem #2

You can use the qr// syntax to quote the "search" part of the substitution:

my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;

However, I don't know of a way of quoting the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:

1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.

2. Use an array of anonymous subroutines:

my @replacements = (
    sub { $_[0] =~ s/<[^>]+>/ /g; },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
    sub { $_[0] =~ s/\s+[<>]+\s+/\. /g },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
    sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);

# Assume your data is in $_
foreach my $repl (@replacements) {
    $repl->($_);
}

You could of course use a hash instead, with a more descriptive key for each entry, and/or you could use multivalued elements (or hash values) that include comments or other information.
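One way the multivalued idea might look is sketched below. The field names (name, code), the sample text, and the trim rule are all assumptions made for illustration. Each rule carries a description next to its subroutine, and because s///g returns the number of substitutions it made, the loop can also report how often each rule fired.

```perl
use strict;
use warnings;

# Hypothetical rule table: a description plus a sub that edits its
# argument in place and returns the substitution count from s///g.
my @rules = (
    { name => 'strip tags',          code => sub { $_[0] =~ s/<[^>]+>/ /g } },
    { name => 'collapse whitespace', code => sub { $_[0] =~ s/\s+/ /g } },
    { name => 'trim',                code => sub { $_[0] =~ s/^\s+|\s+$//g } },
);

my $text = "<p>Hello   <b>world</b></p>";   # made-up sample input
my %hits;

for my $rule (@rules) {
    # s///g returns an empty string when nothing matched, hence "|| 0"
    $hits{ $rule->{name} } = $rule->{code}->($text) || 0;
}

print "$text\n";
printf "%-20s %d\n", $_->{name}, $hits{ $_->{name} } for @rules;
```

An array keeps the rules in the order they were written, which matters when a later substitution depends on an earlier one.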

Hashes are a poor fit here because they are unordered. I find that an array of arrays, where each inner array holds a compiled regex and a string to eval (actually a double eval, via /ee), works best:

#!/usr/bin/perl

use strict;
use warnings;

my @replace = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my $s = "foo bar baz foo bar baz";

for my $replace (@replace) {
    $s =~ s/$replace->[0]/$replace->[1]/gee;
}

print "$s\n";
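A possible variation on the table above, offered only as a sketch: store the replacement as a code reference and use a single /e, so no string is ever evaluated as code. Passing $1 to the sub explicitly is a choice made for this example, not something taken from the answers here.

```perl
use strict;
use warnings;

# Each entry: [ compiled pattern, sub that builds the replacement text ].
# The first capture is handed to the sub as its argument.
my @replace = (
    [ qr/(bar)/ => sub { my ($cap) = @_; "<$cap>" } ],
    [ qr/foo/   => sub { "bar" } ],
);

my $s = "foo bar baz foo bar baz";

for my $r (@replace) {
    my ($pattern, $build) = @$r;
    # /e evaluates the replacement as an expression: a call to the sub
    $s =~ s/$pattern/$build->($1)/ge;
}

print "$s\n";
```

This keeps the table-driven layout while the replacement logic stays ordinary compiled Perl code rather than a string passed through eval.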

I think j_random_hacker's second solution is vastly superior to mine. Individual subroutines give you the most flexibility and are an order of magnitude faster than my /ee solution:

bar <bar> baz bar <bar> baz
bar <bar> baz bar <bar> baz
         Rate refs subs
refs  10288/s   -- -91%
subs 111348/s 982%   --

Here is the code that produces those numbers:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

my @subs = (
    sub { $_[0] =~ s/(bar)/<$1>/g },
    sub { $_[0] =~ s/foo/bar/g },
);

my @refs = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my %subs = (
    subs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $sub (@subs) {
            $sub->($s);
        }
        return $s;
    },
    refs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $ref (@refs) {
            $s =~ s/$ref->[0]/$ref->[1]/gee;
        }
        return $s;
    }
);

for my $sub (keys %subs) {
    print $subs{$sub}(), "\n";
}

Benchmark::cmpthese -1, \%subs;

You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.

A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN which I can recommend if you can specify what you are trying to do rather than how.
