
Can Perl substitution operator match an element in an array?

I have an array like this:

my @stopWords = ("and","this",....)

My text is in this variable:

my $wholeText = "....and so this is...."

I want to match every occurrence of every element of my stopWords array in the scalar wholeText and replace it with a space.

One way of doing this is as follows:

foreach my $stopW (@stopWords)
{
   # /g is needed to replace every occurrence, not just the first one
   $wholeText =~ s/$stopW/ /g;
}

This works and replaces every occurrence of all the stop words. I was just wondering if there is a shorter way of doing it.

Like this:

$wholeText =~ s/@stopWords/ /;

The above does not seem to work, though.
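A likely reason the one-liner fails: inside a regex, an array interpolates the same way it does in a double-quoted string, with its elements joined by the list separator `$"`, which is a single space by default. The pattern therefore becomes the literal phrase "and this" instead of a choice between words. The sketch below illustrates this and, as one possible workaround, localizes `$"` to `|`. The sample words and text are made up for illustration, and the `\b` anchors are an addition.

```perl
use strict;
use warnings;

my @stopWords = ("and", "this");    # sample list, assumed for illustration
my $wholeText = "salt and pepper, and this is it";

# The array is interpolated with the list separator (a space by default),
# so this pattern is the literal string "and this".
my $naive = $wholeText;
$naive =~ s/@stopWords/ /g;

# Localizing the list separator to '|' turns the interpolation
# into an alternation: and|this
my $fixed = $wholeText;
{
    local $" = '|';
    $fixed =~ s/\b(?:@stopWords)\b/ /g;
}

print "$naive\n";
print "$fixed\n";
```

In the first result only the exact phrase "and this" is replaced, while the earlier standalone "and" survives; in the second, every standalone stop word is replaced.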

While the various map/for-based solutions will work, they also run a separate regex pass over your string for each and every stopword. That is no big deal in the example given, but it can cause major performance issues as the target text and stopword list grow.

Jonathan Leffler and Robert P are on the right track with their suggestions of mashing all the stopwords together into a single regex, but a simple join of all the stopwords into a single alternation is a crude approach and, again, becomes inefficient if the stopword list is long.

Enter Regexp::Assemble, which will build you a much 'smarter' regex to handle all the matches at once. I've used it to good effect with lists of up to 1700 or so words to be checked against:

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;

use Regexp::Assemble;

my @stopwords = qw( and the this that a an in to );

my $whole_text = <<EOT;
Fourscore and seven years ago our fathers brought forth
on this continent a new nation, conceived in liberty, and
dedicated to the proposition that all men are created equal.
EOT

my $ra = Regexp::Assemble->new(anchor_word_begin => 1, anchor_word_end => 1);
$ra->add(@stopwords);
say $ra->as_string;

say '---';

my $re = $ra->re;
$whole_text =~ s/$re//g;
say $whole_text;

Which outputs:

\b(?:t(?:h(?:at|is|e)|o)|a(?:nd?)?|in)\b
---
Fourscore  seven years ago our fathers brought forth
on  continent  new nation, conceived  liberty, 
dedicated   proposition  all men are created equal.

My best solution:

$wholeText =~ s/$_//g for @stopWords;

You might want to sharpen the regexp using some \b and whitespace.
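To illustrate what that sharpening could look like, here is a minimal sketch comparing the bare pattern with one anchored by `\b` and followed by optional whitespace. The sample list and text are assumptions made up for the example.

```perl
use strict;
use warnings;

my @stopWords = qw(and the);    # sample list, assumed for illustration
my $text      = "a thousand candles and the band";

# Bare pattern: also removes "and" from inside "thousand", "candles", "band".
my $loose = $text;
$loose =~ s/$_//g for @stopWords;

# Whole words only, and swallow the whitespace that follows each one.
my $sharp = $text;
$sharp =~ s/\b$_\b\s*//g for @stopWords;

print "$loose\n";
print "$sharp\n";    # prints "a thousand candles band"
```

Consuming the trailing whitespace avoids the runs of double spaces that a plain deletion leaves behind.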

What about:

my $qrstring = '\b(' . (join '|', @stopWords) . ')\b';
my $qr = qr/$qrstring/;
$wholeText =~ s/$qr/ /g;

Concatenate all the words to form '\b(and|the|it|...)\b' (the parentheses around the join are necessary to give it a list context; without them, you end up with the count of the number of words). The '\b' metacharacters mark word boundaries, and therefore prevent you from changing 'thousand' into 'thous'. Convert that into a quoted regular expression; apply it globally to your subject string (so that all occurrences of all stop words are removed in a single operation).
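The point about the parentheses around the join can be checked with a short sketch. The three-word list is a sample chosen for illustration.

```perl
use strict;
use warnings;

my @stopWords = qw(and the it);    # sample list, assumed for illustration

# With parentheses: join receives the whole array.
my $with = '\b(' . (join '|', @stopWords) . ')\b';

# Without them, join's argument list swallows the trailing concatenation,
# so @stopWords is evaluated in scalar context and yields its element count.
my $without = '\b(' . join '|', @stopWords . ')\b';

print "$with\n";       # prints \b(and|the|it)\b
print "$without\n";    # prints \b(3)\b
```

Writing the call as `join('|', @stopWords)` has the same effect as wrapping the whole expression in parentheses.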

You can also do without the variable '$qr':

my $qrstring = '\b(' . (join '|', @stopWords) . ')\b';
$wholeText =~ s/$qrstring/ /g;

I don't think I'd care to maintain the code of anyone who managed to do without the variable '$qrstring'; it probably can be done, but I don't think it would be very readable.

My paranoid version:

$wholeText =~ s/\b\Q$_\E\b/ /gi for @stopWords;

Use \b to match word boundaries, and \Q..\E just in case any of your stopwords contains characters which may be interpreted as "special" by the regex engine.
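A small sketch of why the quoting matters, using a hypothetical stopword that contains a dot. The stopword and the sample sentence are assumptions made up for this example.

```perl
use strict;
use warnings;

my @stopWords = ("a.m");    # hypothetical stopword containing a regex metacharacter
my $text      = "We met at 9 a.m near the arm of the lake";

# Unquoted: "." matches any character, so "arm" is replaced as well.
my $unquoted = $text;
$unquoted =~ s/\b$_\b/ /gi for @stopWords;

# Quoted: \Q..\E makes the "." match only a literal dot.
my $quoted = $text;
$quoted =~ s/\b\Q$_\E\b/ /gi for @stopWords;

print "$unquoted\n";
print "$quoted\n";
```

Note that `\b` only helps when the stopword starts and ends with a word character; a stopword ending in punctuation would need a different kind of anchor.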

You could consider using join to build a single regex from all the stopwords.

my $regex_str = join '|', map { quotemeta } @stopwords;
$string =~ s/$regex_str/ /g;   # s/// is required here; a bare /.../ is only a match

Note that the quotemeta part just makes sure that any regex characters are properly escaped.
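One way to package this idea is a small helper that returns a compiled regex. The sketch below is only an illustration: the name `stopword_regex` is hypothetical, and the word-boundary anchors, the non-capturing group, and the `/i` flag are additions not present in the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: build one compiled regex from a list of literal words.
sub stopword_regex {
    my @words = @_;
    # Escape each word so that metacharacters are matched literally.
    my $alternation = join '|', map { quotemeta($_) } @words;
    return qr/\b(?:$alternation)\b/i;
}

my @stopwords = qw(and this is);    # sample list, assumed for illustration
my $string    = "And so this is a test";
my $regex     = stopword_regex(@stopwords);

my $cleaned = $string;
$cleaned =~ s/$regex/ /g;           # one pass over the text for all words
print "$cleaned\n";
```

Compiling with `qr//` once lets the same pattern be reused across many strings without rebuilding the alternation each time.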

grep{$wholeText =~ s/\b$_\b/ /g}@stopWords;
