
Count overlapping regex matches in Perl OR Ruby

This is a follow-up to that question. I've learned that finding overlapping regex matches in Python is not straightforward, so I decided to ask separately how Perl and Ruby handle this task.

I'd like to count the number of all possible matches of a regex against a given string. By "all" I mean that the result should take into account both overlapping and non-unique matches, so the same substring is counted once for every distinct way the regex can match it. Here are some examples:

  • a.*k should match twice in "akka"
  • "bbboob" tested against b.*o.*b should yield 6
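To make these expected counts concrete, here is a brute-force Ruby sketch. It is not a regex engine: the helper name all_matches is hypothetical, and it only handles patterns shaped like the two examples above, that is, single literal characters joined by .* (it also assumes the string contains no newlines, which . would not match by default). It lists one substring for every combination of positions at which the literals can match.

```ruby
# Hypothetical helper (not part of any library): enumerates every way a
# pattern of the form X.*Y.*Z can match, where X, Y, Z are single literal
# characters passed in as the array `chars`.
def all_matches(str, chars)
  results = []
  walk = lambda do |from, depth, first|
    if depth == chars.length
      # `from - 1` is the index of the last literal that was matched
      results << str[first..from - 1]
      return
    end
    (from...str.length).each do |i|
      next unless str[i] == chars[depth]
      # remember where the first literal matched; that is the match start
      walk.call(i + 1, depth + 1, depth.zero? ? i : first)
    end
  end
  walk.call(0, 0, nil)
  results
end

p all_matches("akka", %w[a k])               # => ["ak", "akk"]
puts all_matches("bbboob", %w[b o b]).length # => 6
```

For "bbboob" the six results are "bbboob", "bboob" and "boob", each listed twice, because either "o" can serve as the middle literal. That duplication is what "non-unique" means here.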

As a reference, here's a Perl one-liner suggested by tchrist; it outputs the correct matches and their count:

() = "bbboobb" =~ /(b.*o.*b)(?{push @all, $1})(*FAIL)/g; printf "got %d matches: %s\n", scalar(@all), "@all";

The only problem is that it uses too many resources for test cases where the resulting match count is on the order of millions or more. I understand this is because all the matches are first collected and only counted afterwards. I'm looking for a resource-efficient solution that returns only the count.

It looks like tchrist has done all the hard work. If storing the matches and counting them afterwards uses too many resources, you can change the regex-embedded code so that it only increments a counter:

my $count = 0;

"bbboobb" =~ /(b.*o.*b)(?{$count++})(*FAIL)/g;

print "got $count matches\n";
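Ruby regexes have no embedded code blocks, so the same trick does not carry over directly. As a rough sketch for the Ruby side, and only for patterns shaped like the ones in the question (single literal characters joined by .*, on input without newlines), the count can be computed with a small dynamic-programming pass instead of the regex engine. This is a swapped-in technique rather than a translation of the Perl code, and the name count_overlapping is made up for illustration.

```ruby
# Hypothetical helper: counts the ways a pattern like X.*Y.*Z can match,
# where `chars` holds the single literal characters X, Y, Z in order.
# No match is ever stored; memory use depends only on chars.length.
def count_overlapping(str, chars)
  # ways[d] = number of ways the first d literals have been matched so far
  ways = Array.new(chars.length + 1, 0)
  ways[0] = 1
  str.each_char do |c|
    # walk right to left so one character is never used for two literals
    chars.length.downto(1) do |d|
      ways[d] += ways[d - 1] if chars[d - 1] == c
    end
  end
  ways[chars.length]
end

puts count_overlapping("akka", %w[a k])      # => 2
puts count_overlapping("bbboob", %w[b o b])  # => 6
```

This makes a single pass over the string, and because Ruby integers grow as needed, counts in the millions or beyond do not overflow. It does not generalise to arbitrary regexes; for those, the counting approach above in Perl remains the general answer.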
