Comparing the speed of non-matching regular expressions

Question

The following Python code is incredibly slow:

import re
re.match( '([a]+)+c', 'a' * 30 + 'b' )

and it gets worse if you replace 30 with a larger constant.
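A rough way to see the growth is to time the failed match for a few input lengths. This is only an illustrative sketch: the helper name time_failed_match and the chosen lengths are assumptions, and absolute timings depend on the machine and the Python version.

```python
import re
import time

PATTERN = re.compile(r'([a]+)+c')  # the pattern from the question


def time_failed_match(n):
    """Return (match result, elapsed seconds) for matching 'a' * n + 'b'."""
    subject = 'a' * n + 'b'
    start = time.perf_counter()
    result = PATTERN.match(subject)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    # Lengths are kept small on purpose: on an engine that backtracks
    # as described in this thread, each extra 'a' can roughly double the time.
    for n in (14, 16, 18, 20):
        result, seconds = time_failed_match(n)
        print(f'n={n:2d}  matched={result is not None}  {seconds:.4f}s')
```

If the reported times grow by a roughly constant factor per added character, the cost is exponential in the length of the run rather than linear.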

I suspect that the parsing ambiguity caused by the consecutive + is the culprit, but I'm no expert in regexp parsing and matching. Is this a bug in the Python regexp engine, or would any reasonable implementation do the same?

I'm not a Perl expert, but the following returns quite fast:

perl -e '$s="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"; print "ok\n" if $s =~ m/([a]+)+c/;'

and increasing the number of 'a' does not substantially alter the execution speed.

Answer 1

I assume that Perl is clever enough to collapse the two +s into one, while Python is not. Now let's imagine what the engine does if this is not optimized away. And remember that capturing is generally expensive. Note also that both +s are greedy, so the engine will try to use as many repetitions as possible in one backtracking step. Each bullet point represents one backtracking step:

The engine uses as many [a] as possible and consumes all thirty a's. Then it cannot go any further, so it leaves the first repetition and captures 30 a's. Now the next repetition is on and it tries to consume some more with another ([a]+), but that doesn't work of course. And then the c fails to match b.
Backtrack! Throw away the last a consumed by the inner repetition. After this we leave the inner repetition again, so the engine will capture 29 a's. Then the other + kicks in, and the inner repetition is tried again (consuming the 30th a). Then we leave the inner repetition once again, which also leaves the capturing group, so the first capture is thrown away and the engine captures the last a. c fails to match b.
Backtrack! Throw away another a inside. We capture 28 a's. The second (outer) repetition of the capturing group consumes the last 2 a's, which are captured. c fails to match b.
Backtrack! Now we can backtrack in the second repetition and throw away the second of the two a's. The one that is left will be captured. Then we enter the capturing group for the third time and capture the last a. c fails to match b.
Backtrack! Down to 27 a's in the first repetition.

Here is a simple visualization. Each line represents one backtracking step, and each set of parentheses shows one consumption of the inner repetition. The curly brackets represent those that are newly captured for that step of backtracking, while normal parentheses are not revisited in this particular backtracking step. And I leave out the b / c because it will never be matched:

{aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
{aaaaaaaaaaaaaaaaaaaaaaaaaaaaa}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaaaa}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaaa){a}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaaa}{aaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa){aa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa){a}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a){a}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaa}{aaaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){aaa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){aa}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa){a}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){a}{aaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a){aa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a){a}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a)(a){a}{a}

And. so. on.
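The order of the lines in the visualization can be reproduced with a short generator. This is a sketch of the enumeration only, not of how any real engine is implemented: the names splits and render are made up for illustration, and the output uses plain parentheses throughout instead of distinguishing newly captured groups with curly brackets.

```python
def splits(n):
    """Yield every way to cut a run of n 'a's into consecutive non-empty groups.

    Larger first groups come first, mirroring a greedy match that is then
    backtracked one character at a time.
    """
    if n == 0:
        yield []
        return
    for first in range(n, 0, -1):
        for rest in splits(n - first):
            yield [first] + rest


def render(groups):
    """Draw one split, e.g. [2, 1] becomes '(aa)(a)'."""
    return ''.join('(' + 'a' * size + ')' for size in groups)


if __name__ == '__main__':
    for groups in splits(4):
        print(render(groups))
    # Count the full splits for a few run lengths.
    for n in (4, 8, 16):
        print(n, sum(1 for _ in splits(n)))
```

A run of n characters has 2**(n - 1) such splits, so for the 30 a's in the question there are 2**29 of them, which is why the failed match takes so long.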

Note that in the end the engine will also try all combinations for subsets of the a's (backtracking just through the first 29 a's, then through the first 28 a's, and so on) just to discover that c does not match a either.

The explanation of regex engine internals is based on information scattered around regular-expressions.info.

To solve this, simply remove one of the +s: either r'a+c' or, if you do want to capture the run of a's, r'(a+)c'.
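As a quick check, both single-quantifier patterns can be tried on a much longer input. The capturing variant is written here as r'(a+)c', on the assumption that the trailing literal is meant to be the same c as in the question's pattern; the input length is an arbitrary choice.

```python
import re

subject = 'a' * 100_000 + 'b'  # far longer than the 30 'a's in the question

# Without the nested quantifier, a failed match only has to give back
# one 'a' at a time, so the work grows linearly with the input.
assert re.match(r'a+c', subject) is None
assert re.match(r'(a+)c', subject) is None

# When the string does end in 'c', group 1 holds the whole run of 'a's.
m = re.match(r'(a+)c', 'aaaac')
print(len(m.group(1)))  # prints 4
```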

Finally, to answer your question: I would not consider this a bug in Python's regex engine, but only (if anything) a lack of optimization logic. This problem is not generally solvable, so it is not too unreasonable for an engine to assume that you have to take care of catastrophic backtracking yourself. If Perl is clever enough to recognize sufficiently simple cases of it, so much the better.

Answer 2

Rewrite your regular expression to eliminate the "catastrophic backtracking" by removing the nested quantifiers (see this question):

re.match( '([a]+)+c', 'a' * 30 + 'b' )
# becomes
re.match( 'a+c', 'a' * 30 + 'b' )
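One way to gain some confidence that the rewrite accepts the same strings is to compare both patterns exhaustively on short inputs. This is a sanity check under stated assumptions, not a proof: the helper same_verdict, the alphabet and the length limit are illustrative choices, and only the match/no-match verdict is compared, since the rewritten pattern no longer has a capturing group.

```python
import re
from itertools import product

nested = re.compile(r'([a]+)+c')  # original pattern with nested quantifiers
flat = re.compile(r'a+c')         # rewritten pattern


def same_verdict(subject):
    """True if both patterns agree on whether subject matches at position 0."""
    return (nested.match(subject) is None) == (flat.match(subject) is None)


# Every string over {a, b, c} up to length 8: short enough that the
# nested pattern stays fast even on its worst inputs.
subjects = (''.join(chars)
            for n in range(9)
            for chars in product('abc', repeat=n))
assert all(same_verdict(s) for s in subjects)
```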

Answer 1
13 2012-11-01 14:42:12

Answer 2
4 2012-11-01 14:41:42
