
Very slow regular expression search

I'm not sure I completely understand what is going on with the following regular expression search:

>>> import re
>>> template = re.compile(r"(\w+)+\.")
>>> target = "a" * 30
>>> template.search(target)

The search() call takes minutes to complete, and CPU usage goes to 100%. The behavior is reproducible on both Python 2.7.5 and 3.3.3.

Interestingly, if the string is fewer than 20-25 characters long, search() returns almost instantly.

What is happening?

Understanding this problem requires understanding how the NFA underlying the regex engine works.

A full definition of an NFA may be too heavy a mission for me; searching for NFA on Wikipedia will give you a better explanation. Here, just think of the NFA as a robot that finds the pattern you give it.

A crudely implemented NFA is somewhat dumb: it only looks ahead one or two of the tokens you give it. So in your synthetic example, the NFA first looks at \w+ (ignoring the parentheses used for grouping).

Because + is a greedy quantifier, i.e. it matches as many characters as possible, the NFA dumbly keeps consuming characters from target. After 30 a's, the NFA reaches the end of the string. Only then does it realize that it still needs to match the remaining tokens in template. The next token is +, which has already been satisfied, so it proceeds to \. -- and this time it fails.

What the NFA does next is step one character back, trying to match the pattern by truncating the submatch of \w+. So the NFA splits target into two groups: 29 a's for one \w+, and one trailing a. It first tries to consume the trailing a by matching it against the second +, but it still fails when it meets \.. The NFA continues this process until it gets a full match; otherwise it tries all possible partitions.

So (\\w+)+\\. 所以(\\w+)+\\. instructs NFA to group target in such manner: target is partitioned into one or more groups, at least one character per group, and target is end with a period '.'. 指示NFA以这种方式对target进行分组:目标被划分为一个或多个组,每组至少一个字符,目标以句点'。'结束。 As long as the period is not matched. 只要期间不匹配。 NFA tries all partitions possible. NFA尝试所有分区。 So how many partitions are there? 那么有多少分区? 2^n, the exponential of 2. (JUst think inserting separator between a ). 2 ^ n,指数为2.(JUst认为在a之间插入分隔符)。 Like below 如下

aaaaaaa a
aaaaaa aa
aaaaaa a a
...
aa a a ... a
a a a a ... a
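The "inserting separators" argument can be checked with a short enumeration. The helper `compositions` below is my own illustration, not part of the answer; it counts the ordered ways to split a string of n characters into non-empty groups, which is exactly the set of partitions (\w+)+ can try:

```python
def compositions(n):
    """Yield every way to write n as an ordered sum of positive integers.

    Each composition corresponds to one way of splitting an n-character
    string into non-empty groups -- one candidate partition for (\\w+)+.
    """
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest

# Each of the n-1 gaps between characters either holds a separator or not,
# so there are 2**(n-1) partitions in total.
n = 10
count = sum(1 for _ in compositions(n))
print(count)  # 512 == 2**(10 - 1)
```

For the 30-character target in the question that is 2^29, roughly half a billion partitions, which is why the search takes minutes.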

If the NFA does match \., it won't hurt much. But when it fails to match, this expression is doomed to a never-ending, exponential search.
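The contrast is easy to verify: give the same pathological pattern a target that does end with a period, and the first greedy attempt succeeds immediately, so no backtracking ever happens (a quick sketch):

```python
import re
import time

template = re.compile(r"(\w+)+\.")

# With a trailing '.', the first greedy attempt matches outright:
# \w+ consumes all the a's, the second + is satisfied, and \. matches.
start = time.perf_counter()
match = template.search("a" * 30 + ".")
elapsed = time.perf_counter() - start

print(match.group(0))  # 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.'
print(elapsed)         # a tiny fraction of a second
```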

I'm not advertising, but Mastering Regular Expressions is a good book for understanding the mechanisms behind regexes.

The slowness is caused by backtracking in the engine:

(\w+)+\.

Backtracking will naturally occur with this pattern if there is no . at the end of your string. The engine first attempts to match as many \w as possible, then backtracks when it finds that more characters need to be matched before the end of your string.

(a x 30) .
(a x 29) .
...
(a) .

Finally it will fail to match. However, the second + in your pattern causes the engine to inspect exponentially many possible paths, so:

(a x 58) (a) .
(a x 57) (a) (a) .
(a x 57) (a x 2) .
...
(a) (a) (a) (a) (a) (a) (a) ...

Removing the extra + will prevent this abnormal amount of backtracking:

(\w+)\.
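As a quick sanity check (my own sketch, using a much longer target than the question's): without the nested +, there is only one \w+ run to shrink at each starting position, so the failure is reported quickly instead of exponentially:

```python
import re
import time

fixed = re.compile(r"(\w+)\.")

# 1000 characters and no '.' anywhere: the pathological pattern would
# effectively never finish, but the simplified one fails fast.
start = time.perf_counter()
result = fixed.search("a" * 1000)
elapsed = time.perf_counter() - start

print(result)   # None
print(elapsed)  # well under a second
```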

Some implementations also support possessive quantifiers, which might be more ideal in this particular scenario:

(\w++)\.
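Note that Python's built-in re module only gained possessive quantifiers in Python 3.11; on earlier versions re.compile raises an error for \w++, and you would need the third-party regex module instead. A version-guarded sketch:

```python
import re
import sys

# Possessive quantifiers were added to the stdlib re module in Python 3.11.
if sys.version_info >= (3, 11):
    possessive = re.compile(r"(\w++)\.")
    # \w++ never gives characters back, so when \. fails there is
    # nothing to backtrack into -- the overall failure is immediate.
    print(possessive.search("a" * 1000))  # None
else:
    print("possessive quantifiers need Python 3.11+ (or the regex module)")
```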

The second plus is causing the issue:

template = re.compile(r"(\w+)\.")

works fine for me. To see the parse tree for the regex, pass re.DEBUG as the second argument to compile:

import re

re.compile(r"(\w+)+\.", re.DEBUG)
print("\n\n")
re.compile(r"(\w+)\.", re.DEBUG)


max_repeat 1 65535
  subpattern 1
    max_repeat 1 65535
      in
        category category_word
literal 46


subpattern 1
  max_repeat 1 65535
    in
      category category_word
literal 46

Process finished with exit code 0

That shows that the second plus adds an extra repetition loop, which the Python regex parser caps at 65535 iterations. That somewhat proves my theory.

Note that to run this you will want a fresh Python interpreter for each execution. re.compile memoizes the patterns passed in, so it will not re-compile the same regex twice; repeatedly running this in IPython, for instance, does not print the parse tree after the first run.
