Why does my Python regular expression pattern run so slowly?
Please see my regular expression pattern code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
print 'Start'
str1 = 'abcdefgasdsdfswossdfasdaef'
m = re.match(r"([A-Za-z\-\s\:\.]+)+(\d+)\w+", str1) # Want to match something like 'Moto 360x'
print m # None is expected.
print 'Done'
It takes 49 seconds to finish. Is there a problem with the pattern?
See Runaway Regular Expressions: Catastrophic Backtracking.
In brief, if there are very many ways a substring can be split among the parts of the regex, the regex matcher may end up trying them all when no match exists.
Constructs like (x+)+ and x+x+ practically guarantee this behaviour.
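To see the blow-up without waiting 49 seconds, here is a rough timing sketch that runs the question's pattern against growing prefixes of the question's string. It is written for Python 3 (the question uses Python 2), and the helper name `time_failed_match` and the chosen prefix lengths are my own. Absolute numbers depend on the machine, but the time should grow sharply with each step.

```python
import re
import time

# Pattern and input copied from the question.
PATTERN = re.compile(r"([A-Za-z\-\s\:\.]+)+(\d+)\w+")
TEXT = "abcdefgasdsdfswossdfasdaef"


def time_failed_match(length):
    """Return the seconds spent failing to match the first `length` characters.

    Hypothetical helper, named here only for illustration.
    """
    start = time.perf_counter()
    result = PATTERN.match(TEXT[:length])
    elapsed = time.perf_counter() - start
    assert result is None  # the input has no digit, so no match is possible
    return elapsed


# Short prefixes only, so the whole script finishes quickly.
timings = {n: time_failed_match(n) for n in range(10, 19, 2)}
for n, seconds in timings.items():
    print(f"{n:2d} chars: {seconds:.4f} s")
```

Each extra character roughly doubles the work, which is how a 26-character input ends up taking tens of seconds.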
To detect and fix the problematic constructs, the following concept can be used:
At the conceptual level, the presence of a problematic construct means that your regex is ambiguous, i.e. if you disregard greedy/lazy behaviour, there is no single "correct" split of some text into the parts of the regex (or, equivalently, a subexpression thereof). So, to avoid or fix the problems, you need to find and eliminate all ambiguities.
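To put a number on that ambiguity, here is a small sketch that counts how many ways a run of n characters can be cut into one or more non-empty pieces, which is the freedom a construct like (x+)+ gives the matcher. The function `count_splits` is my own illustration, not anything from the `re` module, and the count is only an indication of the search space, not an exact count of the steps the matcher takes.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into non-empty pieces."""
    if n == 0:
        return 1  # nothing left to cut: exactly one way to finish
    # Choose the length of the first piece, then split the remainder.
    return sum(count_splits(n - first) for first in range(1, n + 1))


for n in (5, 10, 26):
    print(n, count_splits(n))
# 5 16
# 10 512
# 26 33554432
```

The count is 2 ** (n - 1), so the 26-letter string from the question can be split in over 33 million ways, and a failing match may have to rule them all out.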
One way to do this is to
Just reposting the answer and solution from the comments by nhahtdh and Marc B:
([A-Za-z\-\s\:\.]+)+ --> [A-Za-z\-\s\:\.]+
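For reference, a quick check of the simplified pattern, written for Python 3 (the question's script is Python 2). One thing to be aware of: dropping the outer group also removes its capture, so I am assuming here that the text before the digits did not need capturing, which makes `(\d+)` group 1. The name `FIXED` is mine.

```python
import re

# The question's pattern with the nested quantifier removed.
FIXED = re.compile(r"[A-Za-z\-\s\:\.]+(\d+)\w+")

# The original failing input now fails fast instead of taking many seconds.
print(FIXED.match("abcdefgasdsdfswossdfasdaef"))  # None

# The kind of input the question wanted to match.
m = FIXED.match("Moto 360x")
print(m.group(0), m.group(1))  # Moto 360x 360
```

With a single quantifier there is only one way to split the letters, so a failed match just backs off one character at a time.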
Thanks so much to nhahtdh and Marc B!