regular expression goes into infinite loop
I am parsing (species) names of the form:
Parus Ater
H. sapiens
T. rex
Tyr. rex
which normally have two terms (binomial) but sometimes have 3 or more.
Troglodytes troglodytes troglodytes
E. rubecula sensu stricto
I wrote
[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*
which worked most of the time but occasionally went into an infinite loop. It took some time to track down that the problem was in the regex matching, and then I realised it was a typo and I should have written
[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s+[a-z]+)*
which performs properly.
My questions are:
[Note: I don't need a more general expression for species - there is a formal 100+ line regex specification for Species names - this was just an initial filter].
NOTE: The problem arose because, although most names were extracted precisely into 2 or occasionally 3/4 terms (as they were in italics), there were a few false positives (like "Homo sapiens lives in big cities like London") and the match fails at "L".
NOTE: In debugging this I have found that the regex was often completing but being very slow (e.g. on shorter target strings). It is valuable that I found this bug through a pathological case. I have learnt an important lesson!
To address the first part of your question, you should read up on catastrophic backtracking. Essentially, what is happening is that there are too many ways to match your regular expression against your string, and the engine is continually backtracking to try to make it work.
In your case, it was probably the nested repetition: (\s*[a-z]+)*
which likely caused some very, very strange loops. As Qtax has adeptly pointed out, it's hard to tell without more information.
The second part of your question is, unfortunately, impossible to answer. It's basically the Halting problem. Since regular expressions are essentially a finite state machine whose input is a string, you cannot create a general solution which predicts which regular expressions will backtrack catastrophically and which will not.
As for some tips for making your regular expressions run faster? That's a big can of worms. I've spent a lot of time studying regular expressions on my own, and some time optimizing them, and here's what I've found generally helps:
^
for the beginning of the string. See also: Word Boundaries
Hope this helps you. Good luck.
For the first regex:
[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*
The catastrophic backtracking happens due to (\s*[a-z]+)* as pointed out in the comment. However, it only holds true if you are validating the string with String.matches(), since this is the only case where encountering an invalid character causes the engine to try to backtrack, rather than returning a match (as in a Matcher loop).
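A minimal sketch of that difference, assuming the regex from the question with the \s* typo left in. The class and method names (MatchVsFind, firstMatch, wholeMatch) are made up for illustration. find() can stop as soon as it has a match, so it simply returns the lowercase prefix; matches() must consume the whole input, and it is that requirement that forces the engine to backtrack when it hits an invalid character. The inputs passed to wholeMatch are kept very short on purpose so the backtracking stays cheap.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchVsFind {
    // The regex from the question, with the \s* typo kept deliberately.
    static final Pattern TYPO =
            Pattern.compile("[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s*[a-z]+)*");

    // Matcher-loop style: return the first match found in the input, or null.
    static String firstMatch(String input) {
        Matcher m = TYPO.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Validation style, same semantics as input.matches(regex):
    // the entire input has to be covered by the pattern.
    static boolean wholeMatch(String input) {
        return TYPO.matcher(input).matches();
    }

    public static void main(String[] args) {
        // find() stops before "London" and returns what it has.
        System.out.println(firstMatch("Homo sapiens lives in big cities like London"));
        // Short inputs only: whole-string validation of long invalid input is the slow case.
        System.out.println(wholeMatch("T. rex"));
        System.out.println(wholeMatch("T. rex!"));
    }
}
```

Calling wholeMatch on the long "Homo sapiens lives in ..." sentence is the case that can become very slow, which is why it is not done here.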
Let us match an invalid string against (\s*[a-z]+)*:
inputstringinputstring;
(Repetition 1)
\s*=(empty)
[a-z]+=inputstringinputstring
FAILED
Backtrack [a-z]+=inputstringinputstrin
(Repetition 2)
\s*=(empty)
[a-z]+=g
FAILED
(End repetition 2 since all choices are exhausted)
Backtrack [a-z]+=inputstringinputstri
(Repetition 2)
\s*=(empty)
[a-z]+=ng
FAILED
Backtrack [a-z]+=n
(Repetition 3)
\s*=(empty)
[a-z]+=g
FAILED
(End repetition 3 since all choices are exhausted)
(End repetition 2 since all choices are exhausted)
Backtrack [a-z]+=inputstringinputstr
By now, you should have noticed the problem. Let us define T(n) as the amount of work needed to check that a string of length n does not match the pattern. From the method of backtracking, we know T(n) = Sum [i = 0..(n-1)] T(i). From that, we can derive T(n + 1) = 2T(n), which means that the backtracking process is exponential in time complexity!
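The doubling follows because T(n + 1) = T(n) + Sum [i = 0..(n-1)] T(i) = T(n) + T(n) = 2T(n). As a rough sketch, the recurrence can also be evaluated directly. The base cost T(0) = 1 is an assumption made here for illustration (the answer does not state one), and the class name BacktrackCost is hypothetical.

```java
final class BacktrackCost {
    // Evaluates T(n) = T(0) + T(1) + ... + T(n-1), assuming T(0) = 1.
    static long cost(int n) {
        long[] t = new long[n + 1];
        t[0] = 1; // assumed base cost for the empty string
        for (int k = 1; k <= n; k++) {
            long sum = 0;
            for (int i = 0; i < k; i++) {
                sum += t[i];
            }
            t[k] = sum;
        }
        return t[n];
    }

    public static void main(String[] args) {
        // Each extra character doubles the work once n >= 1.
        for (int n = 1; n <= 10; n++) {
            System.out.println("T(" + n + ") = " + cost(n));
        }
    }
}
```

Under that assumption T(n) = 2^(n-1) for n >= 1, so a run of 21 lowercase letters already costs about a million units of work.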
Changing * to + avoids the problem completely, since an instance of repetition can only start at the boundary between a space character and an English alphabet character. In contrast, the first regex allows an instance of repetition to start in between any 2 alphabet characters.
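As a quick check, here is a small sketch using the corrected regex from the question for whole-string validation. The class and method names (FixedSpeciesFilter, isCandidateName) are hypothetical. Because every repetition must begin with at least one whitespace character, the false-positive sentence from the question is rejected without the explosion of alternatives.

```java
import java.util.regex.Pattern;

final class FixedSpeciesFilter {
    // Corrected regex from the question: \s+ instead of \s* inside the group.
    static final Pattern FIXED =
            Pattern.compile("[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s+[a-z]+)*");

    // Whole-string validation, same semantics as String.matches().
    static boolean isCandidateName(String input) {
        return FIXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCandidateName("H. sapiens"));
        System.out.println(isCandidateName("E. rubecula sensu stricto"));
        // The false positive from the question: fails at "L", and fails quickly.
        System.out.println(isCandidateName("Homo sapiens lives in big cities like London"));
    }
}
```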
To demonstrate this, (\s+[a-z]+\s*)* will give you backtracking hell when the invalid input string contains many words which are separated by multiple consecutive spaces, since it allows multiple places for a repetition instance to start. This follows the formula b^d, where b is the maximum number of consecutive spaces (minus 1) and d is the number of sequences of spaces. It is less severe than the first regex you have (it requires at least one English alphabet character and one space character per repetition, as opposed to one English alphabet character per repetition in your first regex), but it is still a problem.
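A small sketch of how such an input could be built for experimenting. The class SpaceRunDemo and its methods are hypothetical helpers, and the sizes used are deliberately tiny so that validation returns immediately. Increasing spacesPerRun or words makes the rejection slower, in line with the growth described above.

```java
import java.util.regex.Pattern;

final class SpaceRunDemo {
    // The pattern discussed above: both \s+ and \s* can claim spaces between words.
    static final Pattern RISKY = Pattern.compile("(\\s+[a-z]+\\s*)*");

    // Builds `words` copies of (spacesPerRun spaces + "ab"), then appends '!'
    // so that the input as a whole cannot match.
    static String invalidInput(int spacesPerRun, int words) {
        StringBuilder sb = new StringBuilder();
        for (int w = 0; w < words; w++) {
            for (int s = 0; s < spacesPerRun; s++) {
                sb.append(' ');
            }
            sb.append("ab");
        }
        return sb.append('!').toString();
    }

    // Whole-string validation, same semantics as String.matches().
    static boolean wholeMatch(String input) {
        return RISKY.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Keep the sizes small; larger values take noticeably longer to reject.
        System.out.println(wholeMatch(invalidInput(3, 6)));
        System.out.println(wholeMatch("  ab  ab"));
    }
}
```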