
Regular expression goes into infinite loop

I am parsing (species) names of the form:

Parus Ater
H. sapiens
T. rex
Tyr. rex

which normally have two terms (binomial) but sometimes have 3 or more.

Troglodytes troglodytes troglodytes 
E. rubecula sensu stricto

I wrote

[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*

which worked most of the time but occasionally went into an infinite loop. It took some time to track down that the problem was in the regex matching, and then I realised it was a typo and I should have written

[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s+[a-z]+)*

which performs properly.

My questions are:

  • Why does this loop happen?
  • Is there a way I can check for similar regex errors before running the program? Otherwise it may be difficult to trap them before the program is distributed, and they may cause problems.

[Note: I don't need a more general expression for species. There is a formal 100+ line regex specification for species names; this was just an initial filter.]

NOTE: The problem arose because, although most names were extracted precisely into 2 or occasionally 3 or 4 terms (as they were in italics), there were a few false positives (like "Homo sapiens lives in big cities like London"), and the match fails at "L".
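As an illustration of the filter described above, here is a minimal Java sketch that applies the corrected pattern to some of the sample names and to the false positive. The class and method names (SpeciesFilterDemo, isName, leadingName) are made up for this example; only the pattern itself comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpeciesFilterDemo {
    // The corrected pattern from the question: \s+ (not \s*) inside the repeated group.
    static final Pattern SPECIES =
            Pattern.compile("[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s+[a-z]+)*");

    // True only if the whole candidate string has the shape of a name.
    static boolean isName(String candidate) {
        return SPECIES.matcher(candidate).matches();
    }

    // Returns the name-like prefix of the text, or null if the text does not start with one.
    static String leadingName(String text) {
        Matcher m = SPECIES.matcher(text);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "H. sapiens", "T. rex", "Tyr. rex",
            "Troglodytes troglodytes troglodytes",
            "E. rubecula sensu stricto",
            "Homo sapiens lives in big cities like London"
        };
        for (String s : samples) {
            System.out.println(isName(s) + "\t" + leadingName(s) + "\t<- " + s);
        }
    }
}
```

The whole-string check rejects the false positive because of the capital "L", while the prefix check still picks up "Homo sapiens lives in big cities like".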

NOTE: In debugging this I have found that the regex was often completing but was very slow (e.g. on shorter target strings). It is valuable that I found this bug through a pathological case. I have learnt an important lesson!

To address the first part of your question, you should read up on catastrophic backtracking. Essentially, there are too many ways to match your regular expression against your string, and the regex engine keeps backtracking to try to make one of them work.

In your case, it was probably the nested repetition (\s*[a-z]+)*, which likely caused some very strange loops. As Qtax has adeptly pointed out, it's hard to tell without more information.

The second part of your question is, unfortunately, impossible to answer in general. It's basically the Halting problem. Since a regular expression is essentially a finite state machine whose input is a string, you cannot create a general solution that predicts which regular expressions will backtrack catastrophically and which will not.
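Short of a general predictor, one pragmatic safeguard is to put a time budget on each match, so that a runaway pattern fails loudly in testing instead of hanging. The sketch below is only one way to do that and rests on an assumption: that the engine reads its input through CharSequence.charAt, so a wrapper can check a deadline there. DeadlineCharSequence, withBudget and matchesWithin are hypothetical names invented for this example, not JDK classes or methods.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of the JDK): input text that aborts matching after a deadline.
final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos; // absolute System.nanoTime() value

    private DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    static DeadlineCharSequence withBudget(CharSequence inner, long budgetMillis) {
        return new DeadlineCharSequence(inner, System.nanoTime() + budgetMillis * 1_000_000L);
    }

    @Override
    public char charAt(int index) {
        // Checked on every character read, so heavy backtracking trips the deadline.
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new IllegalStateException("regex time budget exceeded");
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // Keep the same absolute deadline for any sub-range.
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString() {
        return inner.toString();
    }

    // Whole-string match that throws IllegalStateException once the budget is used up.
    static boolean matchesWithin(Pattern pattern, String input, long budgetMillis) {
        return pattern.matcher(withBudget(input, budgetMillis)).matches();
    }
}
```

Calling matchesWithin from unit tests with deliberately invalid inputs (for example, a long run of lowercase letters followed by ";") gives a failure rather than a hang when a pattern backtracks badly. The clock check on every character adds overhead, so this is better suited to tests than to hot paths.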

As for tips on making your regular expressions run faster: that's a big can of worms. I've spent a lot of time studying regular expressions on my own, and some time optimizing them, and here's what I've found generally helps:

  1. Compile your regular expressions outside of your loops, if your language supports it.
  2. Whenever possible, add anchors when you know they're useful, especially ^ for the beginning of the string. See also: Word Boundaries.
  3. Avoid nested repetition like the plague. If you have to have it (which you will), do your best to provide hints to the engine to short-circuit any unintended backtracking.
  4. Take advantage of flavor constructs to speed things up. I'm partial to non-capturing groups and possessive quantifiers. They don't appear in every flavor, but when they do, you should use them. Also check out atomic groups.
  5. I always find this to be true: the longer your regular expression gets, the more trouble you're going to have making it efficient. Regular expressions are a great and powerful tool; they're like a super smart hammer. Don't fall into the trap of seeing everything as a nail. Sometimes the string function you're looking for is right under your nose.
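To make the tips above concrete, here is a small Java sketch, assuming Java is the language in use. It compiles the pattern once (tip 1), relies on matches() to anchor at both ends (tip 2), and uses a non-capturing group with possessive quantifiers (tips 3 and 4). The class name SpeciesFilterTuned and the method keepNames are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SpeciesFilterTuned {
    // Compiled once, outside any loop. (?:...) avoids capturing; the possessive
    // quantifiers \s++, [a-z]++ and *+ never give back what they have consumed.
    private static final Pattern SPECIES =
            Pattern.compile("[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(?:\\s++[a-z]++)*+");

    // Keeps only the candidates whose whole text has the shape of a name.
    static List<String> keepNames(List<String> candidates) {
        List<String> names = new ArrayList<>();
        for (String candidate : candidates) {
            // matches() requires the entire string to match, so no ^ or $ is needed.
            if (SPECIES.matcher(candidate).matches()) {
                names.add(candidate);
            }
        }
        return names;
    }
}
```

Possessive quantifiers can change what a pattern accepts when a later part of the pattern needs characters the quantifier has already taken. Here nothing follows the group, so the accepted strings should be the same as for the corrected pattern in the question.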

Hope this helps you. Good luck.

For the first regex:

[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*

The catastrophic backtracking happens due to (\s*[a-z]+)*, as pointed out in the comment. However, it only holds true if you are validating the string with String.matches(), since this is the only case where encountering an invalid character causes the engine to backtrack rather than return a match (as it would in a Matcher loop).
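The difference between the two modes can be seen in a short Java sketch. The input is deliberately short so that the whole-string check stays cheap even with the original pattern; MatchesVersusFind and its method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchesVersusFind {
    // The original pattern from the question, with \s* inside the repeated group.
    static final Pattern ORIGINAL =
            Pattern.compile("[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s*[a-z]+)*");

    // Matcher loop style: return the first match found anywhere in the text, or null.
    static String firstFind(String text) {
        Matcher m = ORIGINAL.matcher(text);
        return m.find() ? m.group() : null;
    }

    // Validation style: the entire text must match.
    static boolean wholeMatch(String text) {
        return text.matches(ORIGINAL.pattern());
    }

    public static void main(String[] args) {
        String text = "H. sapiens in London";
        System.out.println(firstFind(text));  // stops in front of the capital L
        System.out.println(wholeMatch(text)); // has to try alternatives before giving up
    }
}
```

With find(), the engine can report "H. sapiens in" as soon as the group can no longer continue. With matches(), stopping there is not acceptable, so the engine goes back and retries other ways of splitting the letters before it returns false.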

Let us match an invalid string against (\s*[a-z]+)*:

inputstringinputstring;

(Repetition 1)
\s*=(empty)
[a-z]+=inputstringinputstring
FAILED

Backtrack [a-z]+=inputstringinputstrin
(Repetition 2)
\s*=(empty)
[a-z]+=g
FAILED

(End repetition 2 since all choices are exhausted)
Backtrack [a-z]+=inputstringinputstri
(Repetition 2)
\s*=(empty)
[a-z]+=ng
FAILED

Backtrack [a-z]+=n
(Repetition 3)
\s*=(empty)
[a-z]+=g
FAILED

(End repetition 3 since all choices are exhausted)
(End repetition 2 since all choices are exhausted)
Backtrack [a-z]+=inputstringinputstr

By now, you should have noticed the problem. Let us define T(n) as the amount of work needed to establish that a string of length n does not match the pattern. From the method of backtracking, we know T(n) = Sum [i = 0..(n-1)] T(i). From that, we can derive T(n + 1) = T(n) + Sum [i = 0..(n-1)] T(i) = 2T(n), which means that the backtracking process is exponential in time complexity!
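The recurrence can be tabulated with a few lines of Java. This is a model of the cost formula only, not a measurement of any real regex engine, and it assumes a base case of T(0) = 1, which the text above does not state. BacktrackCostModel and cost are illustrative names.

```java
final class BacktrackCostModel {
    // Evaluates T(n) = T(0) + T(1) + ... + T(n-1), with T(0) = 1 as an assumed base case.
    static long cost(int n) {
        long[] t = new long[n + 1];
        t[0] = 1;
        long runningSum = 1; // holds T(0) + ... + T(i-1) at the start of iteration i
        for (int i = 1; i <= n; i++) {
            t[i] = runningSum;
            runningSum += t[i];
        }
        return t[n];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 22; n++) {
            System.out.println(n + " letters -> " + cost(n) + " units of work");
        }
    }
}
```

Under this model the cost doubles with every extra letter, so the 22 letters of inputstringinputstring come to 2^21 = 2097152 units of work. How closely a given engine follows the model depends on its implementation and version.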

Changing * to + in \s* avoids the problem completely, since an instance of the repetition can then only start at the boundary between a space character and an English alphabet character. In contrast, the first regex allows an instance of the repetition to start between any 2 alphabet characters.

To demonstrate this, (\s+[a-z]+\s*)* will give you backtracking hell when the invalid input string contains many words separated by multiple consecutive spaces, since it allows multiple places for a repetition instance to start. This follows the formula b^d, where b is the maximum number of consecutive spaces (minus 1) and d is the number of sequences of spaces. It is less severe than your first regex (it requires at least one English alphabet character and one space character per repetition, as opposed to one English alphabet character per repetition in your first regex), but it is still a problem.
