Why is this regex experiencing catastrophic backtracking?

I've read the articles and other questions on catastrophic backtracking in regular expressions and how it can be caused by nested + and * quantifiers. However, my regexes are still encountering catastrophic backtracking without nested quantifiers. Can someone help me understand why?

I wrote these regexes to search for a specific type of rhyme in lines of Welsh poetry. The rhyme consists of all the consonants at the beginning of the line being repeated at the end, and there must be a space between the beginning and end consonants. I've already removed all the vowels, but there are two exceptions that make these regexes ugly. First, consonants that don't repeat are allowed in the middle, and if there are any, it's a different type of rhyme. Second, the letters m, n, r, h, and v are allowed to interrupt the rhyme (appear in the beginning but not in the end, or vice versa), but they can't be ignored because sometimes the rhyme consists only of those letters.

My script automatically builds a regex for each line and tests it. It works the rest of the time, but this one line causes catastrophic backtracking. The line's text without vowels is:

nn  Frvvn  Frv v

The regex automatically finds that nn Frvvn rhymes with Frv v, so then it tries again with the last letter (the n in Frvvn) required at the end. If it's not required, then the rhyme can be shortened. Here's the regex:

^(?P<s_letters>         # starting letters
[mnrhv]*?\s*n{0,2}      # any number of optional letters or any number
                        # of spaces can come between rhyming letters
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*F{1,2}
[mnrhv]*?\s*[rR]?(?:\s*[rR])? # r can also rhyme with R, but that's
                              # not relevant here (I think)
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)
(?P<m_letters>          # middle letters
[^\s]*?(?P<caesura>\s)  # the caesura (end of the rhyme) is the
                        # first space after the rhyme     
.*)                     # End letters come as late as possible
(?P<e_letters>          # End group
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*F{1,2}
[mnrhv]*?\s*[rR]?(?:\s*[rR])?
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)$

Even though it doesn't have any nested quantifiers, it still takes forever to run. Regexes for other lines that were generated in the same way run quickly. Why is this?

I'm not seeing any nested quantifiers, but I am seeing a lot of ambiguities that would cause high-exponent polynomial runtime. For example, consider this part of the regex:

[mnrhv]*?\s*[rR]?(?:\s*[rR])? # r can also rhyme with R, but that's
                              # not relevant here (I think)
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)
(?P<m_letters>          # middle letters
[^\s]*?(?P<caesura>\s)  # the caesura (end of the rhyme) is the

Suppose the regex engine is at this point, and the text it's seeing is just a huge block of n characters. Those n's can be divided between the following parts of the regex:

[mnrhv]*?\s*[rR]?(?:\s*[rR])?
^^^^^^^^^

[mnrhv]*?\s*v{0,2}
^^^^^^^^^

[mnrhv]*?\s*v{0,2}
^^^^^^^^^
[mnrhv]*?\s*n{1,2}
^^^^^^^^^   ^^^^^^
[mnrhv\s]*?)
^^^^^^^^^^^
(?P<m_letters>
[^\s]*?(?P<caesura>\s)
^^^^^^^

If the number of n's is N, then there are O(N**6) ways to divide them, since there are 6 *? blocks that match n here, and everything in between is either optional or also matches n.
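To make that count concrete, here is a small Python sketch of a simplified model, not the real regex: it counts the ways a run of N identical characters can be handed out to a given number of consecutive lazy blocks, with whatever is left over going to the rest of the pattern. The function name `count_splits` is made up for illustration, and the model ignores the mandatory `n{1,2}`, so treat the numbers as an order-of-magnitude guide only.

```python
from math import comb


def count_splits(n, blocks):
    """Count the ways `blocks` consecutive lazy blocks can together
    consume at most `n` characters (each block may take zero or more)."""
    # ways[t] = number of ways the blocks processed so far consume exactly t chars
    ways = [1] + [0] * n
    for _ in range(blocks):
        # One more block may take any number of extra characters,
        # so the new table is the running (prefix) sum of the old one.
        running = 0
        updated = []
        for t in range(n + 1):
            running += ways[t]
            updated.append(running)
        ways = updated
    return sum(ways)


for n in (10, 20, 40):
    total = count_splits(n, 6)
    # The closed form is the binomial coefficient C(n + 6, 6), which grows like n**6.
    print(n, total, total == comb(n + 6, 6))
```

Under this model, doubling N multiplies the number of candidate splits by roughly 2**6 = 64 once N is large, which is why one unlucky line can be so much slower than the others.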

Are those \s parts mandatory? If so, you might be able to improve runtime by putting a + instead of a * on them.
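As a rough illustration of that suggestion, the sketch below compares `\s*` and `\s+` separators on a cut-down stand-in for the generated pattern. Everything here is an assumption made for the demo: the helper names `build_pattern` and `timed_match`, the choice of six blocks, and the trailing `F$` that the pathological text can never satisfy. Timings are printed rather than asserted because they vary by machine.

```python
import re
import time


def build_pattern(separator, blocks=6):
    """Hypothetical simplified pattern: `blocks` lazy letter runs, each
    followed by `separator`, then a literal F at the end of the text."""
    return re.compile("^" + ("[mnrhv]*?" + separator) * blocks + "F$")


def timed_match(pattern, text):
    """Return the match result and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


optional_ws = build_pattern(r"\s*")  # whitespace may be absent: ambiguous splits
required_ws = build_pattern(r"\s+")  # whitespace must be present: splits are pinned

# Both variants accept text where every letter run is followed by a space.
spaced = "n " * 6 + "F"
print(bool(optional_ws.match(spaced)), bool(required_ws.match(spaced)))

# A block of n's with no spaces cannot match either variant, but the
# \s* version has to try every way of dividing the n's before giving up.
block_of_ns = "n" * 18
for name, pattern in (("\\s*", optional_ws), ("\\s+", required_ws)):
    result, seconds = timed_match(pattern, block_of_ns)
    print(name, result, f"{seconds:.4f}s")
```

Note that switching to `+` changes what the pattern accepts, so it only helps if the whitespace really is required at those positions.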
