Why does this regex sometimes get stuck and freeze my program? What alternative can I use?

Question

import re

input_text_to_check = input()  # Read the sentence to check

regex_patron_m1 = r"\s*((?:\w+\s*)+) \s*\¿?(?:would not be what |would not be that |would not be that |would not be the |would not be this |would not be the |would not be some)\s*((?:\w+\s*)+)\s*\??"
m1 = re.search(regex_patron_m1, input_text_to_check, re.IGNORECASE)  # Check whether the regex matches before entering the block below

#Validation
if m1:
    word, association = m1.groups()
    word = word.strip()
    association = association.strip()

    print(repr(word))
    print(repr(association))

I think that although the regex is somewhat long, a modern PC should have no trouble checking the 10 or 20 options in the (?: | | | | ) group. That is why I suspect the problem is in the first \s*((?:\w+\s*)+) \s* and/or in the last \s*((?:\w+\s*)+)\s*

The following is an example of an input that causes the regular expression to get stuck:

"the blue skate would not be that product that you want buy now"

And this is an example where it does not get stuck: "the blue skate would not be that product"

In that case it gives me the words that I want to extract:

'the blue skate'
'product'

Is there an alternative that extracts what is in front of and behind those options without sometimes freezing? And what could be the cause of the problem with this regex?

Answer 1

Based on this explanation of 'Catastrophic Backtracking', I think the issue with your regex is the following:

The text you try to match with ((?:\w+\s*)+) can be matched in multiple ways. Let's say you use ((?:\w+\s*)+) on the input string abc. This can be matched in several ways:

( abc and 0 whitespaces)
( a and 0 whitespaces)( b and 0 whitespaces)( c and 0 whitespaces)
( a and 0 whitespaces)( bc and 0 whitespaces)
( ab and 0 whitespaces)( c and 0 whitespaces)

As long as you only need to match ((?:\w+\s*)+), this works fine. But when you add something else afterwards (like the 10 or so options in your case), the regex engine has to do a lot of backtracking: whenever the rest of the pattern fails to match, it goes back and tries another way of splitting the words. Have a look at the provided link for a better explanation.
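To make the growth concrete, here is a small sketch that lists every way a run of word characters can be cut into chunks, where each chunk stands for one iteration of the inner group (?:\w+\s*). The helper name chunkings is made up for this illustration; it only counts the possibilities and does not run the problematic regex.

```python
def chunkings(run):
    # Yield every way to cut `run` into consecutive non-empty chunks.
    # Each chunk stands for one iteration of the inner group (?:\w+\s*).
    if not run:
        yield []
        return
    for cut in range(1, len(run) + 1):
        for rest in chunkings(run[cut:]):
            yield [run[:cut]] + rest


# The four ways to split "abc".
for way in chunkings("abc"):
    print(way)

# The count doubles with every extra character: 2 ** (len(run) - 1).
for run in ("abc", "product", "x" * 12):
    print(len(run), sum(1 for _ in chunkings(run)))
```

A single 7-letter word such as product already has 64 splits, and the counts for the individual words multiply, which would explain why adding a few more words after "product" is enough to make the search appear frozen.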

Removing the + after both \w results in a regex that works for the two cases provided:


"\s*((?:\w\s*)+) \s*\¿?(?:would not be what |would not be that |would not be that |would not be the |would not be this |would not be the |would not be some)\s*((?:\w\s*)+)\s*\??"gm

Does this work on your device and for all your test cases?
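As a hedged sketch of how this could look in the question's Python code, the block below puts the suggested pattern next to one possible alternative, \w+(?:\s+\w+)*, in which every repetition has to start with whitespace so that a run of letters can only be consumed in one way. The names SEPARATOR, ANSWER_PATTERN, WORD_RUN_PATTERN and extract are made up for this example. The duplicated options from the question are merged into one group, and the \b after that group is an added assumption so that "the" does not match the start of a longer word. Only the two inputs from the question were considered.

```python
import re

# Separator phrases from the question, with duplicates removed.
# The trailing \b is an assumption added for this sketch.
SEPARATOR = r"would not be (?:what|that|the|this|some)\b"

# The suggested fix: one \w per iteration instead of \w+.
ANSWER_PATTERN = re.compile(
    r"\s*((?:\w\s*)+) \s*¿?" + SEPARATOR + r"\s*((?:\w\s*)+)\s*\??",
    re.IGNORECASE,
)

# A possible alternative: words separated by mandatory whitespace.
WORD_RUN_PATTERN = re.compile(
    r"\s*(\w+(?:\s+\w+)*)\s+¿?" + SEPARATOR + r"\s*(\w+(?:\s+\w+)*)\s*\??",
    re.IGNORECASE,
)


def extract(pattern, text):
    # Return (word, association) with surrounding whitespace removed,
    # or None when the pattern does not match.
    m = pattern.search(text)
    if m is None:
        return None
    word, association = m.groups()
    return word.strip(), association.strip()


inputs = [
    "the blue skate would not be that product",
    "the blue skate would not be that product that you want buy now",
]
for text in inputs:
    print(extract(ANSWER_PATTERN, text))
    print(extract(WORD_RUN_PATTERN, text))
```

Both patterns should return ('the blue skate', 'product') for the short input and ('the blue skate', 'product that you want buy now') for the long one, without the long pause seen with the original regex.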

Solution 1
1 Accepted 2022-02-15 16:05:34
