
Catastrophic Backtracking in RegEx

I'm trying to find URLs in text using RegEx. I found a pattern here. This works well in most cases, but I found a weird test case that causes a catastrophic backtracking error.

You can see my pattern and a test case here. In this case it works fine, but if you add another "," at the end, it gives you an error.

I want to know the cause and how to fix it.

If all you want is to know how to make the pattern safer and less prone to catastrophic backtracking, replace each (?:x+|\(xxx\))* pattern with (?:\(xxx\)|x)* . Each iteration of the group then consumes exactly one x, so there is only one way to divide a run of x's among the iterations. This greatly reduces the number of steps the regex engine takes.
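As a quick sanity check, here is a minimal Python sketch comparing the two shapes. It treats x and \(xxx\) as literal toy stand-ins for the real sub-patterns, and the names NESTED, FLAT and samples are illustrative only. Both forms should accept exactly the same strings; what changes is how many ways the engine can split a run of x's.

```python
import re

# Toy stand-ins: a literal "x" plays the role of [^\s()<>],
# and a literal "(xxx)" plays the role of the parenthesised group.
NESTED = re.compile(r"(?:x+|\(xxx\))*")  # "+" inside "*": many ways to split a run
FLAT = re.compile(r"(?:\(xxx\)|x)*")     # one char per iteration: one way to split

# Short inputs only, so even the nested form finishes instantly.
samples = ["", "xx(xxx)x", "(xxx)(xxx)", "x(xx)", "xxxy"]

for s in samples:
    nested_ok = NESTED.fullmatch(s) is not None
    flat_ok = FLAT.fullmatch(s) is not None
    print(f"{s!r:14} nested={nested_ok} flat={flat_ok}")
```

The two columns agree on every sample, which is the point of the rewrite: same language, fewer paths to explore.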

So, in this case, you can use:

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:\((?:\([^\s()<>]+\)|[^\s()<>])*\)|[^\s()<>])+(?:\((?:\([^\s()<>]+\)|[^\s()<>])*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))

See the regex demo.
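For readers who want to try this outside regex101, the following is a small sketch using Python's re module. The pattern is the one given above, copied verbatim; URL_RE and find_urls are names made up for this example, and the input is the test string discussed in the question with the extra trailing comma.

```python
import re

# The rewritten pattern from above, copied verbatim into a raw string.
URL_RE = re.compile(
    r"""(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:\((?:\([^\s()<>]+\)|[^\s()<>])*\)|[^\s()<>])+(?:\((?:\([^\s()<>]+\)|[^\s()<>])*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"""
)


def find_urls(text: str) -> list[str]:
    """Return every URL-like substring captured by group 1."""
    return [m.group(1) for m in URL_RE.finditer(text)]


# The test case from the question, with the extra trailing comma.
text = "https://test.com/test!!!!!!!!!!!,"
print(find_urls(text))  # ['https://test.com/test']
```

The trailing punctuation is dropped because the final group refuses to end a URL on ! or , and the engine only has to give those characters back one at a time.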

Basically, ([^\s()<>]+|(\([^\s()<>]+\)))* is replaced with (?:\((?:\([^\s()<>]+\)|[^\s()<>])*\)|[^\s()<>])+ , which matches:

  • (?: - start of a non-capturing group:
    • \( - a ( char
    • (?:\([^\s()<>]+\)|[^\s()<>])* - zero or more occurrences of either a ( , one or more chars other than whitespace, ( , ) , < and > , and then a ) , or a single char other than whitespace, ( , ) , < and >
    • \) - a ) char
    • | - or
    • [^\s()<>] - any char other than whitespace, ( , ) , < and >
  • )+ - end of the group, repeated one or more times.
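To make the breakdown above concrete, this sketch isolates the repeated group and runs it against a few hand-picked strings. BODY and the candidate strings are assumptions chosen for illustration; they are not taken from the question.

```python
import re

# The repeated middle group from the rewritten pattern, on its own.
BODY = re.compile(r"(?:\((?:\([^\s()<>]+\)|[^\s()<>])*\)|[^\s()<>])+")

for candidate in ["wiki/Foo_(bar)", "a(b(c)d)e", "a(b(c(d)))", "a b"]:
    print(f"{candidate!r:16} -> {BODY.fullmatch(candidate) is not None}")
```

The first two strings match in full because the group allows parentheses nested up to two levels deep. The third has three levels of nesting and the fourth contains whitespace, so neither matches in full.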

If you look above the flags section of your link ( https://regex101.com/r/oPnn6n/1 ), you can see that it took 49,235 steps just to match one string!

To see what causes the backtracking here, let's break down the expression.

Your expression can be written as:

(
(?: ~https?://~ | ~www\d{0,3}[.]~ | ~[a-z0-9.\-]+[.][a-z]{2,4}/~ )   <-Line1 with 3 conditions
(?: ~[^\s()<>]+~ | ~\(([^\s()<>]+|(\([^\s()<>]+\)))*\)~ )+       <-Line2 with 2 conditions
(?: ~\(([^\s()<>]+|(\([^\s()<>]+\)))*\)~ | ~[^\s`!()\[\]{};:\'".,<>?«»“”‘’]~ ) <-Line3 with 2 conditions
)

Let's focus on the three lines I've indicated above. By "conditions" I mean the alternatives separated by the alternation operator | , which I have put between ~ for readability (the ~ markers won't work in any regex; they are only there for readability). So in line 1 I'm calling https?:// , www\d{0,3}[.] and [a-z0-9.\-]+[.][a-z]{2,4}/ the three conditions, and similarly for the next two lines.

Now let's see how your input, https://test.com/test!!!!!!!!!!! , is matched by the regex engine.

  1. The https:// part is matched (consumed) by condition 1 of line 1, and since this is an alternation, the other two conditions are skipped and the regex engine moves on to line 2.
  2. The whole of test.com/test!!!!!!!!!!! is greedily matched (consumed) by condition 1 of line 2. Now that the entire string has been consumed, backtracking starts so that the regex engine can try to match something with line 3.
  3. The engine backtracks one step (unconsumes one ! ) and tries to match that ! with line 3. Neither condition on line 3 matches ! , so the engine goes back to line 2 again.
  4. Here condition 1 matches the ! again, and the regex engine moves to line 3.

I assume that this repeats until some limit is reached and the engine throws the error. I could be wrong here, but either way I hope you now have an idea of which part of the expression is causing the backtracking.
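One way to see why the repetition grows so quickly: condition 1 of line 2 is [^\s()<>]+ inside a group that is itself repeated with + , so a trailing run of n characters that line 3 rejects can be divided among the group's iterations in many different ways, and a backtracking engine may try each of them before giving up on that run. The sketch below is a simplified counting model, not a trace of any real engine, and count_splits is a hypothetical helper written for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count the ways to cut a run of n characters into non-empty chunks.

    Each chunk models what one iteration of `[^\\s()<>]+` could consume.
    """
    if n == 0:
        return 1  # nothing left to split: exactly one way (stop here)
    # The first chunk takes k characters; the remainder is split recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (1, 2, 3, 11, 12):
    print(n, count_splits(n))
# 1 1
# 2 2
# 3 4
# 11 1024
# 12 2048
```

In this model every extra trailing ! or , doubles the number of splits, which is consistent with the question's observation that adding one more comma is enough to push the engine over its limit. With the one-character-per-iteration rewrite, the same run can be divided in only one way.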

The best way to avoid backtracking is to be precise about what you are going to match with the regex. In your case, one thing you could do is simplify or remove the repeated matching conditions, like this: https://regex101.com/r/I1u8Ho/1 .
