
Resolving Catastrophic Backtracking issue in RegEx

I am using a regex to find URL substrings in strings. The regex is taken from tohster's answer to "What's the cleanest way to extract URLs from a string using Python?"

The regex is:

r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'

I have made some changes to it:

  1. In the IPv4 detection part, I changed the order of the alternatives that match an octet. Precisely, I changed [1-9]\d?|1\d\d|2[01]\d|22[0-3] to 25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9] in 2 places.
  2. Made the scheme group (?:(?:https?|ftp):\/\/) optional, so the pattern now starts with (?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?
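
As a sanity check on item 1, here is a minimal sketch that tests the reordered octet alternation in isolation. The names OCTET and is_octet are illustrative only and do not appear in the original regex. The check assumes the alternation is applied to the whole string with re.fullmatch, which the full URL pattern does not do.

```python
import re

# Reordered octet alternation from item 1, wrapped in a non-capturing group.
OCTET = re.compile(r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])")

def is_octet(text: str) -> bool:
    """Return True if `text` as a whole is matched by the octet alternation."""
    return OCTET.fullmatch(text) is not None

# Collect every integer below 300 whose decimal form is accepted.
accepted = [n for n in range(300) if is_octet(str(n))]
print(accepted[0], accepted[-1], len(accepted))  # 0 255 256
```

Because the alternatives are tried from left to right, putting the three-digit cases first lets an unanchored search prefer the longest octet, for example 231 rather than 23.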

The final version is:

(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?

The final regex looks very promising: it fits my requirements much better than the original and works in Python as well as JavaScript. However, the changes I made cause a "catastrophic backtracking" error on the following examples:

asasasasasac31.23.53.122asasassasd

12312312312321.32.34.2312312312321

12.3423423432.234123123.123

31.134232131.231.34

This can be tested at https://regex101.com/r/i6jDei/1

My contention is that the first example, asasasasasac31.23.53.122asasassasd, should be easy to match, since the IP is surrounded by non-numeric characters.

Also, is there a way to have the first two of the above examples matched as valid IPv4 addresses?

To resolve the ambiguity, I would opt for the largest possible address, i.e.:

31.23.53.122

21.32.34.231

The catastrophic backtracking is caused by the pattern (?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})), where (?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+ tries a huge number of combinations if the overall pattern cannot be matched. As you can see, the character classes are basically the same, so for asasasasasac31, for example, it can match like:

(asasasasasac31)
(a)(sasasasasac31)
(a)(s)(asasasasac31)
(as)(asasasasac31)

This is not the exact order the engine actually follows; it just shows how many combinations exist.
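
To put a number on "how many combinations", here is a rough counting sketch. It is a model of the search space, not a trace of what any regex engine does. The function name count_splits is made up for illustration. It assumes the label contains no hyphen, so every chunk boundary is a place where the optional - matched nothing.

```python
from functools import lru_cache

def count_splits(label: str) -> int:
    """Count the ways to cut `label` into consecutive non-empty chunks.

    Each cut corresponds to one way (?:X+-?)*X+ can distribute the
    characters between the repeated group and the trailing X+.
    """
    @lru_cache(maxsize=None)
    def ways(start: int) -> int:
        if start == len(label):
            return 1  # reached the end: one complete split
        # The next chunk may end at any later position.
        return sum(ways(end) for end in range(start + 1, len(label) + 1))

    return ways(0)

print(count_splits("asasasasasac31"))  # 8192, i.e. 2 ** 13
```

The count doubles with every extra character (2 ** (n - 1) for a label of length n), which is why slightly longer inputs push the engine past its step limit.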

The mistake here seems to be that the - is optional (-?), which I see no reason for. If we make the - mandatory by removing the ? after it, the pattern works for your samples (and takes fewer steps on the samples that already worked).
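
As a small check that making the hyphen mandatory does not change which labels are accepted, the sketch below compares both variants of the label sub-pattern on a few short inputs. It uses a simplified ASCII-only class [a-z0-9] instead of the full [a-z\u00a1-\uffff0-9]. The names LABEL, loose, strict and verdicts are illustrative only. The inputs are kept short on purpose, so the optional-hyphen variant cannot blow up.

```python
import re

LABEL = "[a-z0-9]"  # simplified stand-in for the original character class

loose = re.compile(rf"(?:{LABEL}+-?)*{LABEL}+")   # hyphen optional (original)
strict = re.compile(rf"(?:{LABEL}+-)*{LABEL}+")   # hyphen mandatory (fixed)

samples = ["abc", "a-b", "a-b-c", "a--b", "-ab", "ab-", "a1-b2", ""]

def verdicts(pattern):
    """Return, per sample, whether the whole string is accepted."""
    return [pattern.fullmatch(s) is not None for s in samples]

for sample, a, b in zip(samples, verdicts(loose), verdicts(strict)):
    print(repr(sample), a, b)
```

Both variants accept the same strings here: runs of label characters separated by single hyphens, with no leading, trailing or doubled hyphen. The difference is only in how many ways the engine can arrive at a match.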

See the updated regex101 demo, where I also added your samples that caused the catastrophic backtracking.

The final pattern is then:

(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])|(?:(?:[a-z\u00a1-\uffff0-9]+-)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?
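
For reference, here is a minimal Python sketch that runs the pattern above against the four samples from the question. The names URL_RE and first_match are illustrative only. The pattern is split across several raw strings purely for readability. Since the pattern contains capturing groups, the sketch reads group(0) from re.search rather than using re.findall, which would return tuples of groups.

```python
import re

# The final pattern, split over several raw-string pieces for readability.
URL_RE = re.compile(
    r"(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?"
    r"(?:((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}"
    r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
    r"|(?:(?:[a-z\u00a1-\uffff0-9]+-)*[a-z\u00a1-\uffff0-9]+)"
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]+-)*[a-z\u00a1-\uffff0-9]+)*"
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,})))"
    r"(?::\d{2,5})?(?:\/[^\s]*)?"
)

def first_match(text):
    """Return the first substring matched by URL_RE, or None."""
    m = URL_RE.search(text)
    return m.group(0) if m else None

samples = [
    "asasasasasac31.23.53.122asasassasd",
    "12312312312321.32.34.2312312312321",
    "12.3423423432.234123123.123",
    "31.134232131.231.34",
]
for s in samples:
    print(s, "->", first_match(s))
```

With this pattern the first two samples should yield 31.23.53.122 and 21.32.34.231, while the last two contain no run of four valid octets and should yield no match.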


 