
Python Regex Catastrophic Backtracking in URL handling

I want to write a regex to capture URLs in a text. Now, the problem is that any decent regex I use for capturing URLs encounters catastrophic backtracking on some URLs.

I have tried the "diegoperini" regex from here, and have also read other questions and answers here, here, and here. However, none of them solved my problem.

Also, I have three other regexes:

Regex:
SIMPLE_URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
ANY_URL_REGEX = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))"""

The simple URL regex does not get trapped in the cases I tried, but it also does not perform very well and misses many URLs; the other two perform better, but get trapped in some cases.

Now, one part of my problem was the encoding of non-ASCII URLs, which was apparently solved by decoding the text like this:

from urllib.parse import unquote  # Python 3 (on Python 2: from urllib import unquote)

try:
    meta = unquote(meta.encode('utf-8')).decode('utf-8')
except TypeError:
    meta = unquote(meta)

But some time later another problematic URL came up, something like this one:

https://www.example.net/ar/book//%DA%A9-%D8%B3-(%D9%81-%DB%8C-%DB%8C-%DB%8C-%DB%8C-%DB%8C-%DB%8C-%D9%85)

Such URLs are rare, but when they occur they trigger very inefficient backtracking, which causes the program to stop responding indefinitely. (As I have read here, the problem is that the regex module does not release the GIL.)
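The blow-up is easy to reproduce in isolation; a minimal sketch (the pattern and input here are illustrative, not taken from the URL regexes above):

```python
import re
import time

# Classic catastrophic case: the nested quantifiers in (a+)+ let the engine
# partition a run of 'a's in exponentially many ways, and the final 'b'
# forces it to try every partition before the match can fail.
pattern = re.compile(r'(a+)+b')

def match_time(n):
    """Time a failing match against n 'a's followed by a non-'b' character."""
    start = time.perf_counter()
    result = pattern.match('a' * n + 'c')
    return result, time.perf_counter() - start

# Each extra 'a' roughly doubles the running time.
for n in (16, 18, 20):
    result, elapsed = match_time(n)
    print(f'n={n}: matched={result is not None}, {elapsed:.3f}s')
```

The URL regexes above fail the same way, just with more elaborate nested quantifiers over the path and host portions.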

Considering all this information, I have two questions:

  • First, is it possible to have / is there a regex pattern for matching URLs that performs reasonably and avoids catastrophic backtracking completely?

  • Second, if there is no such regex, is there another way to catch the cases where the regex gets trapped, and to throw an exception or bypass it in some other way?

This one uses the \\x{XXXX} notation for Unicode chars; substitute whatever notation your regex engine uses.

Also, the biggest issue would be the boundaries, i.e. the things around the URL.

The one below uses a whitespace boundary, though you can remove it and try your luck.

"(?i)(?<!\\S)(?!mailto:)(?:[a-z]*:\\/\\/)?(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\x{a1}-\\x{ffff}0-9]+-?)*[a-z\\x{a1}-\\x{ffff}0-9]+)(?:\\.(?:[a-z\\x{a1}-\\x{ffff}0-9]+-?)*[a-z\\x{a1}-\\x{ffff}0-9]+)*(?:\\.(?:[a-z\\x{a1}-\\x{ffff}]{2,})))|localhost)(?::\\d{2,5})?(?:\\/[^\\s]*)?(?!\\S)"

Formatted:

 (?i)
 (?<! \S )
 (?! mailto: )
 (?:
      [a-z]* :
      \/\/
 )?
 (?:
      \S+ 
      (?: : \S* )?
      @
 )?
 (?:
      (?:
           (?:
                [1-9] \d? 
             |  1 \d\d 
             |  2 [01] \d 
             |  22 [0-3] 
           )
           (?:
                \.
                (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
           ){2}
           (?:
                \.
                (?:
                     [1-9] \d? 
                  |  1 \d\d 
                  |  2 [0-4] \d 
                  |  25 [0-4] 
                )
           )
        |  (?:
                (?: [a-z\x{a1}-\x{ffff}0-9]+ -? )*
                [a-z\x{a1}-\x{ffff}0-9]+ 
           )
           (?:
                \.
                (?: [a-z\x{a1}-\x{ffff}0-9]+ -? )*
                [a-z\x{a1}-\x{ffff}0-9]+ 
           )*
           (?:
                \.
                (?: [a-z\x{a1}-\x{ffff}]{2,} )
           )
      )
   |  localhost
 )
 (?: : \d{2,5} )?
 (?: \/ [^\s]* )?
 (?! \S )
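A sketch of that pattern transcribed for Python's `re` module: the `\x{..}` escapes become `\xAA`/`\uFFFF`, and the escaped slashes are unnecessary. The transcription is mine, and I have not verified that it avoids backtracking on the pathological URL above.

```python
import re

# Whitespace-boundary URL pattern from above, transcribed for Python's re.
# \x{a1}-\x{ffff} (PCRE/Java notation) becomes \xa1-\uffff here.
URL_PATTERN = re.compile(
    r"(?i)(?<!\S)(?!mailto:)(?:[a-z]*://)?(?:\S+(?::\S*)?@)?"
    r"(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|(?:(?:[a-z\xa1-\uffff0-9]+-?)*[a-z\xa1-\uffff0-9]+)"
    r"(?:\.(?:[a-z\xa1-\uffff0-9]+-?)*[a-z\xa1-\uffff0-9]+)*"
    r"(?:\.(?:[a-z\xa1-\uffff]{2,})))|localhost)"
    r"(?::\d{2,5})?(?:/[^\s]*)?(?!\S)"
)

print(URL_PATTERN.search("see https://www.example.net/page?q=1 for details").group())
```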

After extensive searching, I found a semi-solution for my problem.

This solution does not change the regex in question, but uses a timeout to raise an exception when the regex gets stuck backtracking.

I added the package timeout-decorator and wrote something like this:

from timeout_decorator import timeout, TimeoutError

RE_TIMEOUT = 5  # seconds; tune to your workload

@timeout(seconds=RE_TIMEOUT)
def match_regex_timeout(compiled_regex, replacer, data):
    return compiled_regex.sub(replacer, data)

The use of the function would be something like this:

import re
import logging

logger = logging.getLogger(__name__)

url_match = re.compile(url_regex, flags=re.MULTILINE)
replacer = ' URL '

try:
    text = match_regex_timeout(url_match, replacer, text)
except TimeoutError:
    logger.error('REGEX TIMEOUT ERROR: can not parse URL')
    text = remove_big_tokens(text)

This basically tries to parse the text and, if it fails to do so within the expected time, falls back to removing big tokens from the text, which are likely to be the problematic URLs.
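A stdlib-only alternative (my own sketch, not part of the original answer) is to run the substitution in a worker process and terminate it on timeout; unlike signal- or thread-based timeouts, killing the process reliably stops a regex that is stuck in backtracking:

```python
import re
from multiprocessing import Pool, TimeoutError as MPTimeoutError

def _sub(args):
    # Worker: plain re.sub; pattern is passed as a string so it pickles cleanly.
    pattern, replacer, data = args
    return re.sub(pattern, replacer, data)

def sub_with_timeout(pattern, replacer, data, seconds=5):
    """Run re.sub in a worker process; return None if it exceeds the timeout."""
    with Pool(processes=1) as pool:
        job = pool.apply_async(_sub, ((pattern, replacer, data),))
        try:
            return job.get(timeout=seconds)
        except MPTimeoutError:
            pool.terminate()  # kill the worker stuck in backtracking
            return None

if __name__ == '__main__':
    # Fast case: completes well within the timeout.
    print(sub_with_timeout(r'https?://\S+', ' URL ', 'see https://example.com now'))
    # Pathological case: nested quantifiers blow up, and we bail out.
    print(sub_with_timeout(r'(a+)+b', 'X', 'a' * 40 + 'c', seconds=1))
```

The third-party regex module also accepts a timeout argument on its matching and substitution calls, which achieves the same effect without a second process.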
