
What maximum line length should I allow to avoid catastrophic backtracking?

The line is approximately 7915621 characters long and is actually the view state value of an ASPX website.

I get the original HTML of the site and pass it line by line to the extract function. As soon as it reaches the view_state line containing that long string, the regex gets stuck.

Here is the regex pattern that gets stuck:

/[\w\.]+\@[\w]+(?:\.[\w]{3}|\.[\w]{2}\.[\w]{2})\b/gi

I thought about setting a maximum line length so that this line, and any others like it, would be skipped, but I can't think of an optimal size because I care about false positives.

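[\w\.]+ can match at so many positions in your document that processing them with your expression becomes a problem. A backtracking engine may start an attempt at every character of a long run of word characters, consume the rest of the run, fail to find @, and then give the characters back one at a time, so the cost grows roughly quadratically with the length of the run.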

Reducing the number of places where a match attempt can start is a possible solution, e.g. by using a word boundary.
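To make the effect of the word boundary concrete, here is a rough sketch that counts the positions at which the leading [\w.]+ could begin consuming characters, with and without \b in front of it. The helper countStarts and the sample string are made up for illustration; the counts only approximate what a backtracking engine does internally.

```javascript
// Hypothetical helper: counts the positions in `text` where `startPattern`
// (a zero-width pattern) succeeds, i.e. where a match attempt gets going.
function countStarts(text, startPattern) {
  const re = new RegExp(startPattern, 'g');
  let count = 0;
  // matchAll steps forward by one character after each empty match,
  // so every qualifying position is visited exactly once.
  for (const _ of text.matchAll(re)) count++;
  return count;
}

// Made-up sample: a long run of word characters followed by one address.
const sample = 'A'.repeat(1000) + ' user.name@example.com';

// Without \b: every position followed by a word character or a dot qualifies.
const withoutBoundary = countStarts(sample, '(?=[\\w.])');

// With \b: only positions that are also word boundaries qualify.
const withBoundary = countStarts(sample, '\\b(?=[\\w.])');

console.log({ withoutBoundary, withBoundary });
```

On this sample the boundary version leaves only a handful of starting positions, while the unanchored version has roughly one per character of the long run.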

(?:\.\w{3}|\.\w{2}\.\w{2}) can be simplified to \.\w{2}(?:\w|\.\w{2}), which factors out the \.\w{2} prefix that both alternatives share.
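As a quick sanity check of that simplification, the sketch below compares the two forms on a few made-up suffixes. The ^ and $ anchors are added here only so that each sample is tested as a whole string; they are not part of the patterns above.

```javascript
// Original alternation and its factored form, anchored for whole-string tests.
const original = /^(?:\.\w{3}|\.\w{2}\.\w{2})$/;
const simplified = /^\.\w{2}(?:\w|\.\w{2})$/;

// Made-up samples: some should match (.com, .co.uk), others should not.
const samples = ['.com', '.co.uk', '.io', '.info', '.c.uk', '.co.u', 'com'];

// Collect every sample on which the two patterns give different answers.
const disagreements = samples.filter(
  (s) => original.test(s) !== simplified.test(s)
);

console.log(disagreements); // an empty array means the two forms agree on these samples
```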

Use

/\b[\w.]+@\w+\.\w{2}(?:\w|\.\w{2})\b/gi

Or get rid of the character class, which also stops the local part from containing consecutive dots:

/\b\w+(?:\.\w+)*@\w+\.\w{2}(?:\w|\.\w{2})\b/gi
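A minimal sketch of how the last pattern might be used in a line-by-line extractor like the one described in the question. The function name extractEmails and the sample HTML, including the fake view state value, are assumptions made for illustration and are not taken from the original code.

```javascript
// The pattern from above; /g returns all matches, /i is kept from the original.
const EMAIL_PATTERN = /\b\w+(?:\.\w+)*@\w+\.\w{2}(?:\w|\.\w{2})\b/gi;

// Hypothetical extractor: returns every address found in one line.
function extractEmails(line) {
  // With a /g regex, match() returns an array of matches or null if none.
  return line.match(EMAIL_PATTERN) ?? [];
}

// Made-up page: one very long view-state-like line plus two lines with addresses.
const html = [
  '<input type="hidden" name="__VIEWSTATE" value="' + 'dDwtMTA4'.repeat(5000) + '" />',
  '<a href="mailto:john.doe@example.com">Contact</a>',
  '<p>Sales: sales@example.co.uk</p>',
].join('\n');

// Process the page line by line, as in the question.
const emails = html.split('\n').flatMap(extractEmails);

console.log(emails);
```

Because the pattern starts with \b\w+, the long run in the first line is only attempted from its first character rather than from every character in it.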

EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  @                        '@'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \.                       '.'
--------------------------------------------------------------------------------
  \w{2}                    word characters (a-z, A-Z, 0-9, _) (2
                           times)
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    \w                       word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    \w{2}                    word characters (a-z, A-Z, 0-9, _) (2
                             times)
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char

