
Java regex to match start/end tags causes stack overflow

The standard implementation of the Java Pattern class uses recursion to implement many forms of regular expressions (e.g., certain operators, alternation).

This approach causes stack overflow issues with input strings that exceed a (relatively small) length, which may not even be more than 1,000 characters, depending on the regex involved.

A typical example of this is the following regex, which uses alternation to extract a possibly multiline element (named Data) from a surrounding XML string that has already been supplied:

<Data>(?<data>(?:.|\r|\n)+?)</Data>

The above regex is used with the Matcher.find() method to read the "data" capturing group. It works as expected until the length of the supplied input string exceeds 1,200 characters or so, at which point it causes a stack overflow.

Can the above regex be rewritten to avoid the stack overflow issue?

Some more details on the origin of the stack overflow issue:

Sometimes the regex Pattern class will throw a StackOverflowError. This is a manifestation of the known bug #5050507, which has been in the java.util.regex package since Java 1.4. The bug is here to stay because it has "won't fix" status. This error occurs because the Pattern class compiles a regular expression into a small program which is then executed to find a match. This program is used recursively, and when too many recursive calls are made, this error occurs. See the description of the bug for more details. It seems to be triggered mostly by the use of alternations.
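To see the behaviour described above on a particular JVM, a small probe along the following lines may help. It is only a sketch: the class name RegexOverflowDemo, the method overflows, and the payload lengths are made up for illustration, and whether the error actually appears depends on the Java version and the thread stack size in use.

```java
import java.util.regex.Pattern;

class RegexOverflowDemo {
    // The alternation-based pattern from the question.
    private static final Pattern ALTERNATION =
            Pattern.compile("<Data>(?<data>(?:.|\\r|\\n)+?)</Data>");

    /** Returns true if matching a payload of the given length throws StackOverflowError. */
    static boolean overflows(int payloadLength) {
        StringBuilder sb = new StringBuilder("<Data>");
        for (int i = 0; i < payloadLength; i++) {
            sb.append('x');
        }
        sb.append("</Data>");
        try {
            ALTERNATION.matcher(sb).find();
            return false;
        } catch (StackOverflowError e) {
            // The matcher recursed too deeply for the current thread's stack.
            return true;
        }
    }

    public static void main(String[] args) {
        // Hypothetical lengths; results vary with JVM version and stack size.
        for (int length : new int[] {100, 10_000, 1_000_000}) {
            System.out.println(length + " chars -> StackOverflowError: " + overflows(length));
        }
    }
}
```

Catching StackOverflowError is done here only to report the outcome; production code would normally avoid the problematic pattern rather than catch the error.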

Your regex (which has alternations) matches one or more characters of any kind between two tags.

You may use a lazy dot matching pattern with the Pattern.DOTALL modifier (or the equivalent embedded flag (?s)), which makes the . match newline symbols as well:

(?s)<Data>(?<data>.+?)</Data>

See this regex demo.
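As a rough illustration of how the DOTALL variant could be used from Java, here is a minimal sketch. The class name DotAllDataExtractor, the helper extractData, and the sample payload are assumptions made for this example and are not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDataExtractor {
    // (?s) turns on DOTALL, so '.' also matches line terminators.
    private static final Pattern DATA =
            Pattern.compile("(?s)<Data>(?<data>.+?)</Data>");

    /** Returns the content of the first Data element, or null if none is found. */
    static String extractData(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        // Build a multiline payload well beyond the ~1,200 characters from the question.
        StringBuilder payload = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            payload.append("line ").append(i).append("\r\n");
        }
        String xml = "<Root><Data>" + payload + "</Data></Root>";

        String data = extractData(xml);
        // Check that the whole payload was captured.
        System.out.println(data != null && data.contentEquals(payload));
    }
}
```

The same effect can be had by dropping (?s) from the pattern string and passing Pattern.DOTALL as the second argument to Pattern.compile.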

However, lazy dot matching patterns still consume lots of memory in case of huge inputs. The best way out is to use an unroll-the-loop method:

<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>

See the regex demo.

Details:

  • <Data> - literal text <Data>
  • (?<data> - start of the capturing group "data"
    • [^<]* - zero or more characters other than <
    • (?:<(?!/?Data>)[^<]*)* - zero or more sequences of:
      • <(?!/?Data>) - a < that is not followed by Data> or /Data>
      • [^<]* - zero or more characters other than <
  • ) - end of the "data" group
  • </Data> - closing delimiter
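The breakdown above can be tried out with a short sketch such as the following. The class name UnrolledDataExtractor, the helper extractData, and the sample XML are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledDataExtractor {
    // Unrolled pattern: runs of non-'<' characters, interleaved with any '<'
    // that does not start a <Data> or </Data> tag.
    private static final Pattern DATA = Pattern.compile(
            "<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>");

    /** Returns the content of the first Data element, or null if none is found. */
    static String extractData(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        String xml = "<Root><Data>first line\n<b>bold</b>\nlast line</Data>"
                + "<Data>second</Data></Root>";
        // The inner <b> tags are kept; matching stops at the first </Data>.
        System.out.println(extractData(xml));
    }
}
```

Note that, unlike the .+? variants, this pattern also accepts an empty Data element, and the optional group repeats once per < found in the payload rather than once per character.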
