Java regex dies on stack overflow: need a better version

Question

I'm working on JMD (Java MarkDown), a Java port of MarkDownSharp, but I'm having an issue with one regex in particular. For the file Markdown_Documentation_Syntax.text, this regular expression dies:

private static final String BLOCK_TAGS_1 = "p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del";
private static final String BLOCKS_NESTED_PATTERN = String.format("" +
        "(" +                      // save in $1
        "^" +                      // start of line (with MULTILINE)
        "<(%s)" +                  // start tag = $2
        "\\b" +                    // word break
        "(.*\\n)*?" +              // any number of lines, minimally matching
        "</\\2>" +                 // the matching end tag
        "[ \\t]*" +                // trailing spaces/tabs
        "(?=\\n+|\\Z)" +           // followed by a newline or end of input
        ")", BLOCK_TAGS_1);

which translates to:

(^<(p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del)\b(.*\n)*?</\2>[ \t]*(?=\n+|\Z))

This pattern looks for accepted block tags anchored to the start of a line, followed by any number of lines, and terminated by a matching closing tag followed by a newline or the end of the string. This generates:

java.lang.StackOverflowError
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
        ...

This can be dealt with by increasing the stack space for Java (defaults to 128k/400k for oss/ss, IIRC), but the above expression is slow anyway.
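As a rough sketch of the stack-space workaround: besides the JVM-wide `-Xss` option, a single match can be run on a worker thread that requests a larger stack. The class name `BigStackMatch`, the helper `firstBlock`, the shortened tag list, and the 16 MB figure are all made up for illustration; the stack size passed to `Thread` is only a hint that the JVM may adjust or ignore, so this postpones the overflow rather than removing it.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BigStackMatch {
    // Shortened tag list, for illustration only.
    private static final String TAGS = "p|div|h[1-6]|blockquote|pre|table";

    // Same shape as BLOCKS_NESTED_PATTERN from the question.
    private static final Pattern BLOCKS_NESTED = Pattern.compile(
            "(^<(" + TAGS + ")\\b(.*\\n)*?</\\2>[ \\t]*(?=\\n+|\\Z))",
            Pattern.MULTILINE);

    /** Runs the match on a worker thread that requests stackBytes of stack. */
    static String firstBlock(String text, long stackBytes) {
        AtomicReference<String> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Runnable task = () -> {
            try {
                Matcher m = BLOCKS_NESTED.matcher(text);
                if (m.find()) {
                    result.set(m.group(1));
                }
            } catch (StackOverflowError e) {
                // Record the overflow so the caller can react to it.
                failure.set(e);
            }
        };
        // The stack size is only a hint; the JVM may round or ignore it.
        Thread worker = new Thread(null, task, "regex-worker", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while matching", e);
        }
        if (failure.get() != null) {
            throw new IllegalStateException("regex overflowed the stack", failure.get());
        }
        return result.get();
    }

    public static void main(String[] args) {
        String text = "<div>\nhello\n</div>\n\nafter";
        System.out.println(firstBlock(text, 16L * 1024 * 1024));
    }
}
```

This keeps the original pattern untouched, which also means it stays just as slow; it only buys headroom for larger inputs.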

So I'm looking for a regex guru who can do better (or at least explain the performance problem with this pattern). The C# version is a little slow but works fine. PHP seems to have no issues with this either.

Edit: This is on JDK6u17 running on Windows 7 64 Ultimate.

Answer 1

This part:

(.*\n)*?

will involve A LOT of unnecessary backtracking because of the nested * and because there are characters that have to match afterwards.

I just ran a quick benchmark in Perl on some arbitrary strings and got a 13-15% improvement just by switching that piece to

(?>.*\n)*?

which does non-capturing, independent subgrouping. That gives you two benefits: it no longer wastes time capturing the matching string, and, more importantly, it no longer backtracks into the innermost .*, which is a waste of time anyway. There's no way that only a portion of that .* will ever result in a valid match, so explicitly making it all or nothing should help.

I don't know if that's a sufficient improvement in this case, however.
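For reference, here is a minimal sketch of the same substitution in Java, which also supports the `(?>...)` atomic group syntax. The class `AtomicGroupDemo`, the helper `firstMatch`, and the shortened tag list are hypothetical names used only for this illustration. It checks only that both variants match the same text; it does not measure speed or stack depth.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AtomicGroupDemo {
    // Shortened tag list, for illustration only.
    static final String TAGS = "p|div|h[1-6]|blockquote|pre|table";

    // Original form: capturing group around each line.
    static final Pattern ORIGINAL = Pattern.compile(
            "(^<(" + TAGS + ")\\b(.*\\n)*?</\\2>[ \\t]*(?=\\n+|\\Z))",
            Pattern.MULTILINE);

    // Suggested form: each line is consumed by an atomic (independent) group,
    // so the engine never backtracks into the inner .* once a line is matched.
    static final Pattern ATOMIC = Pattern.compile(
            "(^<(" + TAGS + ")\\b(?>.*\\n)*?</\\2>[ \\t]*(?=\\n+|\\Z))",
            Pattern.MULTILINE);

    /** Returns the text of the first block matched by p, or null if none. */
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "intro\n<div>\nhello\n</div>\n\nafter";
        System.out.println(firstMatch(ORIGINAL, text));
        System.out.println(firstMatch(ATOMIC, text));
    }
}
```

Because the capturing group for the lines is gone, group numbering for `$1` and `$2` is unchanged, so the backreference `\2` still refers to the tag name.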

Answer 2

While improving the pattern does help and is advisable, Java's pattern matcher is recursive, and it is generally best to switch to an iterative solution.

When I had similar problems, I switched to jregex ( http://jregex.sourceforge.net/ ) and that worked for me.

The pattern match may succeed now with the improved solution, but it may still fail if a text 10 times as big is given.
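To illustrate what an iterative approach can look like without a third-party engine, here is a hand-rolled sketch. It is not jregex and does not use its API. It finds the opening tag with a small regex and then walks the text line by line in a plain loop, so stack depth does not grow with input size. The names `IterativeBlockScan` and `firstBlock` and the shortened tag list are hypothetical, and the degenerate case where the closing tag follows the tag name on the same line is deliberately ignored.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IterativeBlockScan {
    // Shortened tag list, for illustration only.
    static final String TAGS = "p|div|h[1-6]|blockquote|pre|table";

    // Only the opening tag is matched by regex; no nested quantifiers.
    static final Pattern OPEN_TAG =
            Pattern.compile("^<(" + TAGS + ")\\b", Pattern.MULTILINE);

    /** Returns the first block, from opening tag to closing tag, or null. */
    static String firstBlock(String text) {
        Matcher open = OPEN_TAG.matcher(text);
        while (open.find()) {
            String close = "</" + open.group(1) + ">";
            int newline = text.indexOf('\n', open.end());
            while (newline >= 0) {
                int lineStart = newline + 1;
                if (text.startsWith(close, lineStart)) {
                    int end = lineStart + close.length();
                    // Skip trailing spaces and tabs after the closing tag.
                    while (end < text.length()
                            && (text.charAt(end) == ' ' || text.charAt(end) == '\t')) {
                        end++;
                    }
                    // Accept only if a newline or the end of input follows.
                    if (end == text.length() || text.charAt(end) == '\n') {
                        return text.substring(open.start(), end);
                    }
                }
                newline = text.indexOf('\n', lineStart);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstBlock("intro\n<div>\nhello\n</div>\n\nafter"));
    }
}
```

The trade-off is that the matching rules now live in code rather than in one pattern, so any change to the Markdown block rules has to be mirrored here by hand.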

PS: Sorry for necromancing an old topic, but this thread is ranked highly on Google and it would benefit people if I put it here.

Answer 3

The sub-expression "(.*\\n)*?" (and the improved version from the accepted answer, "(?>.*\\n)*?") both have a problem: they fail to match a block element written on one line. In other words, they fail to match this:

<div>one-liner</div>

If this is not the desired behavior, a correct (and much more efficient) solution is to simply use:

.*?

and turn on single-line mode (Pattern.DOTALL in Java, or the inline flag (?s)), so that . also matches newlines.
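A minimal sketch of that suggestion in Java follows, assuming the rest of the pattern from the question is kept as is. The class `DotAllDemo`, the helper `firstBlock`, and the shortened tag list are hypothetical names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotAllDemo {
    // Shortened tag list, for illustration only.
    static final String TAGS = "p|div|h[1-6]|blockquote|pre|table";

    // MULTILINE keeps ^ anchored to line starts; DOTALL lets . cross newlines,
    // so a single lazy .*? replaces the nested (.*\n)*? group.
    static final Pattern BLOCK = Pattern.compile(
            "(^<(" + TAGS + ")\\b.*?</\\2>[ \\t]*(?=\\n+|\\Z))",
            Pattern.MULTILINE | Pattern.DOTALL);

    /** Returns the first matched block, or null if there is none. */
    static String firstBlock(String text) {
        Matcher m = BLOCK.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstBlock("<div>one-liner</div>"));
        System.out.println(firstBlock("<div>\nhello\n</div>\n\nafter"));
    }
}
```

Note that this changes the semantics slightly: the closing tag no longer has to sit at the start of a line, only to be followed by optional spaces or tabs and then a newline or the end of input.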

Answer 1: score 16, accepted, 2010-01-04 05:14:59

Answer 2: score 2, 2011-04-22 14:32:04

Answer 3: score 0, 2011-04-22 16:01:08
