
Java regex dies on stack overflow: need a better version

I'm working on JMD (Java MarkDown), a Java port of MarkDownSharp, but I'm having an issue with one regex in particular. For the file Markdown_Documentation_Syntax.text, this regular expression dies:

private static final String BLOCK_TAGS_1 = "p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del";
private static final String BLOCKS_NESTED_PATTERN = String.format("" +
        "(" +                      // save in $1
        "^" +                      // start of line (with MULTILINE)
        "<(%s)" +                  // start tag = $2
        "\\b" +                    // word break
        "(.*\\n)*?" +              // any number of lines, minimally matching
        "</\\2>" +                 // the matching end tag
        "[ \\t]*" +                // trailing spaces/tabs
        "(?=\\n+|\\Z)" +           // followed by a newline or end of input
        ")", BLOCK_TAGS_1);

which translates to:

(^<(p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del)\b(.*\n)*?</\2>[ \t]*(?=\n+|\Z))

This pattern looks for accepted block tags anchored to the start of a line, followed by any number of lines, and terminated by a matching end tag followed by a newline or the end of the string. This generates:

java.lang.StackOverflowError
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
        ...

This can be dealt with by increasing the stack space for Java (defaults to 128k/400k for oss/ss IIRC), but the above expression is slow anyway.
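As a rough illustration of the stack-space workaround without touching JVM-wide flags, the match can be run on a worker thread created with a larger requested stack. This is only a sketch under assumptions: the class and method names (BigStackMatch, firstBlock) are made up for this example, the tag list is shortened, the lambda needs Java 8 or later (not the JDK 6 mentioned in this question), and the JVM is allowed to treat the requested stack size as a hint.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BigStackMatch {
    // Same shape as the pattern in the question; tag list shortened for brevity.
    static final Pattern BLOCK = Pattern.compile(
            "(^<(p|div|pre)\\b(.*\\n)*?</\\2>[ \\t]*(?=\\n+|\\Z))",
            Pattern.MULTILINE);

    /** Runs the match on a worker thread that requests stackBytes of stack. */
    static String firstBlock(final String text, long stackBytes) {
        final String[] result = new String[1];
        final Throwable[] failure = new Throwable[1];
        Thread worker = new Thread(null, () -> {
            try {
                Matcher m = BLOCK.matcher(text);
                result[0] = m.find() ? m.group(1) : null;
            } catch (StackOverflowError e) {
                failure[0] = e;  // the requested stack was still too small
            }
        }, "regex-worker", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while matching", e);
        }
        if (failure[0] != null) {
            throw new IllegalStateException("stack still too small", failure[0]);
        }
        return result[0];
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("<div>\n");
        for (int i = 0; i < 2000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        sb.append("</div>\n");
        String block = firstBlock(sb.toString(), 64L * 1024 * 1024);
        System.out.println(block == null ? "no match" : "matched " + block.length() + " chars");
    }
}
```

This only moves the limit: the recursion depth still grows with the number of lines inside the block, so a large enough input can overflow any fixed stack.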

So I'm looking for a regex guru who can do better (or at least explain the performance problem with this pattern). The C# version is a little slow but works fine. PHP seems to have no issues with this either.

Edit: This is on JDK6u17 running on Windows 7 64 Ultimate.

This part:

(.*\n)*?

will involve A LOT of unnecessary backtracking because of the nested * and because there are characters that have to match afterwards.

I just ran a quick benchmark in Perl on some arbitrary strings and got a 13-15% improvement just by switching that piece to

(?>.*\n)*?

which does non-capturing, independent subgrouping. That gives you two benefits: it no longer wastes time capturing the matching string and, more importantly, it no longer backtracks into the innermost .*, which is a waste of time anyway. There's no way that only a portion of that .* will ever result in a valid match, so explicitly making it all or nothing should help.
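For reference, here is roughly how that change might look when dropped into the Java pattern from the question. It is a sketch, not a measured fix: the class name AtomicBlockPattern and the helper firstBlock are invented for the example, the tag list is shortened, and the Perl timing above has not been reproduced on Java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AtomicBlockPattern {
    // Shortened tag list; the question uses a longer one.
    static final String BLOCK_TAGS = "p|div|h[1-6]|blockquote|pre|table";

    static final Pattern BLOCKS_NESTED = Pattern.compile(String.format("" +
            "(" +                  // save in $1
            "^" +                  // start of line (with MULTILINE)
            "<(%s)" +              // start tag = $2
            "\\b" +                // word break
            "(?>.*\\n)*?" +        // whole lines, atomic: no backtracking into .*
            "</\\2>" +             // the matching end tag
            "[ \\t]*" +            // trailing spaces/tabs
            "(?=\\n+|\\Z)" +       // followed by a newline or end of input
            ")", BLOCK_TAGS), Pattern.MULTILINE);

    /** Returns the first block-level element found, or null if there is none. */
    static String firstBlock(String text) {
        Matcher m = BLOCKS_NESTED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "intro\n<div>\nhello\n</div>\n\nafter\n";
        System.out.println(firstBlock(text));
    }
}
```

Because (?>...) does not capture, the group numbering is unchanged and \2 still refers to the tag name. The lazy loop around the atomic group is still evaluated once per line, so this reduces backtracking work but does not by itself bound the recursion depth.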

I don't know if that's a sufficient improvement in this case, however.

While improving the pattern does help and is advisable, Java's pattern matcher is recursive, and it is generally best to switch to an iterative solution.

When I had similar problems, I switched to jregex ( http://jregex.sourceforge.net/ ) and that worked for me.

The pattern match may succeed now with the improved pattern, but it may still fail if a text 10 times as big is given.
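If pulling in another regex library is not an option, one alternative is to do the per-line loop by hand so that the stack depth no longer depends on the input size. To be clear, this is a different technique from jregex and not what was used above; it is a minimal sketch that assumes '\n' line endings, a shortened tag list, and the same rule as the original pattern (the closing tag must sit at the start of its own line). The names LineScanBlocks and firstBlock are invented for the example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineScanBlocks {
    // Matches an accepted opening tag at the very start of a single line.
    static final Pattern OPEN =
            Pattern.compile("<(p|div|h[1-6]|blockquote|pre|table)\\b");

    /** Returns the first block (opening line through closing line), or null. */
    static String firstBlock(String text) {
        String[] lines = text.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            Matcher open = OPEN.matcher(lines[i]);
            if (!open.lookingAt()) {
                continue;  // this line does not start with a block tag
            }
            // The closing line is "</tag>" plus optional trailing spaces/tabs.
            Pattern close = Pattern.compile(
                    "</" + Pattern.quote(open.group(1)) + ">[ \\t]*");
            for (int j = i + 1; j < lines.length; j++) {
                if (close.matcher(lines[j]).matches()) {
                    return String.join("\n", Arrays.asList(lines).subList(i, j + 1));
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("<div>\n");
        for (int i = 0; i < 100000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        sb.append("</div>\n");
        String block = firstBlock(sb.toString());
        System.out.println(block == null ? "no match" : "matched " + block.length() + " chars");
    }
}
```

Each regex here only ever sees one line, so the recursion inside java.util.regex stays shallow no matter how many lines the block spans.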

PS: Sorry for necromancing an old topic, but this thread is ranked highly on Google and it would benefit people if I put it here.

The sub-expression "(.*\\n)*?" (and the improved version from the accepted answer, "(?>.*\\n)*?") both have a problem: they fail to match a block element written on one line. In other words, they fail to match this:

<div>one-liner</div>

If this is not the desired behavior, a correct (and much more efficient) solution is to simply use:

.*?

And turn on single-line mode (Pattern.DOTALL in Java, or the inline flag (?s)).
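A small comparison of the two variants, as a sketch: the class name SingleLineMode and the helper firstBlock are invented for the example, and the tag list is shortened.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SingleLineMode {
    // Line-based variant from the accepted answer.
    static final Pattern LINE_BASED = Pattern.compile(
            "(^<(p|div|pre)\\b(?>.*\\n)*?</\\2>[ \\t]*(?=\\n+|\\Z))",
            Pattern.MULTILINE);

    // Lazy .*? with DOTALL, so the dot also matches newlines.
    static final Pattern DOTALL_LAZY = Pattern.compile(
            "(^<(p|div|pre)\\b.*?</\\2>[ \\t]*(?=\\n+|\\Z))",
            Pattern.MULTILINE | Pattern.DOTALL);

    /** Returns the first block matched by the given pattern, or null. */
    static String firstBlock(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String oneLiner = "<div>one-liner</div>\n";
        String multiLine = "<div>\nhello\n</div>\n";
        System.out.println(firstBlock(LINE_BASED, oneLiner));
        System.out.println(firstBlock(DOTALL_LAZY, oneLiner));
        System.out.println(firstBlock(DOTALL_LAZY, multiLine));
    }
}
```

One behavioral difference to keep in mind: with .*? the closing tag no longer has to start at the beginning of a line, so this version accepts some inputs that the line-based pattern rejects.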
