Pattern.matches() gives StackOverflowError

I'm using Java's Pattern.matches() to match a block of data against a regex. The block of data can be a single line or multiple lines. The problem is that once my data grows beyond about 15 lines (typically 17-18 lines or more), I start getting a StackOverflowError. For data shorter than 15 lines the regex works fine.

The regex has this format:
domainname -> space -> , -> space -> number -> space -> , -> space -> number -> newline

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";

The data block I use to test against this regex is this:

abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456

This is the code:

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
boolean valid = Pattern.matches(regex, data); //fails here

I can't tell you the reason for this error; the regex itself is fine and not subject to catastrophic backtracking or any other obvious error.

Perhaps you can reduce the number of backtracking positions the regex engine saves by using possessive quantifiers ( ++ instead of + , *+ instead of * , {2,}+ instead of {2,} etc.). Also, you don't need the capturing groups (thanks Thomas), so I've changed them into non-capturing ones:

"(?:(?:[a-zA-Z0-9][a-zA-Z0-9-]*+\\.)++(?:[a-zA-Z]{2,}+)\\s*+,\\s*+\\d++\\s*+,\\s*+\\d++(?:\\r?+\\n)?+)++"

This won't change the behaviour of the regex (except for the removal of the unnecessary anchors, since you're using Pattern.matches() ), but perhaps it helps avoid the StackOverflowError. I don't have a Java SDK installed, though, so I can't test it myself.
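Since the possessive variant above was posted untested, here is a minimal sketch for checking it against the original on a small block in the question's format. The class name PossessiveCheck and its helper methods are made up for illustration. Agreement on these two inputs does not prove the patterns are equivalent in general, and it says nothing about where the stack limit lies.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answers.
class PossessiveCheck {
    // The regex from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
        "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$");

    // Possessive, non-capturing variant without the anchors.
    static final Pattern POSSESSIVE = Pattern.compile(
        "(?:(?:[a-zA-Z0-9][a-zA-Z0-9-]*+\\.)++[a-zA-Z]{2,}+\\s*+,\\s*+\\d++\\s*+,\\s*+\\d++(?:\\r?+\\n)?+)++");

    // matches() requires the whole input to match, so ^ and $ add nothing here.
    static boolean matchesOriginal(String data) {
        return ORIGINAL.matcher(data).matches();
    }

    static boolean matchesPossessive(String data) {
        return POSSESSIVE.matcher(data).matches();
    }

    // Builds a block of identical records like the sample data in the question.
    static String block(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("abc.com, 123, 456\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String good = block(5);
        String bad = good + "abc.com, 123\n"; // last record lacks the second number
        System.out.println("good: " + matchesOriginal(good) + " / " + matchesPossessive(good));
        System.out.println("bad:  " + matchesOriginal(bad) + " / " + matchesPossessive(bad));
    }
}
```

Both patterns should agree on each input. Calling block() with a larger line count is one way to see whether the possessive form raises the limit on a particular JVM.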

You might try using atomic groups ( (?>expression) ) to prevent backtracking:

Here's a test that failed with a block of 1000 lines using your regex but succeeds now (it takes a while, so I only tested with 20000 lines :) ):

String regex = "(?>(?>[a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+(?>[a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(?>\\r?\\n)?)+";

StringBuilder input = new StringBuilder();

for( int i = 0; i < 1000000; ++i) {
  input.append("abc.com, 123, 456\n");
}

Pattern p = Pattern.compile( regex );
Matcher m = p.matcher( input );

System.out.println(m.matches());

So after all, it might still be a backtracking problem.

Update: I just let that test run with 20000 lines and it still didn't fail. That's at least 20 times as much as before. :)

Update 2: looking at my test again, I found the slow part: the string concatenation (o..O). I've updated the test and used 1 million lines, still no failure. :)

The problem is that your regex is too complicated. Each line of input that you process results in (I think) 10 backtrack points, and at least some of these seem to be handled by the regex engine recursing. That could be a few hundred stack frames, which would be enough to give you a StackOverflowError.

IMO, you need to modify the pattern so that it matches one group / line of data. Then call Matcher.find repeatedly to parse each line. I expect you will find that this is faster.
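A minimal sketch of that line-at-a-time approach, assuming the per-record syntax from the question. The class name LineByLineValidator and the method isValid are invented for illustration. Because each record is matched on its own, no backtrack state is carried from one line to the next, but the result can differ from the single-regex version in edge cases where backtracking across record boundaries would have mattered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answers.
class LineByLineValidator {
    // One record: the body of the question's regex without the outer (...)+ loop.
    private static final Pattern LINE = Pattern.compile(
        "(?:[a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+[a-zA-Z]{2,}\\s*,\\s*\\d+\\s*,\\s*\\d+(?:\\r?\\n)?");

    // Returns true if data is one or more records with nothing else between or after them.
    static boolean isValid(String data) {
        Matcher m = LINE.matcher(data);
        int expectedStart = 0;
        while (m.find()) {
            if (m.start() != expectedStart) {
                return false; // unmatched text before this record
            }
            expectedStart = m.end();
        }
        // At least one record was found and the last one ended at the end of the input.
        return expectedStart > 0 && expectedStart == data.length();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            sb.append("abc.com, 123, 456\n");
        }
        System.out.println(isValid(sb.toString()));
    }
}
```

The check on m.start() is what replaces the implicit whole-input requirement of Pattern.matches(): find() on its own would silently skip over garbage between records.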


Optimizing the regex in other ways while still trying to match the entire block in one go probably won't work. You may be able to get it to match N times more lines of data, but as you increase the number of lines in the input you are likely to run into the same problem again.

And even if you do get it to work as a multi-line regex, there is a chance that it won't work with other implementations of the Java regex libraries, e.g. in older Oracle JREs or non-Oracle implementations.


I agree with the other answers that this is not an example of "catastrophic backtracking". Rather, it is an interaction between the way the regex engine handles backtrack points and the fact that there are simply too many of them when you give it multiple lines of input.

I've reproduced this problem, but only for much larger strings.

$ java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.2)    (6b22-1.10.2-0ubuntu1~11.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

My test code:

public class Testje
{
    public static void main(String... args)
    {
        String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
        String data = "";
        for (int i = 0; i<224; i++) data += "abc.com, 123, 456\n";
        System.out.println(data.matches(regex));
    }
}

For anything smaller than 224 in that for loop, the code runs fine. For 224 or more copies of that line, I get a huge stack trace.

Oh, note that using non-capturing (?: ) groups does not change the maximum size of the string that still works.
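Rather than editing the loop bound by hand, the threshold can be searched for automatically. This is only a sketch: the class name OverflowProbe and its methods are invented for illustration, and the number it reports depends on the JVM version, the thread stack size and JIT state, so it may differ from 224, vary between runs, or not be reached at all below the cap.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answers.
class OverflowProbe {
    // The regex from the question, unchanged.
    static final String REGEX =
        "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";

    // Returns true if matching a block of the given number of lines overflows the stack.
    static boolean overflows(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("abc.com, 123, 456\n");
        }
        try {
            Pattern.matches(REGEX, sb);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    // Doubles the line count until an overflow occurs, then bisects.
    // Returns -1 if no tested size up to cap overflowed.
    static int firstOverflow(int cap) {
        int lo = 1;
        int hi = 1;
        while (hi <= cap && !overflows(hi)) {
            lo = hi;
            hi *= 2;
        }
        if (hi > cap) {
            return -1;
        }
        // Here hi overflowed; lo did not (unless even a single line overflowed).
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            if (overflows(mid)) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        System.out.println("first overflowing line count: " + firstOverflow(1 << 16));
    }
}
```

The bisection assumes the overflow is roughly monotonic in the line count, which is an approximation; treat the result as a ballpark figure for one JVM configuration.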
