My regex is causing a stack overflow in Java; what am I missing?

I am attempting to use a regular expression with Scanner to match a string from a file. The regex works with all of the contents of the file except for this line:

DNA="ITTTAITATIATYAAAYIYI[....]ITYTYITTIYAIAIYIT"

In the actual file, the ellipsis represents several thousand more characters.

When the loop that reads the file reaches the line containing the bases, a stack overflow error occurs.

Here is the loop:

while (scanFile.hasNextLine()) {
   final String currentLine = scanFile.findInLine(".*");
   System.out.println("trying to match '" + currentLine + "'");
   Scanner internalScanner = new Scanner(currentLine);
   String matchResult = internalScanner.findInLine(Constants.ANIMAL_INFO_REGEX);
   assert matchResult != null : "there's no reason not to find a match"; 
   matches.put(internalScanner.match().group(1), internalScanner.match().group(2));
   scanFile.nextLine();
  }

and the regex:

static final String ANIMAL_INFO_REGEX = "([a-zA-Z]+) *= *\"(([a-zA-Z_.]| |\\.)+)";

Here's the failure trace:

java.lang.StackOverflowError
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3360)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
    ...etc (it's all regex).

Thanks so much!

This looks like bug 5050507. I agree with Asaph that removing the alternation should help; the bug specifically says "Avoid alternation whenever possible". I think you can probably go even simpler:

"^([a-zA-Z]+) *= *\"([^\"]+)"
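A minimal sketch of how one might check this pattern against a very long value. The class name `LongLineDemo`, the helper `buildLine`, and the synthetic 200,000-character input are assumptions made for illustration; the contents of the real file are not known. The idea is that a single negated character class under `+` is consumed in a loop rather than through one nested call per character, so a long line should not exhaust the stack.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongLineDemo {
    // The simplified pattern suggested above.
    static final Pattern SIMPLE = Pattern.compile("^([a-zA-Z]+) *= *\"([^\"]+)");

    // Builds a synthetic line such as DNA="ITAYITAY..." with the given value length.
    // (Hypothetical test data, not taken from the original file.)
    static String buildLine(int valueLength) {
        StringBuilder sb = new StringBuilder("DNA=\"");
        String bases = "ITAY";
        for (int i = 0; i < valueLength; i++) {
            sb.append(bases.charAt(i % bases.length()));
        }
        return sb.append('"').toString();
    }

    // Returns the length of the captured value, or -1 if the line does not match.
    static int matchedValueLength(String line) {
        Matcher m = SIMPLE.matcher(line);
        return m.find() ? m.group(2).length() : -1;
    }

    public static void main(String[] args) {
        System.out.println(matchedValueLength(buildLine(200_000)));
    }
}
```

Note that this pattern has no closing-quote check, so `find()` is used here instead of `matches()`.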

Try this simplified version of your regex that removes some unnecessary | operators (which might have been causing the regex engine to do a lot of branching) and includes beginning and end of line anchors.

static final String ANIMAL_INFO_REGEX = "^([a-zA-Z]+) *= *\"([a-zA-Z_. ]+)\"$";
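As a rough check that folding the alternatives into one character class does not change what gets captured, the sketch below runs both patterns on a short line. The class name `AlternationVsClassDemo`, the helper `valueOf`, and the sample line are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationVsClassDemo {
    // The question's original pattern: a repeated group containing alternation.
    static final Pattern ORIGINAL =
        Pattern.compile("([a-zA-Z]+) *= *\"(([a-zA-Z_.]| |\\.)+)");
    // The simplified pattern: one character class, anchored at both ends.
    static final Pattern SIMPLIFIED =
        Pattern.compile("^([a-zA-Z]+) *= *\"([a-zA-Z_. ]+)\"$");

    // Returns the captured value (group 2), or null if the pattern is not found.
    static String valueOf(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String line = "name = \"Fido the dog\"";
        System.out.println(valueOf(ORIGINAL, line));
        System.out.println(valueOf(SIMPLIFIED, line));
    }
}
```

On short input both patterns capture the same value; the difference only shows up in how much work (and stack) the engine needs on long input.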

Read this to understand the problem: http://www.regular-expressions.info/catastrophic.html ...then use one of the other suggestions.

As the others have said, your regex is much less efficient than it should be. I'd take it a step further and use possessive quantifiers:

"^([a-zA-Z]++) *+= *+\"([^\"]++)\"$"
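To illustrate what a possessive quantifier changes, here is a small sketch. The class and method names are hypothetical. A possessive quantifier keeps everything it has consumed and never backtracks into it, which is why `a*+a` cannot match a string of only `a` characters while the greedy `a*a` can.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: a* first takes every 'a', then gives one back so the final 'a' can match.
    static boolean greedyMatches(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive: a*+ takes every 'a' and never gives any back,
    // so nothing is left for the final 'a'.
    static boolean possessiveMatches(String s) {
        return Pattern.matches("a*+a", s);
    }

    // The possessive pattern from above, applied to a whole line.
    static boolean lineMatches(String line) {
        return Pattern.matches("^([a-zA-Z]++) *+= *+\"([^\"]++)\"$", line);
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("aaa"));
        System.out.println(possessiveMatches("aaa"));
        System.out.println(lineMatches("DNA=\"ITTTAITAT\""));
        System.out.println(lineMatches("DNA=\"ITTTAITAT"));
    }
}
```

In the line pattern, a missing closing quote makes the match fail at once instead of after retrying shorter values.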

But the way you're using the Scanner doesn't make much sense, either. There's no need to use findInLine(".*") to read the line; that's what nextLine() does. And you don't need to create another Scanner to apply your regex; just use a Matcher.

static final Pattern ANIMAL_INFO_PATTERN = 
    Pattern.compile("^([a-zA-Z]++) *+= *+\"([^\"]++)\"$");

...

  Matcher lineMatcher = ANIMAL_INFO_PATTERN.matcher("");
  while (scanFile.hasNextLine()) {
    String currentLine = scanFile.nextLine();
    if (lineMatcher.reset(currentLine).matches()) {
      matches.put(lineMatcher.group(1), lineMatcher.group(2));
    }
  }
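For anyone who wants to run the loop above as-is, here is one way it could be wrapped into a complete class. The class name `AnimalInfoParser`, the `parse` method, the use of a `LinkedHashMap`, and the in-memory sample input are all assumptions; the original post does not show how `matches` or `scanFile` are declared.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnimalInfoParser {
    static final Pattern ANIMAL_INFO_PATTERN =
        Pattern.compile("^([a-zA-Z]++) *+= *+\"([^\"]++)\"$");

    // Reads key="value" lines from the scanner; lines that do not match are skipped.
    static Map<String, String> parse(Scanner scanFile) {
        Map<String, String> matches = new LinkedHashMap<>();
        // One Matcher is created up front and reset for each line.
        Matcher lineMatcher = ANIMAL_INFO_PATTERN.matcher("");
        while (scanFile.hasNextLine()) {
            String currentLine = scanFile.nextLine();
            if (lineMatcher.reset(currentLine).matches()) {
                matches.put(lineMatcher.group(1), lineMatcher.group(2));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the real file.
        String sample = "name=\"Fido\"\nDNA = \"ITTTAITAT\"\nthis line is skipped\n";
        try (Scanner scanFile = new Scanner(sample)) {
            System.out.println(parse(scanFile));
        }
    }
}
```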
