
Stack overflow when trying to use regex in Java

I have read several articles on how to optimize regular expressions, but none of the suggestions (fewer groups, using {X,Y} instead of *) stopped my regex from causing a stack overflow error.

I am trying to do a dynamic search through a file. Let's say I am searching for 'i bet you cannot find me' in a fairly large file (2-4 MB). My regex generator would generate the regex:

i(?:.|\s)*?bet(?:.|\s)*?you(?:.|\s)*?cannot(?:.|\s)*?find(?:.|\s)*?me

The idea behind this regex is that it finds the phrase no matter what characters or whitespace come between the words. However, when I try to use:

Pattern p = Pattern.compile(generatedRegex, Pattern.MULTILINE);
Matcher m = p.matcher(fileContentsAsString);
while (m.find()) {
    System.out.println(m.group());
}

I am getting a stack overflow error. I know that regex engines use recursion, but this doesn't seem like that bad of a regex. Is there any way I can optimize it? Thanks!

ANSWER:

Pattern p = Pattern.compile("i(?:.*)bet(?:.*)you(?:.*)cannot(?:.*)find(?:.*?)me", Pattern.DOTALL);

is the pattern/regex that I am ultimately using. It seems fast and no longer throws a stack overflow exception.
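The switch from Pattern.MULTILINE to Pattern.DOTALL matters here: MULTILINE only changes how ^ and $ behave, while DOTALL lets . match line terminators, which is what makes the (?:.|\s) alternation unnecessary. A minimal sketch of the difference (the class and helper names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical helper: returns the first match of regex in input
    // under the given flags, or null if nothing matches.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "i\nbet";
        // No flags: '.' does not match '\n', so the two words cannot be bridged.
        System.out.println(firstMatch("i.*bet", 0, text) == null);                 // true
        // MULTILINE affects only ^ and $, so '.' still stops at '\n'.
        System.out.println(firstMatch("i.*bet", Pattern.MULTILINE, text) == null); // true
        // DOTALL: '.' matches any character, including '\n'.
        System.out.println(firstMatch("i.*bet", Pattern.DOTALL, text) != null);    // true
    }
}
```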

I think you are getting a lot of backtracking because of your reluctant quantifiers (*?). One way to prevent backtracking is to use atomic grouping (?>X) and/or possessive quantifiers (*+).

According to the comments, you also want to capture only the "i" that is nearest to "bet", to reduce the length of the overall match. Since you want the 'i' closest to the rest of the words, in the place where I added a negative lookahead for word two you would also put a negative lookahead for word one, right beside it. In other words, (?!bet) would become (?!i)(?!bet) or (?!i|bet). I have edited the code below to include this requirement.

String fileContentsAsString = "ii ... bet ... you, ibetyouyou";
String regex = "i(?>(?!i|bet).)*+bet(?>(?!you).)*+you";
Pattern p = Pattern.compile(regex, Pattern.DOTALL);
Matcher m = p.matcher(fileContentsAsString);
while (m.find()) {
    System.out.println(m.group());
}

Output:

i ... bet ... you
ibetyou
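Since the question builds its regex from an arbitrary phrase, the same pattern shape can be produced by a small generator. The sketch below is one possible way to do that, not the asker's actual generator: the class and method names are hypothetical, it assumes a non-empty word list, and it wraps each word in Pattern.quote so that regex metacharacters in the search phrase are treated literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseRegex {
    // Hypothetical generator: builds w1(?>(?!w1|w2).)*+w2(?>(?!w3).)*+w3 ...
    // Assumes words contains at least one element.
    static String build(List<String> words) {
        String first = Pattern.quote(words.get(0));
        StringBuilder sb = new StringBuilder(first);
        for (int i = 1; i < words.size(); i++) {
            String next = Pattern.quote(words.get(i));
            // Only the first gap also refuses to step over another copy of
            // word one, so the match starts at the occurrence nearest word two.
            String stop = (i == 1) ? first + "|" + next : next;
            sb.append("(?>(?!").append(stop).append(").)*+").append(next);
        }
        return sb.toString();
    }

    // Collects every match of the generated pattern in text.
    static List<String> findAll(List<String> words, String text) {
        Matcher m = Pattern.compile(build(words), Pattern.DOTALL).matcher(text);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("i", "bet", "you");
        // Prints [i ... bet ... you, ibetyou]
        System.out.println(findAll(words, "ii ... bet ... you, ibetyouyou"));
    }
}
```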

Explanation (source):

"The way a reluctant quantifier works is, each time it's supposed to try to match, it first tries to let the next part of the regex match instead. So it's effectively doing a lookahead at the beginning of each iteration, which can get pretty expensive, especially when the quantified part only matches one character per iteration, like .*?"
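The quote is about cost, but the two quantifier kinds also differ in what they match, which is worth keeping in mind if a reluctant *? is replaced by a greedy *, as in the pattern the asker settled on. A small sketch with made-up sample text and hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: collects every match of regex in text (DOTALL mode).
    static List<String> findAll(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "i bet you. i bet you";
        // Reluctant: stops at the first "you" it can reach, so both
        // occurrences of the phrase are reported separately.
        System.out.println(findAll("i.*?you", text)); // [i bet you, i bet you]
        // Greedy: runs to the end and backs up to the last "you",
        // so the whole text comes back as a single match.
        System.out.println(findAll("i.*you", text));  // [i bet you. i bet you]
    }
}
```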
