
Why does this regex take a long time to execute?

I found that, for example, this line takes a very long time to execute:

System.out.println(
        ".. .. .. .. .. .. .. .. ..  .. .. .. .. .. .. .. .. .. .. .. .... .. .."
        .matches("(?i)(?:.* )?\\W?([a-z0-9-_\\.]+((?: *)\\.(?: *))+(?:DE))(?:[0-9]{1,5})?")
);

If I reduce the number of dots at the start of the String, the execution time goes down (the growth seems to be exponential). Here is the suspended thread's stack trace:

[Repeating text]...
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.match(Matcher, int, CharSequence) line: 4785
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.match(Matcher, int, CharSequence) line: 4785
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.matchInit(Matcher, int, CharSequence) line: 4801
Pattern$Prolog.match(Matcher, int, CharSequence) line: 4741
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Ques.match(Matcher, int, CharSequence) line: 4182
Pattern$BranchConn.match(Matcher, int, CharSequence) line: 4568
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Branch.match(Matcher, int, CharSequence) line: 4604
Matcher.match(int, int) line: 1270
Matcher.matches() line: 604
Pattern.matches(String, CharSequence) line: 1135
String.matches(String) line: 2121
Main.main(String[]) line: 11

Why does this happen?

When a pattern x is made optional, using the ? or * quantifiers (or {0,} ), the engine has two paths to take, depending on the nature of the quantifier used:

  • Consume first, then backtrack for the other patterns (the greedy case, i.e. .* , .? )
  • First consume nothing and look immediately for the other patterns (the lazy case, .*? )
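
The difference between the two paths is easiest to see on a small input. The following is a minimal sketch, not part of the original answer; the class name GreedyVsLazy and the helper firstMatch are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Hypothetical helper: returns the first substring of input matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: .* first consumes to the end of the input, then backtracks to the last '>'.
        System.out.println(firstMatch("<.*>", input));   // prints <a><b>
        // Lazy: .*? first consumes nothing and grows one character at a time until '>' fits.
        System.out.println(firstMatch("<.*?>", input));  // prints <a>
    }
}
```

Both variants may have to try many positions before giving up when no match exists; laziness changes the order in which the engine explores, not the size of the search space.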

Someone who is not familiar with regular expressions, or does not care about performance, may throw .* in wherever a match is needed somewhere in the string. Engines are so fast at stepping back and forth that nothing seems odd or slow unless the pattern cannot be found.

Time complexity starts at O(n) and continues with O(n^2b), where b is the nesting level of the quantifiers. So on failure, the number of steps the engine takes is huge.
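
One way to observe this growth without waiting on the full input is to count how often the engine reads a character. The sketch below wraps the input in a CharSequence that counts charAt calls and runs the pattern from the question on shortened inputs. The names StepCounter, CountingSequence, dots and reads are hypothetical, the read count is only a rough proxy for engine steps, and the exact numbers depend on the JDK version, so no expected output is shown.

```java
import java.util.Collections;
import java.util.regex.Pattern;

public class StepCounter {
    // Hypothetical helper: a CharSequence that counts every character read by the matcher.
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;
        CountingSequence(String text) { this.text = text; }
        @Override public int length() { return text.length(); }
        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) { return text.subSequence(start, end); }
        @Override public String toString() { return text; }
    }

    // The pattern from the question, copied unchanged.
    static final Pattern QUESTION_PATTERN = Pattern.compile(
            "(?i)(?:.* )?\\W?([a-z0-9-_\\.]+((?: *)\\.(?: *))+(?:DE))(?:[0-9]{1,5})?");

    // Builds "..", ".. ..", ".. .. ..", and so on.
    static String dots(int groups) {
        return String.join(" ", Collections.nCopies(groups, ".."));
    }

    // Runs a full-match attempt (which fails, since the input has no "DE") and returns the read count.
    static long reads(int groups) {
        CountingSequence seq = new CountingSequence(dots(groups));
        QUESTION_PATTERN.matcher(seq).matches();
        return seq.reads;
    }

    public static void main(String[] args) {
        // Inputs are kept short on purpose so the program finishes quickly.
        for (int groups = 2; groups <= 8; groups++) {
            System.out.println(groups + " groups, " + dots(groups).length()
                    + " chars: " + reads(groups) + " character reads");
        }
    }
}
```

Comparing the read count with the input length in each row gives a sense of how much of the work is backtracking rather than a single pass over the string.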

To avoid such situations, consider some guiding principles:

  • Specify boundaries. If the pattern should stop somewhere before digits, do not use .* . Use \\D* instead.

  • Use conditions. You can check whether a pattern or letter x exists before running the whole match, using a lookahead such as ^(?=[^x]*x) . This leads to an early failure.

  • Use possessive quantifiers or atomic groups (if available). Both avoid backtracking. Sometimes you do not need backtracking.

  • Do not use (.*)+ or similar patterns. Instead, reconsider your requirements, or at least use an atomic group: (?>.*)+ .
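
The first three principles can be sketched in Java as follows. This is an illustration under assumed inputs, not a rewrite of the pattern from the question; the class name GuidelineSketch and its helper methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GuidelineSketch {
    // Boundaries: \D* cannot run past the first digit, so it never has to hand digits back.
    static String nonDigitPrefix(String input) {
        Matcher m = Pattern.compile("(\\D*)(\\d+)").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Condition: the lookahead rejects inputs without an 'x' before the rest of the pattern runs.
    static boolean containsX(String input) {
        return Pattern.compile("^(?=[^x]*x).*").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(nonDigitPrefix("abc123"));      // prints abc
        System.out.println(containsX("no such letter"));   // prints false
        System.out.println(containsX("has an x"));         // prints true

        // Possessive quantifiers and atomic groups never give characters back.
        System.out.println("aaa".matches("a*a"));      // prints true: a* backtracks by one 'a'
        System.out.println("aaa".matches("a*+a"));     // prints false: a*+ keeps all three
        System.out.println("aaa".matches("(?>a*)a"));  // prints false: same effect with an atomic group
    }
}
```

The last two lines show the trade-off: giving up backtracking makes failure cheap, but the pattern has to be written so that it does not rely on the quantifier handing characters back.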

Your own regular expression is no exception. It suffers from a lot of greediness and optional matches, and it needs some time to be reworked.

