
Java, poor regex performance with lazy expressions

The code is actually in Scala (Spark/Scala), but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.

The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. 从本质上讲,该代码从配置文件中读取一堆正则表达式,然后将它们与馈入Spark / Scala应用程序的日志进行匹配。 Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). 一切正常,直到我添加了一个正则表达式以提取由制表符分隔的字符串,其中制表符被展平为“#011”(通过rsyslog)。 Since the strings can have white-spaces, my regex looks like: (.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?) 由于字符串可以有空格,因此我的正则表达式看起来像: (.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)

The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of the delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!

In my code, I have tried a couple of ways to match the regex:

  1. if ( (regex findFirstIn log).nonEmpty ) { do something }

  2. val allGroups = regex.findAllIn(log).matchData.toList; if (allGroups.nonEmpty) { do something }

  3. if (regex.pattern.matcher(log).matches()){do something}

All three suffer from poor performance when the regex mentioned above is added to the list of regexes. Any suggestions to improve regex performance or change the regex itself?

The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.

I tried negative lookahead, but I can't figure out how to negate a string. Instead of /(.+?)#011/ I want something like /([^#011]+)/, but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.
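To make the character-class behaviour concrete, here is a minimal Java sketch. The class name, method name and sample input are made up for illustration; only the pattern [^#011]+ comes from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharClassDemo {
    // Returns the first run of characters accepted by [^#011]+, or null if none.
    // The class excludes the single characters '#', '0' and '1',
    // not the four-character string "#011".
    static String firstRun(String input) {
        Matcher m = Pattern.compile("[^#011]+").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The field "port 8010" contains '0' and '1', so the run stops early.
        System.out.println(firstRun("port 8010#011next")); // prints "port 8"
    }
}
```

The field is cut at the first '0', which is why a plain negated character class cannot stand in for "anything up to #011".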

The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group:

(?>(.+?)#011)

Once the group has matched, there's no more backtracking into it. The engine is done with that field and moves on to the next group.
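Both suggestions can be sketched in Java roughly as follows. This is only an illustration under assumptions: the class TabFields, its method names and the sample lines are hypothetical, and the pattern is built by repeating the atomic group once per field except the last, which keeps the plain (.+?) from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TabFields {
    static final String SEP = "#011";

    // Option 1: split on the literal separator; the limit -1 keeps trailing empty fields.
    static String[] bySplit(String line) {
        return line.split(Pattern.quote(SEP), -1);
    }

    // Option 2: (fields - 1) atomic groups, each ending in the separator,
    // followed by one last capturing group without a separator.
    static Pattern atomicPattern(int fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < fields; i++) {
            sb.append("(?>(.+?)").append(SEP).append(")");
        }
        sb.append("(.+?)");
        return Pattern.compile(sb.toString());
    }

    // Returns the captured fields, or an empty list when the whole line does not match.
    static List<String> byAtomic(Pattern p, String line) {
        Matcher m = p.matcher(line);
        List<String> out = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", bySplit("a b#011c#011d")));
        System.out.println(byAtomic(atomicPattern(3), "a b#011c#011d"));
        System.out.println(byAtomic(atomicPattern(3), "a b#011c")); // too few fields
    }
}
```

Compiling the pattern once and reusing it for every log line avoids paying the compilation cost per line. A line with too few fields fails quickly here, because no atomic group can be re-entered to try a different split.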

Negating a string

The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:

 ((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011

Pretty terrible, isn't it? Unlike your original expression, it matches newlines (you weren't specific about them).
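The newline difference can be seen in a small Java sketch. The class name, method name and two-line sample input are assumptions made for illustration. The blanks from the expression above are removed because, without a free-spacing flag, they would be matched literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NegationNewlines {
    // The lazy form from the question, reduced to a single field.
    static final Pattern LAZY = Pattern.compile("(.+?)#011");
    // The alternation-based negation, with the readability blanks removed.
    static final Pattern NEGATED =
            Pattern.compile("((?:[^#]|#[^0]|#0[^1]|#01[^1])+)#011");

    // Returns group 1 of the first match found anywhere in the input, or null.
    static String firstField(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "line1\nline2#011x";
        // '.' does not match '\n' by default, so the lazy form starts after the newline.
        System.out.println(firstField(LAZY, input));
        // [^#] does match '\n', so the negated form spans both lines.
        System.out.println(firstField(NEGATED, input));
    }
}
```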

An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:

 ((?: (?!#011). )+)#011

It's all pretty complicated and most probably less performant than simply using the atomic group.
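For reference, the lookahead variant could be used from Java along these lines. This is a sketch, not a tuned implementation: the class TemperedDot, the method fields and the sample line are hypothetical. The blanks are stripped from the pattern; as far as I know Pattern.COMMENTS would not help here, since in that mode # starts a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemperedDot {
    // One field: one or more chars, none of which starts "#011", then the separator.
    static final Pattern FIELD = Pattern.compile("((?:(?!#011).)+)#011");

    // Collects every field that is followed by a separator.
    // A final field without a trailing separator is not returned.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // "#0d" is not the separator, so it stays inside the second field.
        System.out.println(fields("a b#011c#0d#011tail"));
    }
}
```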

Optimizations

Out of my above regexes, the first one is best. However, as Casimir et Hippolyte wrote, there's room for improvement (a factor of 1.8):

( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011

It's not as complicated as it looks. First match any number (including zero) of non-# characters atomically (the trailing +). Then match a # not followed by 011, and again any number of non-# characters. Repeat the last sentence any number of times.

A small problem with it is that it matches an empty sequence as well, and I can't see an easy way to fix it.
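The empty-sequence behaviour can be reproduced with a short Java sketch. Everything named here (PossessiveField, leadingField, the sample inputs) is hypothetical. The GUARDED pattern is my own assumption and not part of the answer above: it puts a (?!#011) lookahead in front, which appears to rule out the empty field, but treat it as one possible workaround to test on real data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PossessiveField {
    // The optimized expression with the readability blanks removed.
    static final String BODY = "([^#]*+(?:#(?!011)[^#]*)*+)#011";
    static final Pattern ORIGINAL = Pattern.compile(BODY);
    // Assumption, not from the answer: refuse to start directly on a separator.
    static final Pattern GUARDED = Pattern.compile("(?!#011)" + BODY);

    // Returns the field at the very start of the input, or null if none matches there.
    static String leadingField(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A lone '#' inside a field is consumed by the "#(?!011)" branch.
        System.out.println(leadingField(ORIGINAL, "a b#c#011rest"));
        // The original accepts an empty field when the input starts with the separator.
        System.out.println("[" + leadingField(ORIGINAL, "#011rest") + "]");
        // The guarded variant rejects that case.
        System.out.println(leadingField(GUARDED, "#011rest"));
    }
}
```

If empty fields are legitimate in the logs (two tabs in a row), the unguarded form may actually be what is wanted.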
