regex: performance of lookbehind while still capturing data?

In my C# project I am parsing text for dates. The dates can be in various formats, and the objective is to find and correct a number of date format errors. Because the formats vary so much, matching against a fixed set of defined date formats isn't feasible. Originally I had a set of around 10 regexes applied one by one to the input string. This was functionally fine, but when the string got towards 200 KB of text, performance became a problem as the function took about 150 ms.

I found I could improve performance considerably by applying the date regexes only to substrings that contain dates. So if all dates had to have the English month name, using a regex of

\b(January|February|March|April|May|June|July|August|September|October|November|December)\b

would find them. If I then did some substringing to get the text around each matched month, overall function performance was about 25 ms, so much better. However, the substring/loop and length-check code is untidy and doesn't feel like a really good solution. What I really wanted was a single regex to match the month and the text around it. Something like

.{0,25}\b(January|February|March|April|May|June|July|August|September|October|November|December)\b.{0,25}

is functionally fine. However, this regex takes about 3500 ms to find matches on the same long input string.

Now the similar regex

(?<=.{0,25})\b(January|February|March|April|May|June|July|August|September|October|November|December)\b.{0,25}

with a positive lookbehind finds the matches in about 15 ms (due to greatly reduced backtracking, for reasons I accept and have some understanding of). However, that doesn't work for my use, as I need the text before and after the month name to be included in the match result.

So, my question is: can I have a regex that has the performance of the lookbehind version, but the functionality of providing all the text within the match result?

The performance gain is an illusion. Normally, something like .{0,25} at the start of a pattern will cause a lot of backtracking, which explains the poor performance that you're seeing. When placed inside a look-behind, however, it no longer consumes text or drives that backtracking: a look-behind is a zero-width assertion, and because .{0,25} is satisfied by zero characters, the assertion always succeeds. This means that the look-behind is completely useless, since it never adds any characters before the month name to the match.
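To see this concretely, here is a small sketch of what the lookbehind version actually returns. The class and method names (LookbehindDemo, MatchedText) are made up for illustration, and the pattern is cut down to a single month for brevity; it is not the asker's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // The pattern from the question, shortened to one month (assumption:
    // the full alternation behaves the same way for this purpose).
    private static readonly Regex WithLookbehind =
        new Regex(@"(?<=.{0,25})\bMarch\b.{0,25}");

    // Returns the text of the first match, or an empty string if none.
    public static string MatchedText(string input)
    {
        Match m = WithLookbehind.Match(input);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Calling LookbehindDemo.MatchedText("Invoice dated 3 March 2020 was late") gives "March 2020 was late": the match begins at the month name, and none of the preceding text is part of it.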

What if you extract the context after you find a match on the month name, by using the position of the match? For each match in regex.Matches(str), get match.Index and match.Length and take the substrings before and after those positions.
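A minimal sketch of that approach might look like the following. DateContext, FindWithContext and the window parameter are hypothetical names chosen for this example; only Regex.Matches, Match.Index, Match.Length and Substring come from the suggestion itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DateContext
{
    // The cheap month-name regex from the question.
    private static readonly Regex MonthName = new Regex(
        @"\b(January|February|March|April|May|June|July|August|September|October|November|December)\b",
        RegexOptions.Compiled);

    // For each month name found, returns the month plus up to `window`
    // characters on either side, clamped to the bounds of the string.
    public static List<string> FindWithContext(string text, int window = 25)
    {
        var results = new List<string>();
        foreach (Match match in MonthName.Matches(text))
        {
            int start = Math.Max(0, match.Index - window);
            int end = Math.Min(text.Length, match.Index + match.Length + window);
            results.Add(text.Substring(start, end - start));
        }
        return results;
    }
}
```

Two differences from the single-regex version are worth keeping in mind: the window here can cross line breaks (the . in .{0,25} would stop at a newline by default), and two month names close together produce overlapping snippets. The slower date regexes can then be applied to each returned snippet instead of the whole input.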
