Regex: lookbehind performance while still capturing data?

Question

In my C# project I am parsing text for dates. The dates can appear in various formats, and the objective is to find and correct a number of date format errors. Because the formats vary so much, matching against a fixed set of defined date formats isn't feasible. Originally I applied a set of around 10 regexes, one by one, to the input string. This was functionally fine, but once the string approached 200 KB of text, performance became a problem: the function took about 150 ms.

I found I could improve performance considerably by applying the date regexes only to substrings that contain dates. So if all dates had to include the English month name, a regex of

\b(January|February|March|April|May|June|July|August|September|October|November|December)\b

would find them. If I then did some substringing to get the text around each matched month, overall function performance was about 25 ms, which is much better. However, the substring, loop, and length-check code is untidy and doesn't feel like a really good solution. What I really wanted was a single regex to match the month and the text around it. Something like

.{0,25}\b(January|February|March|April|May|June|July|August|September|October|November|December)\b.{0,25}

is functionally fine. However, this regex takes about 3500 ms to find the matches on the same long input string.

Now the similar regex

(?<=.{0,25})\b(January|February|March|April|May|June|July|August|September|October|November|December)\b.{0,25}

with a positive lookbehind finds the matches in about 15 ms (due to greatly reduced backtracking, for reasons I accept and have some understanding of). However, that doesn't work for my use, as I need the text before and after the month name to be included in the match result.

So, my question is: can I have a regex that has the performance of the lookbehind version but also provides all the text within the match result?

Answer 1

The performance gain is an illusion. Normally, something like .{0,25} at the start of a pattern causes a lot of backtracking, which explains the poor performance that you're seeing. When placed inside a look-behind, however, it no longer drives any backtracking: a look-behind is a zero-width assertion, and because .{0,25} is allowed to match zero characters, the assertion succeeds at every position without adding anything to the match. This means that the look-behind is completely useless, since the match will always contain zero characters before the month name.
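A quick way to check this is sketched below. The class `LookbehindDemo` and the helper `FirstMatch` are hypothetical names introduced for illustration; the pattern is the look-behind version from the question, and the sample string in the usage note is made up.

```csharp
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    private const string Months =
        "January|February|March|April|May|June|July|August|September|October|November|December";

    // Pattern from the question: look-behind, month name, up to 25 trailing characters.
    private static readonly Regex WithLookbehind =
        new Regex(@"(?<=.{0,25})\b(" + Months + @")\b.{0,25}");

    // Hypothetical helper: returns the text of the first match,
    // or an empty string if the input contains no month name.
    public static string FirstMatch(string input)
    {
        Match m = WithLookbehind.Match(input);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Calling `LookbehindDemo.FirstMatch("Due on 5 January 2013, noon")` should return text that begins with "January": the leading "Due on 5 " is not part of the match, which is why this pattern cannot deliver the surrounding context on its own.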

What if you extract the context after you find a match on the month name, using the position of the match? For each match in regex.Matches(str), get match.Index and match.Length, and take the substring before and after those positions.
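The suggestion above could look roughly like the following. This is a minimal sketch, not a tuned implementation: the class `DateContext`, the method `Extract`, and the `radius` parameter are hypothetical names, and the default of 25 simply mirrors the `{0,25}` bound in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DateContext
{
    // Month-name regex from the question; cheap because it has no leading wildcard.
    private static readonly Regex MonthName = new Regex(
        @"\b(January|February|March|April|May|June|July|August|September|October|November|December)\b",
        RegexOptions.Compiled);

    // Hypothetical helper: for each month name found, returns the month plus up to
    // `radius` characters on either side, clamped to the bounds of the input.
    public static List<string> Extract(string input, int radius = 25)
    {
        var results = new List<string>();
        foreach (Match m in MonthName.Matches(input))
        {
            int start = Math.Max(0, m.Index - radius);
            int end = Math.Min(input.Length, m.Index + m.Length + radius);
            results.Add(input.Substring(start, end - start));
        }
        return results;
    }
}
```

For example, `DateContext.Extract("Due on 5 January 2013, noon", 5)` yields a single entry, "on 5 January 2013". One difference from the single-regex version is worth noting: by default `.` does not match a newline, whereas this substring window will cross line breaks, so the extracted context may need trimming if dates never span lines. The more detailed date regexes can then be applied to each extracted snippet instead of the whole input.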

Answer 1
1 2013-01-05 10:20:17
