How can I improve the performance of a .NET regular expression?

I have a regular expression which parses a (very small) subset of the Razor template language. Recently, I added a few more rules to the regex, which dramatically slowed its execution. I'm wondering: are there certain regex constructs that are known to be slow? Is there a restructuring of the pattern I'm using that would maintain readability and yet improve performance? Note: I've confirmed that this performance hit occurs post-compilation.

Here's the pattern:

new Regex(
              @"  (?<escape> \@\@ )"
            + @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"
            + @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"

            // captures expressions of the form "foreach (var [var] in [expression]) { <text>" 
/* ---> */      + @"| (?<foreach> \@foreach \s* \( \s* var \s+ (?<var> \w+ ) \s+ in \s+ (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"

            // captures expressions of the form "if ([expression]) { <text>" 
/* ---> */      + @"| (?<if> \@if \s* \( \s* (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"  

            // captures the close of a razor text block
            + @"| (?<endBlock> </text> \s* \} )"

            // an expression of the form @([(int)] a.b.c)
            + @"| (?<parenAtExpression> \@\( \s* (?<castToInt> \(int\)\s* )? (?<expressionValue> [\w\.]+ ) \s* \) )"
            + @"| (?<atExpression> \@ (?<expressionValue> [\w\.]+ ) )"
/* ---> */      + @"| (?<literal> ([^\@<]+|[^\@]) )",
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

/* ---> */ indicates the new "rules" that caused the slowdown.

As you are not anchoring the expression, the engine has to check each alternative sub-pattern at every position of the string before it can be sure that there is no match. This will always be time-consuming, but how can it be made less so?

Some thoughts:

I don't like the sub-pattern on the second line that tries to match comments, and I don't think it will work correctly.

I can see what you're trying to do with ( ([^\*]\@) | (\*[^\@]) | . )*: allow @ and * within comments as long as they are not preceded by * or followed by @ respectively. But because of the group's * quantifier and the third option ., the sub-pattern will happily match *@, which makes the other two options redundant.

Assuming that the subset of Razor you are trying to match does not allow multiline comments, I suggest this for the second line:

+ @"| (?<comment> @\*.*?\*@ )"

i.e. lazily match any characters (except newlines) until the first *@ is encountered. You are using RegexOptions.ExplicitCapture, which means only named groups are captured, so the lack of () should not be a problem.
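A minimal sketch of the difference, assuming a hypothetical input with two comments on one line. It compares the comment rule from the question with the lazy version suggested above, using the same IgnorePatternWhitespace and ExplicitCapture options; the variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Same whitespace and capture options as the Regex in the question.
var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;

// The comment rule as written in the question.
var original = new Regex(
    @"(?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )", options);

// The suggested lazy replacement.
var lazy = new Regex(@"(?<comment> @\* .*? \*@ )", options);

// Hypothetical sample input: two separate comments on one line.
var input = "@* a *@ x @* b *@";

// The greedy group in the original runs on to the last *@ on the line,
// while the lazy version stops at the first one.
var originalMatch = original.Match(input).Value;
var lazyMatch = lazy.Match(input).Value;

Console.WriteLine($"original: [{originalMatch}]");
Console.WriteLine($"lazy:     [{lazyMatch}]");
```

On this input the original rule swallows both comments and the text between them, which is the "happily match *@" behaviour described above.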

I also do not like the ([^\@<]+|[^\@]) sub-pattern in the last line, which is equivalent to ([^\@<]+|<): the second alternative is only tried where the first one fails, that is at an @ or a <, and [^\@] cannot match @. The [^\@<]+ will greedily match to the end of the string unless it comes across an @ or <.

I do not see any adjacent sub-patterns that can match the same text, which are the usual culprits for excessive backtracking, but all the \s* look suspect because of their greed and flexibility: they can match nothing at all, and they can match newlines. Perhaps you could change some of the \s* to [ \t]* where you know you don't want to match newlines, for example before the opening parenthesis following an if.

I notice that nhahtdh has suggested atomic grouping to stop the engine from backtracking into text it has already matched. That is certainly worth experimenting with, since the slowdown is almost certainly caused by excessive backtracking when the engine cannot find a match.

What are you trying to achieve with the RegexOptions.Multiline option? You do not appear to be using ^ or $, so it will have no effect.

The escaping of the @ is unnecessary.
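The last two points can be checked with a short sketch. The sample strings are made up for illustration; the only claims exercised are that RegexOptions.Multiline changes the meaning of ^ and $ and nothing else, and that \@ and @ match the same text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical two-line input.
var text = "a\nb";

// Multiline makes ^ match at the start of every line, not just the string.
int anchoredDefault = Regex.Matches(text, @"^\w").Count;
int anchoredMultiline = Regex.Matches(text, @"^\w", RegexOptions.Multiline).Count;

// A pattern without ^ or $ behaves the same with or without the option.
int plainDefault = Regex.Matches(text, @"\w").Count;
int plainMultiline = Regex.Matches(text, @"\w", RegexOptions.Multiline).Count;

// An escaped @ and a bare @ match the same text.
bool escapedAt = Regex.IsMatch("@name", @"^\@name$");
bool bareAt = Regex.IsMatch("@name", @"^@name$");

Console.WriteLine($"anchored: {anchoredDefault} vs {anchoredMultiline}");
Console.WriteLine($"plain:    {plainDefault} vs {plainMultiline}");
Console.WriteLine($"escaped @: {escapedAt}, bare @: {bareAt}");
```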

As others have mentioned, you can improve readability by removing unnecessary escapes (such as escaping @, or escaping characters inside a character class that do not need it; for example, using [^*] instead of [^\*]).

Here are some ideas for improving performance:

Order your different alternatives so that the most likely ones come first.

The regex engine will attempt to match each alternative in the order in which it appears in the regex. If you put the more likely ones up front, the engine will not have to waste time attempting unlikely alternatives in the majority of cases.
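One thing to keep in mind when reordering: because alternatives are tried in order, the order can also decide which rule wins when two of them can match at the same position. The sketch below uses two rules cut down from the question's pattern (the trimmed patterns and the sample input are assumptions made for illustration) to show that moving the general "atExpression" rule ahead of "using" changes the result.

```csharp
using System;
using System.Text.RegularExpressions;

var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;

// Trimmed-down versions of two rules from the question.
const string usingRule = @"(?<using> @using \s+ (?<namespace> [\w.]+ ) )";
const string atRule = @"(?<atExpression> @ (?<expressionValue> [\w.]+ ) )";

// Same two alternatives, in opposite orders.
var usingFirst = new Regex(usingRule + " | " + atRule, options);
var atFirst = new Regex(atRule + " | " + usingRule, options);

// Hypothetical input.
var input = "@using System.Text";

// With "using" first, the specific rule wins.
bool usingWins = usingFirst.Match(input).Groups["using"].Success;

// With "atExpression" first, "@using" is consumed as a plain expression.
var reordered = atFirst.Match(input);
bool usingLost = !reordered.Groups["using"].Success;
string capturedExpression = reordered.Groups["expressionValue"].Value;

Console.WriteLine($"using first: using matched = {usingWins}");
Console.WriteLine($"at first:    expressionValue = {capturedExpression}");
```

So a reordering for speed is only safe between alternatives that cannot match at the same starting position.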

Remove unnecessary backtracking

Note the ending of your "using" alternative: @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"

If for some reason you have a large amount of whitespace but no closing ; at the end of a using line, the regex engine must backtrack through each whitespace character before it finally decides that it can't match (\s*;). In your case, (\s*;)? can be replaced with \s*;? to prevent backtracking in these scenarios. Note that the replacement also consumes the trailing whitespace when there is no ;.
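A small sketch of the two endings side by side, assuming a hypothetical using line that has trailing whitespace and no semicolon. Both variants capture the same namespace; they differ only in how much trailing text ends up in the overall match.

```csharp
using System;
using System.Text.RegularExpressions;

var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;

// Ending as in the question: optional group that must end in ';'.
var optionalGroup = new Regex(
    @"(?<using> @using \s+ (?<namespace> [\w.]+ ) (\s*;)? )", options);

// Suggested ending: whitespace and ';' are independently optional.
var loose = new Regex(
    @"(?<using> @using \s+ (?<namespace> [\w.]+ ) \s*;? )", options);

// Hypothetical input: trailing spaces, a newline, and no ';'.
var input = "@using Foo   \nrest";

var optionalGroupMatch = optionalGroup.Match(input);
var looseMatch = loose.Match(input);

// The first variant gives the whitespace back after failing to find ';'.
// The second keeps it (including the newline), so it never has to backtrack.
Console.WriteLine($"(\\s*;)? -> [{optionalGroupMatch.Value}]");
Console.WriteLine($"\\s*;?   -> [{looseMatch.Value}]");
Console.WriteLine($"namespace: {looseMatch.Groups["namespace"].Value}");
```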

In addition, you could use atomic groups (?> ... ) to prevent backtracking through quantifiers (e.g. * and +). This really helps performance when there is no match. For example, your "foreach" alternative contains \s* \( \s*. If the input contains the text "foreach var...", the "foreach" alternative will greedily match all of the whitespace after foreach and then fail when it doesn't find an opening (. It will then backtrack, one whitespace character at a time, and try to match ( at each previous position until it confirms that it cannot match. Using an atomic group, (?>\s*)\(, stops the regex engine from backtracking through \s* once it has matched, allowing the regex to fail more quickly.

Be careful when using them though, as they can cause unintended failures when used in the wrong place. For instance, (?>.*); will never match anything: the greedy .* matches every remaining character on the line (including ;), and the atomic group (?> ... ) prevents the regex engine from backtracking one character to match the final ;.
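Both halves of that advice can be tried in a few lines. This is only a behavioural sketch on made-up inputs: it shows what the atomic group matches and rejects, not how much time it saves.

```csharp
using System;
using System.Text.RegularExpressions;

// Atomic whitespace before the opening parenthesis, as suggested above.
var atomicForeach = new Regex(@"@foreach(?>\s*)\(");

// Still matches when the parenthesis is there...
bool withParen = atomicForeach.IsMatch("@foreach   (var x in xs)");

// ...and fails, without retrying shorter runs of whitespace, when it is not.
bool withoutParen = atomicForeach.IsMatch("@foreach   var x");

// The pitfall: a greedy .* inside an atomic group never gives the ';' back.
bool greedyFindsSemicolon = Regex.IsMatch("abc;", @".*;");
bool atomicFindsSemicolon = Regex.IsMatch("abc;", @"(?>.*);");

Console.WriteLine($"foreach with '(': {withParen}, without: {withoutParen}");
Console.WriteLine($".*; matches: {greedyFindsSemicolon}");
Console.WriteLine($"(?>.*); matches: {atomicFindsSemicolon}");
```

The atomic group is safe around \s* here because the character that must follow, (, can never be matched by \s, so nothing useful is lost by refusing to backtrack.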

"Unroll the loop" on some of your alternatives, such as your "comment" alternative (also useful if you plan on adding an alternative for strings).

For example: @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"

could be replaced with @"| (?<comment> @\* [^*]* (\*+[^@*][^*]*)* \*+@ )"

The new regex boils down to:

  1. @\* : Find the beginning of a comment, @*
  2. [^*]* : Read all "normal characters" (anything that's not a *, because that could signify the end of the comment)
  3. (\*+[^@*][^*]*)* : Include any non-terminal * inside the comment
    • (\*+[^@*] : If we find a run of * s, make sure it is not followed by a @ (the class also excludes * so that the engine cannot split the run and read past a **@ terminator)
    • [^*]* : Go back to reading all "normal characters"
    • )* : Loop back to the beginning if we find another *
  4. \*+@ : Finally, grab the end of the comment, *@, being careful to include any extra *
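The steps above can be sketched as a small runnable check. The pattern in this sketch writes the class after \*+ as [^@*], so that a run of asterisks cannot be split and a **@ terminator is not skipped; the helper name FirstComment and the sample inputs are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;

// Unrolled comment rule: opening @*, normal characters, then any number of
// "asterisks that do not end the comment, followed by more normal
// characters", and finally the closing run of asterisks plus @.
var unrolled = new Regex(
    @"(?<comment> @\* [^*]* (\*+ [^@*] [^*]*)* \*+@ )", options);

// Returns the first comment found, or an empty string if there is none.
string FirstComment(string input) => unrolled.Match(input).Value;

// Asterisks inside the body do not end the comment.
var inner = FirstComment("x @* a ** b *@ y");

// Extra asterisks directly before the closing @ are part of the terminator.
var stars = FirstComment("@***@");

// The match stops at the first terminator, even one written as **@.
var firstOnly = FirstComment("@* a **@ b *@");

// Unlike '.', [^*] also matches newlines, so multi-line comments match.
var multiLine = FirstComment("@* line1\nline2 *@");

// An unterminated comment does not match at all.
var unterminated = FirstComment("@* never closed");

Console.WriteLine($"[{inner}] [{stars}] [{firstOnly}]");
Console.WriteLine($"multi-line matched: {multiLine.Length > 0}");
Console.WriteLine($"unterminated matched: {unterminated.Length > 0}");
```

Because the character classes also match newlines, this version accepts multi-line comments. If the Razor subset being parsed should reject those, the classes would need to exclude \n as well.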

You can find many more ideas for improving the performance of your regular expressions in Jeffrey Friedl's Mastering Regular Expressions (3rd Edition).
