
Parse C-Style Comments with Regex, avoid Backtracking

I want to match all block and multiline comments in a JavaScript file (these are C-style comments). I have a pattern that works well. However, it causes some backtracking, which slows it down significantly, especially on larger files.

Pattern: \/\*(?:.|[\r\n])*?\*\/|(?:\/\/.*)

Example: https://www.regex101.com/r/pR6eH6/2

How can I avoid the backtracking?

You have heavy backtracking because of the alternation. Instead of (?:.|[\r\n]), you may consider using the character class [\s\S], which boosts performance to a noticeable extent:

\/\*[\s\S]*?\*\/|\/\/.*

See demo

In Python, you can use the re.S / re.DOTALL modifier to make . match line breaks, too (note that the single-line comment pattern should then be written as //[^\r\n]*):

/\*.*?\*/|//[^\r\n]*

See another demo
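A minimal sketch of the DOTALL variant in Python follows. The `sample` string and the name `COMMENT_RE` are made up for illustration; only the pattern itself comes from the answer above.

```python
import re

# With re.S (re.DOTALL), "." also matches line breaks, so the single-line
# branch uses [^\r\n]* to stop at the end of the line.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\r\n]*", re.S)

# Hypothetical input containing one single-line and two block comments.
sample = "var a = 1; // first\n/* block\n   comment */\nvar b = 2; /* inline */\n"

comments = COMMENT_RE.findall(sample)
print(comments)
```

Without the [^\r\n]* change, the // branch would use .* under re.S and run to the end of the input instead of the end of the line.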

However, since the lazy quantifier *? also causes overhead, similar to that caused by greedy quantifiers, you should consider using a much more efficient pattern for C-style multiline comments, /\*[^*]*\*+(?:[^/*][^*]*\*+)*/, so that the whole regex now looks like:

/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*

See yet another demo

Details:

  • /\* - a /*
  • [^*]* - zero or more chars other than *
  • \*+ - one or more asterisks
  • (?:[^/*][^*]*\*+)* - zero or more sequences of:
    • [^/*] - a symbol other than / and *
    • [^*]* - zero or more symbols other than *
    • \*+ - 1+ asterisks
  • / - a / symbol
  • | - or
  • //.* - // and any 0+ chars other than line break chars.

Just wanted to note that in Python, you do not need to escape / (in JS, you do not need to escape / when declaring a regex using the RegExp constructor).
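As a rough illustration, the unrolled pattern can be used in Python with unescaped slashes. The `sample` string and the name `UNROLLED_RE` are assumptions chosen to exercise asterisks inside a comment; they are not part of the original example.

```python
import re

# Unrolled pattern from above; "/" needs no escaping in a Python pattern.
UNROLLED_RE = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*")

# Hypothetical input: a block comment with inner asterisks, a single-line
# comment, and a block comment spanning several lines.
sample = "/** doc ** with * stars **/ x = 1; // tail\n/* a\n * b\n */"

matches = UNROLLED_RE.findall(sample)
print(matches)
```

No DOTALL flag is needed here, because the negated classes [^*] and [^/*] already match line breaks.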

NOTE: The last pattern does not allow simply capturing what is inside /* and */, but since the pattern is more stable than the rest, I'd advise using it even when you need to capture the contents, together with the trailing *: /\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*). You would then need to remove the last char from .group(1).
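One possible way to apply this note, sketched under the assumption that you want the comment text without its delimiters. The helper name `comment_bodies` and the sample input are hypothetical.

```python
import re

# Capturing variant: group 1 holds a block comment body plus the "*" of the
# closing "*/"; group 2 holds a single-line comment body.
CAPTURE_RE = re.compile(r"/\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*)")


def comment_bodies(source):
    """Return the text inside each comment, without the delimiters."""
    bodies = []
    for match in CAPTURE_RE.finditer(source):
        if match.group(1) is not None:
            # Drop the last char: the asterisk that belongs to "*/".
            bodies.append(match.group(1)[:-1])
        else:
            bodies.append(match.group(2))
    return bodies


bodies = comment_bodies("/* one */ // two\n/** three **/")
print(bodies)
```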

What can you do with your pattern?

Your actual pattern is:

 \/\*(?:.|[\r\n])*?\*\/|(?:\/\/.*)

or, without the useless backslashes and groups:

/\*(?:.|[\r\n])*?\*/|//.*

As stribizhev has explained, (?:.|[\r\n])*? can be written more simply using the DOTALL mode, i.e. .*?, or, without that mode, by using [\s\S] in place of the dot.

But you can do much better if you factor out the first character /, which is common to the two branches of your main alternation (the branch for multiline comments and the branch for single-line comments):

/(?:\*[\s\S]*?\*/|/.*)

The two advantages of this change:

  1. Beginning a pattern with an alternation is not a good idea and should be avoided when possible, because the regex engine must test both branches of the alternation (in the worst case) at each position in the string. So in your case (only two branches), you can consider that the regex engine's work is doubled. If you factor out the first character (or more tokens if possible), most of the uninteresting positions in the string (those that do not start with a /) are discarded more quickly, since there is only one branch to test when the first character is not the right one.

  2. When you start a pattern with a literal string, the regex engine is able to use a faster algorithm to directly find the positions in the string where the pattern may succeed (the positions where the literal string appears). In your case, using this optimisation will make your pattern much faster.
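A small sanity check, assuming Python's re module, that factoring out the leading / does not change what is matched. The names `ORIGINAL_RE` and `FACTORED_RE` and the sample string are hypothetical, and whether a given engine actually applies the literal-prefix optimisation is not shown by this sketch.

```python
import re

# Same alternation, before and after factoring out the shared leading "/".
ORIGINAL_RE = re.compile(r"/\*[\s\S]*?\*/|//.*")
FACTORED_RE = re.compile(r"/(?:\*[\s\S]*?\*/|/.*)")

# Hypothetical input with a division operator, which neither pattern matches.
sample = "a / b; /* c */ d // e\n"

original_matches = ORIGINAL_RE.findall(sample)
factored_matches = FACTORED_RE.findall(sample)
print(original_matches, factored_matches)
```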

Another thing you can improve: the non-greedy quantifier

A non-greedy quantifier is slow by nature (compared to a greedy quantifier) because each time it takes a character, it must test whether the rest of the pattern succeeds (and it repeats this until the rest of the pattern does succeed). In other words, a non-greedy quantifier can be worse than a greedy quantifier when the backtracking mechanism kicks in (the backtracking mechanism and how quantifiers work are among the most important things to understand, so take the time to learn them).

You can rewrite the subpattern \*[\s\S]*?\*/ in a more efficient way:

\*[^*]*\*+(?:[^*/][^*]*\*+)*/

Details:

\*    # literal asterisk
[^*]* # zero or more characters that are not an asterisk
\*+   # one or more asterisks: this one will match either the last asterisk(s)
      # before the closing slash or asterisk(s) inside the comment.

(?:[^*/][^*]*\*+)* # In case there are asterisk(s) inside the comment, this
                   # optional group ensures the next character isn't a slash: [^*/]
                   # and then reaches the next asterisk(s): [^*]*\*+
/    # a literal slash

This subpattern is longer but more efficient, since it uses only greedy quantifiers and keeps backtracking steps to a minimum.

The pattern is now:

/(?:\*[^*]*\*+(?:[^*/][^*]*\*+)*/|/.*)

and it only needs ~950 steps (instead of ~12500) to find the 63 occurrences in your example string.

demo
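To try the final pattern outside regex101, a rough sketch in Python is given below. The synthetic `source` string is an assumption, not the example from the question, and the step counts quoted above come from the regex101 debugger, so the timings printed here will differ by engine, input, and machine.

```python
import re
import timeit

# Lazy version versus the factored, unrolled version from above.
LAZY_RE = re.compile(r"/\*[\s\S]*?\*/|//.*")
FINAL_RE = re.compile(r"/(?:\*[^*]*\*+(?:[^*/][^*]*\*+)*/|/.*)")

# Hypothetical input: 500 copies of a snippet with both kinds of comment.
snippet = "/* block\n * with ** stars\n */\nvar x = 1; // trailing\n"
source = snippet * 500

lazy_matches = LAZY_RE.findall(source)
final_matches = FINAL_RE.findall(source)
print(len(final_matches), lazy_matches == final_matches)

# Crude wall-clock comparison; no particular result is assumed here.
for name, regex in (("lazy", LAZY_RE), ("final", FINAL_RE)):
    seconds = timeit.timeit(lambda: regex.findall(source), number=20)
    print(name, round(seconds, 4))
```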
