Python regex: reading C-style comments

I'm trying to find C-style comments in a C file, but I'm having trouble when // happens to appear inside quotation marks. This is the file:

/*My function
is great.*/
int j = 0//hello world
void foo(){
    //tricky example
    cout << "This // is // not a comment\n";
}

It will match inside that cout line. This is what I have so far (I can already match the /* */ comments):

fp = open(s)

p = re.compile(r'//(.+)')
txt = p.findall(fp.read())
print (txt)

The first step is to identify cases where // or /* must not be interpreted as the beginning of a comment, for example when they are inside a string (between quotes). To avoid content between quotes (or other things), the trick is to put it in a capture group and to insert a backreference in the replacement pattern:

pattern:

(
    "(?:[^"\\]|\\[\s\S])*"
  |
    '(?:[^'\\]|\\[\s\S])*'
)
|
//.*
|
/\*(?:[^*]|\*(?!/))*\*/

replacement:

\1


Since quoted parts are searched first, each time you find // or /*...*/, you can be sure that you are not inside a string.
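As a minimal sketch of the replacement idea in Python (the names STRIP_COMMENTS and strip_comments are made up for this illustration), the pattern can be compiled in verbose mode and applied with re.sub. A function is used as the replacement instead of the string \1 so that an unmatched group 1 simply becomes an empty string, whatever the Python version:

```python
import re

# Same three alternatives as the pattern above, written in verbose mode.
STRIP_COMMENTS = re.compile(r'''
    (                               # group 1: a quoted literal, kept as-is
        "(?:[^"\\]|\\[\s\S])*"
      |
        '(?:[^'\\]|\\[\s\S])*'
    )
  |
    //.*                            # single-line comment, dropped
  |
    /\*(?:[^*]|\*(?!/))*\*/         # multi-line comment, dropped
''', re.VERBOSE)


def strip_comments(source):
    """Return source with comments removed and quoted literals untouched."""
    return STRIP_COMMENTS.sub(lambda m: m.group(1) or '', source)


sample = 'int j = 0;//hello world\ncout << "This // is // not a comment\\n";\n/*a\nb*/x'
stripped = strip_comments(sample)
```

Here only the two real comments are removed from sample; the // sequences inside the string literal survive because the quoted alternative consumes them first.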

Note that the pattern is deliberately inefficient (because of the (A|B)* subpatterns) to make it easier to understand. To make it more efficient you can rewrite it like this:

("(?=((?:[^"\\]+|\\[\s\S])*))\2"|'(?=((?:[^'\\]+|\\[\s\S])*))\3')|//.*|/\*(?=((?:[^*]+|\*(?!/))*))\4\*/

(?=(something+))\1 is only a way to emulate an atomic group (?>something+).
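A small sketch of what the emulation buys, using toy patterns that are not part of the answer: a lookahead is never re-entered once it has succeeded, so whatever it captured is fixed, and the backreference must consume exactly that text. (Python 3.11 and later also accept the native (?>...) syntax; the lookahead form works on older versions too.)

```python
import re

# Ordinary greedy quantifier: a+ can give back one 'a' so the final 'a' matches.
plain = re.compile(r'a+a')

# Emulated atomic group: the lookahead captures 'aaa' once and for all,
# \1 consumes it, and nothing is left for the final 'a'.
atomic_like = re.compile(r'(?=(a+))\1a')

plain_match = plain.match('aaa')
atomic_match = atomic_like.match('aaa')
```

plain_match succeeds on the whole string, while atomic_match is None: the engine was not allowed to backtrack into the captured run.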


So, if you only want to find comments (and not remove them), the handiest approach is to put the comment part of the pattern in a capture group and test whether it is non-empty. The following pattern has been updated (after Jonathan Leffler's comment) to handle the trigraph ??/, which is interpreted as a backslash character by the preprocessor (I assume that the code isn't written for the -trigraphs option), and to handle a backslash followed by a newline character, which allows a single logical line to be split across several lines:

import re

fp = open(s)  # s: path of the C file, as in the question

p = re.compile(r'''(?x)
(?=["'/])      # trick to make it faster, a kind of anchor
(?:
    "(?=((?:[^"\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\1" # double quotes string
  |
    '(?=((?:[^'\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\2' # single quotes string
  |
    (
        /(?:(?:\?\?/|\\)\n)*/(?:.*(?:\?\?/|\\)\n)*.* # single line comment
      |
        /(?:(?:\?\?/|\\)\n)*\*                       # multiline comment
        (?=((?:[^*]+|\*+(?!(?:(?:\?\?/|\\)\n)*/))*))\4
        \*(?:(?:\?\?/|\\)\n)*/             
    )
)
''')

for m in p.findall(fp.read()):
    if m[2]:          # group 3 holds the comment; it is empty for quoted strings
        print(m[2])

These changes do not affect the pattern's efficiency, since the main work for the regex engine is to find positions that begin with a quote or a slash. This task is simplified by the lookahead at the beginning of the pattern, (?=["'/]), which allows internal optimizations to quickly find the first character.

Another optimization is the use of emulated atomic groups, which reduces backtracking to a minimum and allows greedy quantifiers inside repeated groups.

NB: luckily, there is no heredoc syntax in C!

Python's re.findall method basically works the same way as most lexers do: it successively returns the next match, starting where the previous match finished. (Unlike a typical lexer, Python's alternation takes the first alternative that matches rather than the longest one, so the order of the patterns matters.) All that is required is to produce a disjunction of all the lexical patterns:

(<pattern 1>)|(<pattern 2>)|...|(<pattern n>)

Unlike most lexers, it doesn't require the matches to be contiguous, but that's not a significant difference since you can always just add (.) as the last pattern, in order to match all otherwise unmatched characters individually.

An important feature of re.findall is that if the regex has any groups, then only the groups will be returned. Consequently, you can exclude alternatives by simply leaving out the parentheses, or changing them to non-capturing parentheses:

(<pattern 1>)|(?:<unimportant pattern 2>)|(<pattern 3>)
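To illustrate how the capturing and non-capturing alternatives behave, here is a toy two-token lexer (a hypothetical example, much simpler than real C). Only the comment alternative is captured, so a match of the string alternative shows up in the result as an empty string that can be filtered out:

```python
import re

# Only the first alternative is captured; the string literal is matched
# (so its contents are skipped over) but contributes an empty entry.
lexer = re.compile(r'(//[^\n]*)|(?:"[^"]*")')

raw = lexer.findall('a "x // y" // real')
comments = [c for c in raw if c]
```

raw contains one entry per match, including the empty one produced by the string literal; comments keeps only the entries where the captured group actually took part in the match.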

With that in mind, let's take a look at how to tokenize C just enough to recognize comments. We need to deal with:

  1. Single-line comments: // Comment
  2. Multi-line comments: /* Comment */
  3. Double-quoted string: "Might include escapes like \n"
  4. Single-quoted character: '\t'
  5. (See below for a few more irritating cases)

With that in mind, let's create regexen for each of the above.

  1. Two slashes followed by anything other than a newline: //[^\n]*
  2. This regex is tedious to explain: /[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/ Note that it uses (?:...) to avoid capturing the repeated group.
  3. A quote, followed by any repetition of either a character other than quote and backslash or a backslash followed by any character whatsoever, followed by a closing quote. That's not a precise definition of an escape sequence, but it's good enough to detect when a " terminates the string, which is all we care about: "(?:[^"\\]|\\.)*"
  4. The same as (3) but with single quotes: '(?:[^'\\]|\\.)*'

Finally, the goal was to find the text of C-style comments. So we just need to avoid captures in any of the other groups. Hence:

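import re

p = re.compile('|'.join((r"(//[^\n]*)"                         # single-line comment: the only capture
                        ,r"/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/"  # multi-line comment
                        ,r'"(?:[^"\\]|\\.)*"'                  # double-quoted string
                        ,r"'(?:[^'\\]|\\.)*'")))               # single-quoted character

def find_comments(text):
    return [c[2:] for c in p.findall(text) if c]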
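If the multi-line comments are wanted as well, one possible variation is to capture both comment kinds in named groups and use finditer. This is only a sketch built from the four regexes listed above; the names C_TOKENS, iter_comments, line and block are invented for the example:

```python
import re

C_TOKENS = re.compile(r'''
    (?P<line>//[^\n]*)                              # single-line comment
  | (?P<block>/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/)    # multi-line comment
  | "(?:[^"\\]|\\.)*"                               # string literal, skipped
  | '(?:[^'\\]|\\.)*'                               # character literal, skipped
''', re.VERBOSE)


def iter_comments(text):
    """Yield (kind, comment_text) pairs, where kind is 'line' or 'block'."""
    for m in C_TOKENS.finditer(text):
        # lastgroup is None when one of the uncaptured literals matched.
        if m.lastgroup:
            yield m.lastgroup, m.group(m.lastgroup)


# The file from the question, as a Python string.
sample = ('/*My function\nis great.*/\n'
          'int j = 0//hello world\n'
          'void foo(){\n'
          '    //tricky example\n'
          '    cout << "This // is // not a comment\\n";\n'
          '}\n')
found = list(iter_comments(sample))
```

On the file from the question this yields the block comment and the two line comments, and nothing from inside the cout string.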

Above, I left out some obscure cases which are unlikely to arise:

  1. In an #include <...> directive, the <...> is essentially a string. In theory, it could contain quotes or sequences which look like comments, but in practice you will never see:

     #include </*This looks like a comment but it is a filename*/>
  2. A line which ends with \ is continued on the next line; the \ and the following newline character are simply removed from the input. This happens before any lexical scanning is performed, so the following is a perfectly legal comment (actually two comments):

     /\
     **************** Surprise! **************\
     //////////////////////////////////////////
  3. To make the above worse, the trigraph ??/ is the same as a \, and that replacement happens before the continuation handling.

     /************************************//??/
     **************** Surprise! ************??/
     //////////////////////////////////////////

    Outside of obfuscation contests, no-one actually uses trigraphs. But they're still in the standard. The easiest way to deal with both of these issues would be to prescan the string:

     return [c[2:] for c in p.findall(text.replace('??/','\\').replace('\\\n','')) if c]
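The prescan can also be pulled out into a small helper, which makes the order of the two replacements explicit. This is a minimal sketch (the name prescan is made up here) that mirrors the preprocessor: trigraph first, then line splicing:

```python
def prescan(text):
    """Replace the ??/ trigraph with a backslash, then remove backslash-newline pairs."""
    text = text.replace('??/', '\\')   # trigraph replacement comes first
    return text.replace('\\\n', '')    # then continuation lines are joined


spliced = prescan('/\\\n* hidden *\\\n/ tail')
joined = prescan('a ??/\nb')
```

Note that joining lines changes line numbers, so positions reported against the prescanned text no longer line up with the original file.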

The only way to deal with the #include <...> issue, if you really cared about it, would be to add one more pattern, something like #include\s*<[^>\n]*> .
