
Python Regex reading in c style comments

I'm trying to find C-style comments in a C file, but I'm having trouble when // appears inside quotation marks. This is the file:

/*My function
is great.*/
int j = 0//hello world
void foo(){
    //tricky example
    cout << "This // is // not a comment\n";
}

My pattern also matches inside the string on that cout line. This is what I have so far (I can already match the /* */ comments):

import re

with open(s) as fp:  # s is the path to the C file
    p = re.compile(r'//(.+)')
    txt = p.findall(fp.read())
print(txt)

The first step is to identify the cases where // or /* must not be interpreted as the beginning of a comment, for example when they are inside a string (between quotes). To skip over quoted content (or anything else), the trick is to match it in a capture group and to insert a backreference to that group in the replacement pattern:

pattern:

(
    "(?:[^"\\]|\\[\s\S])*"
  |
    '(?:[^'\\]|\\[\s\S])*'
)
|
//.*
|
/\*(?:[^*]|\*(?!/))*\*/

replacement:

\1

Since quoted parts are tried first, each time you find // or /*...*/ you can be sure that you are not inside a string.
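In Python, the same idea can be sketched with re.sub. This is only a minimal illustration, not a definitive implementation: the names STRIP_RE and strip_comments are made up for the example, and a replacement function is used instead of the \1 replacement string so that an unmatched group simply becomes an empty string.

```python
import re

# Same three alternatives as the pattern above, written in verbose mode.
STRIP_RE = re.compile(r'''
    (                               # group 1: string or char literal, kept
        "(?:[^"\\]|\\[\s\S])*"
      |
        '(?:[^'\\]|\\[\s\S])*'
    )
  |
    //.*                            # single-line comment, dropped
  |
    /\*(?:[^*]|\*(?!/))*\*/         # multi-line comment, dropped
''', re.VERBOSE)


def strip_comments(source):
    """Return source with comments removed and quoted literals untouched."""
    # group(1) is None when a comment matched, so fall back to ''.
    return STRIP_RE.sub(lambda m: m.group(1) or '', source)


src = 'int j = 0;//hello world\ncout << "This // is // not a comment\\n"; /* gone */\n'
print(strip_comments(src))
```

The quoted string survives because it is consumed as a whole by group 1 before the engine ever gets a chance to try //.* inside it.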

Note that the pattern above is deliberately inefficient (because of the (A|B)* subpatterns) to make it easier to understand. To make it more efficient, you can rewrite it like this:

("(?=((?:[^"\\]+|\\[\s\S])*))\2"|'(?=((?:[^'\\]+|\\[\s\S])*))\3')|//.*|/\*(?=((?:[^*]+|\*(?!/))*))\4\*/

(?=(something+))\1 is simply a way to emulate an atomic group (?>something+), which Python's re module only supports natively from version 3.11 onwards.
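A tiny experiment may make the effect of the emulated atomic group easier to see. It is only a sketch with made-up variable names (plain, atomic): once the lookahead has captured its text, the engine does not go back inside it to try a shorter capture.

```python
import re

# Plain greedy quantifier: a+ gives one "a" back so that the final a can match.
plain = re.match(r'a+a', 'aaa')

# Emulated atomic group: the lookahead captures "aaa" once and for all,
# \1 consumes exactly that text, and nothing is left for the final a.
atomic = re.match(r'(?=(a+))\1a', 'aaa')

print(plain.group())   # aaa
print(atomic)          # None
```

This is what keeps the long comment and string patterns from backtracking character by character when a match fails.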

So, if you only want to find the comments (and not remove them), the handiest approach is to put the comment part of the pattern in a capture group and to test whether it is empty. The following pattern has been updated (after Jonathan Leffler's comment) to handle the trigraph ??/, which the preprocessor interprets as a backslash character (I assume that the code isn't written for the -trigraphs option), and to handle a backslash followed by a newline character, which allows a single logical line to be split over several lines:

import re

p = re.compile(r'''(?x)
(?=["'/])      # trick to make it faster, a kind of anchor
(?:
    "(?=((?:[^"\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\1" # double quotes string
  |
    '(?=((?:[^'\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\2' # single quotes string
  |
    (
        /(?:(?:\?\?/|\\)\n)*/(?:.*(?:\?\?/|\\)\n)*.* # single line comment
      |
        /(?:(?:\?\?/|\\)\n)*\*                       # multiline comment
        (?=((?:[^*]+|\*+(?!(?:(?:\?\?/|\\)\n)*/))*))\4
        \*(?:(?:\?\?/|\\)\n)*/
    )
)
''')

with open(s) as fp:  # s is the path to the C file
    for m in p.findall(fp.read()):
        if m[2]:  # group 3 holds the comment; it is empty for quoted literals
            print(m[2])
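To see the "capture the comment, test whether it is empty" idea in isolation, here is a simplified sketch that leaves out the trigraph and line-continuation handling and runs on the file from the question. The names SIMPLE_RE, SAMPLE and find_comments are invented for this example, and a named group is used purely for readability.

```python
import re

SIMPLE_RE = re.compile(r'''
    "(?:[^"\\]|\\[\s\S])*"              # double-quoted string, not captured
  |
    '(?:[^'\\]|\\[\s\S])*'              # single-quoted literal, not captured
  |
    (?P<comment>
        //.*                            # single-line comment
      |
        /\*(?:[^*]|\*(?!/))*\*/         # multi-line comment
    )
''', re.VERBOSE)

# The sample file from the question, copied as-is.
SAMPLE = r'''/*My function
is great.*/
int j = 0//hello world
void foo(){
    //tricky example
    cout << "This // is // not a comment\n";
}
'''


def find_comments(source):
    """Return every comment in source, delimiters included."""
    # The comment group is None whenever a quoted literal matched instead.
    return [m.group('comment') for m in SIMPLE_RE.finditer(source)
            if m.group('comment')]


print(find_comments(SAMPLE))
# ['/*My function\nis great.*/', '//hello world', '//tricky example']
```

The cout line produces a match as well, but it is the string alternative that matches, so the comment group stays empty and the match is skipped.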

These changes do not affect the efficiency of the pattern, since the main work for the regex engine is to find positions that begin with a quote or a slash. This task is simplified by the lookahead (?=["'/]) at the beginning of the pattern, which allows internal optimizations to find the first character quickly.

Another optimization is the use of emulated atomic groups, which reduce backtracking to a minimum and make it possible to use greedy quantifiers inside repeated groups.

NB: luckily, there is no heredoc syntax in C!

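Python's re.findall method basically works the same way as most lexers do: it successively returns the next match, starting where the previous match finished. (Where a lexer usually picks the longest match, Python's alternation picks the first alternative that matches, so the order of the patterns matters.) All that is required is to produce a disjunction of all the lexical patterns: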

(<pattern 1>)|(<pattern 2>)|...|(<pattern n>)

Unlike most lexers, it doesn't require the matches to be contiguous, but that's not a significant difference since you can always just add (.) as the last pattern, in order to match all otherwise unmatched characters individually.

An important feature of re.findall is that if the regex has any groups, then only the groups will be returned. Consequently, you can exclude alternatives by simply leaving out the parentheses, or changing them to non-capturing parentheses:

(<pattern 1>)|(?:<unimportant pattern 2>)|(<pattern 3>)
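A toy example, unrelated to C, shows what re.findall returns in that situation. The pattern and the variable names (p, tokens, numbers) are only illustrative.

```python
import re

# One capturing alternative (numbers) and one non-capturing alternative (words).
p = re.compile(r'(\d+)|(?:[a-z]+)')

tokens = p.findall('abc 123 de 45')
print(tokens)                      # ['', '123', '', '45']

# Matches of the non-capturing alternative show up as empty strings,
# so filtering them out leaves only the interesting tokens.
numbers = [t for t in tokens if t]
print(numbers)                     # ['123', '45']
```

The spaces are never matched by either alternative, so they are silently skipped, which is the non-contiguous behaviour mentioned above.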

With that in mind, let's take a look at how to tokenize C just enough to recognize comments. We need to deal with:

  1. Single-line comments: // Comment
  2. Multi-line comments: /* Comment */
  3. Double-quoted string: "Might include escapes like \n"
  4. Single-quoted character: '\t'
  5. (See below for a few more irritating cases)

Now let's create regexen for each of the above.

  1. Two slashes followed by anything other than a newline: //[^\n]*
  2. This regex is tedious to explain: /\*[^*]*[*]+(?:[^/*][^*]*[*]+)*/ Note that it uses (?:...) to avoid capturing the repeated group.
  3. A quote, then any repetition of either a character other than quote and backslash or a backslash followed by any character whatsoever, then a closing quote. That's not a precise definition of an escape sequence, but it's good enough to detect when a " terminates the string, which is all we care about: "(?:[^"\\]|\\.)*"
  4. The same as (3) but with single quotes: '(?:[^'\\]|\\.)*'
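The two less obvious regexen can be checked on a few awkward inputs. This is just a quick sanity check; BLOCK and STRING are names made up for it.

```python
import re

BLOCK = re.compile(r'/\*[^*]*[*]+(?:[^/*][^*]*[*]+)*/')   # regex 2
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')                  # regex 3

# Runs of stars right before the closing slash are handled.
print(BLOCK.fullmatch('/***/') is not None)        # True

# The match stops at the first */ and does not run on to a later one.
print(BLOCK.match('/* a */ b */').group())         # /* a */

# An escaped quote does not terminate the string.
print(STRING.match(r'"a \" b" c').group())         # "a \" b"
```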

Finally, the goal was to find the text of C-style comments. So we just need to avoid captures in any of the other groups. Hence:

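import re

def line_comments(text):  # the function name is arbitrary
    # Only the // alternative is captured, so findall returns '' for the others.
    p = re.compile('|'.join((r"(//[^\n]*)"
                            ,r"/\*[^*]*[*]+(?:[^/*][^*]*[*]+)*/"
                            ,'"'+r"""(?:[^"\\]|\\.)*"""+'"'
                            ,r"'(?:[^'\\]|\\.)*'")))
    return [c[2:] for c in p.findall(text) if c]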

Above, I left out some obscure cases which are unlikely to arise:

  1. In an #include <...> directive, the <...> is essentially a string. In theory, it could contain quotes or sequences which look like comments, but in practice you will never see:

     #include </*This looks like a comment but it is a filename*/>
  2. A line which ends with \ is continued on the next line; the \ and following newline character are simply removed from the input. This happens before any lexical scanning is performed, so the following is a perfectly legal comment (actually two comments):

     /\
     **************** Surprise! **************\
     //////////////////////////////////////////
  3. To make the above worse, the trigraph ??/ is the same as a \ , and that replacement happens before the continuation handling.

     /************************************//??/
     **************** Surprise! ************??/
     //////////////////////////////////////////

    Outside of obfuscation contests, no-one actually uses trigraphs. But they're still in the standard. The easiest way to deal with both of these issues would be to prescan the string:

     return [c[2:] for c in p.findall(text.replace('??/','\\').replace('\\\n','')) if c]

The only way to deal with the #include <...> issue, if you really cared about it, would be to add one more pattern, something like #include\s*<[^>\n]*> .
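Putting the prescan and the extra #include alternative together could look like the sketch below. It is an illustration under a few assumptions rather than a finished tool: COMMENT_RE, prescan and find_comments are invented names, and unlike the snippet above it captures both kinds of comment and returns them with their delimiters.

```python
import re

COMMENT_RE = re.compile('|'.join((
    r"#include\s*<[^>\n]*>",                          # header name, not captured
    r'"(?:[^"\\]|\\.)*"',                             # string literal, not captured
    r"'(?:[^'\\]|\\.)*'",                             # character literal, not captured
    r"(//[^\n]*|/\*[^*]*[*]+(?:[^/*][^*]*[*]+)*/)",   # either kind of comment
)))


def prescan(text):
    """Replace the ??/ trigraph, then splice lines that end in a backslash."""
    return text.replace('??/', '\\').replace('\\\n', '')


def find_comments(text):
    """Return the comments found in text after the prescan."""
    return [c for c in COMMENT_RE.findall(prescan(text)) if c]


print(find_comments('#include </*not a comment*/>\n// real\n'))
# ['// real']
print(find_comments('/\\\n**** Surprise! ****\\\n//////\n'))
# ['/**** Surprise! ****/', '/////']
```

Because the prescan removes characters, any positions you derive from the matches refer to the prescanned text, not to the original file.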

