简体   繁体   中英

Finding all strings within source code that are not inside comments using Python

I have a C like source code, and I am trying to extract and save all the strings in this source code to a list, without including the strings that are in comments. a string in this source code can have any characters, white spaces, and even comments.

Example:

// this is an inline comment with the name "Alpha 1"

string x = "Alpha 2";
/** this is a block comment with the string "Alpha 3" */
foo("Alpha 4");
string y = "Alpha /*  */ 5 // with comments";

Output:

["Alpha 2", "Alpha 4", "Alpha /*  */ 5 // with comments"]

The problem that I can't use regex because I can have comments within a given string (which is valid), and of course I can have strings inside an inline comment or inside a block comment.

I use this method to get all strings inside a code:

re.findall(r'\"(.+?)\"', code)

but It gives me also the strings that are inside comments.

Any Help?

If the language is as simple as you describe, I think I'd write the parser by hand. I'd still use a regular expression to tokenize the input.

Here you go:

import re
from itertools import takewhile


def extract_strings(source):
    def consume(it, end):
        return list(takewhile(lambda x: x != end, it))
    tokens = iter(re.split(r'''("|/\*|\*/|//|\n)''', source))
    strings = []
    for token in tokens:
        if token == '"':
            strings.append(''.join(consume(tokens, '"')))
        elif token == '//':
            consume(tokens, '\n')
        elif token == '/*':
            consume(tokens, '*/')
    return strings

data = '''
// this is an inline comment with the name "Alpha 1"

string x = "Alpha 2";
/** this is a block comment with the string "Alpha 3" */
foo("Alpha 4");
string y = "Alpha /*  */ 5 // with comments";
'''
print(extract_strings(data))
re = r'(?:\/\/.+\n|\/\*.+\*\/)|(\".+\"|\'.+\')'

This should mostly work. Just make just that comments that are like // comment end in a newline. All words that are not in a comment will be returned in capture group one. Be careful, however, there will be None for every comment in the code.

Given:

>>> src='''\
... // this is an inline comment with the name "Alpha 1"
... 
... string x = "Alpha 2";
... /** this is a block comment with the string "Alpha 3" */
... foo("Alpha 4");
... string y = "Alpha /*  */ 5 // with comments";'''

This regex will work:

>>> pat=re.compile(r"(?:\/\/.+$|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)|(\"[^\"]*\"|\'[^']*\')", re.M)
>>> [m.group(1) for m in pat.finditer(src) if m.group(1)]
['"Alpha 2"', '"Alpha 4"', '"Alpha /*  */ 5 // with comments"']

The regex is explained here .

(But using a parser is more robust...)

On the subject of writing a parser, I take this opportunity to write my own using state machine:

提取字符串的状态机

import sys


ASTERISK = '*'
DEFAULT = 'default'
EOL = '\n'
ESCAPE = '\\'
QUOTE = '"'
SLASH = '/'


class ExtractStrings:
    def __init__(self, multiline_string):
        self.buffer = multiline_string
        self.chars_collected = ''
        self.strings = None

    def noop(self, ch):
        pass

    def collect_char(self, ch):
        self.chars_collected += ch

    def return_string(self, ch):
        self.strings.append(self.chars_collected)
        self.chars_collected = ''

    def parse(self):
        self.strings = []
        state = {
            'start': {
                QUOTE: (self.noop, 'in_string'),
                SLASH: (self.noop, 'first_slash'),
                DEFAULT: (self.noop, 'start'),
            },
            'in_string': {
                QUOTE: (self.return_string, 'start'),
                ESCAPE: (self.collect_char, 'escaping'),
                DEFAULT: (self.collect_char, 'in_string'),
            },
            'escaping': {
                DEFAULT: (self.collect_char, 'in_string'),
            },
            'first_slash': {
                SLASH: (self.noop, 'line_comment'),
                ASTERISK: (self.noop, 'block_comment'),
                DEFAULT: (self.noop, 'start'),
            },
            'line_comment': {
                EOL: (self.noop, 'start'),
                DEFAULT: (self.noop, 'line_comment'),
            },
            'block_comment': {
                ASTERISK: (self.noop, 'near_comment_block_end'),
                DEFAULT: (self.noop, 'block_comment'),
            },
            'near_comment_block_end': {
                SLASH: (self.noop, 'start'),
                ASTERISK: (self.noop, 'near_comment_block_end'),
                DEFAULT: (self.noop, 'block_comment'),
            }

        }

        current = 'start'
        for ch in self.buffer:
            default = state[current][DEFAULT]
            action, next_state = state[current].get(ch, default)
            action(ch)
            current = next_state

    def __iter__(self):
        if self.strings is None:
            self.parse()

        return iter(self.strings)

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        code = f.read()

    for string_literal in ExtractStrings(code):
        print('"%s"' % string_literal)

How does it work?

The state machine defines different states, what to do at each state (not shown in the diagram) and the transitions to the next states. Once the state machine is defined (as a nested dictionary), it is just a matter of perform the action for the state, read the next char and look up the state machine to see which state we should transition to.

The state machine is a nested dictionary. For the outer dictionary, the key is the state name and the value is the inner dictionary. For the inner dictionary, the key is the next char and the value is a tuple of (action, next state).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM