Python regex for matching single line and multi line comments.

I'm trying to create a Python regex, for PLY, that will match comments of the form

// some comment

and

/* comment
   more comment */

So I tried

t_COMMENT = r'//.+ | /\*.+\*/'

but this doesn't allow for multi-line comments. When I try to solve this with the 'dot matches all' option, like

t_COMMENT = r'//.+ | (?s) /\*.+\*/'

it results in the '//' comment type matching many lines. Also if I try to have two separate regexes like

t_COMMENT = r'//.+' 
t_COMMENT2 = r'(?s) /\*.+\*/'

the '//' comment type still matches multiple lines as though the dot matches all option is selected.

Does anybody know how to solve this?

The regex below matches both types of comment (compile it with re.DOTALL so that '.' can cross line breaks):

(?://[^\n]*|/\*(?:(?!\*/).)*\*/)

DEMO

>>> import re
>>> s = """// some comment
... 
... foo
... bar
... foobar
... /* comment
...    more comment */ bar"""
>>> m = re.findall(r'(?://[^\n]*|/\*(?:(?!\*/).)*\*/)', s, re.DOTALL)
>>> m
['// some comment', '/* comment\n   more comment */']
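
Since the question describes the DOTALL flag leaking into the '//' rule, the same idea can also be written without relying on '.' at all, so that no flag is needed. This is only a minimal sketch, assuming comments are not nested and no string literal contains comment markers; the name COMMENT and the sample text are made up for illustration:

```python
import re

# Neither branch uses '.', so the pattern behaves the same with or
# without re.DOTALL:
#   //[^\n]*        a line comment, up to (not including) the newline
#   /\*[\s\S]*?\*/  a block comment, lazily up to the first '*/'
COMMENT = r'//[^\n]*|/\*[\s\S]*?\*/'

sample = """// some comment

foo
/* comment
   more comment */ bar // trailing"""

comments = re.findall(COMMENT, sample)
print(comments)
# ['// some comment', '/* comment\n   more comment */', '// trailing']
```

Because the pattern contains no spaces, it should also be unaffected by verbose-mode compilation.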

According to the PLY documentation, this can be accomplished with 'conditional lexing'. It may be more readable, and easier to debug, than a complex regular expression. The example given there is a little more complicated, since it keeps track of nesting levels and of the content inside the block. Your case is simpler, since you don't need any of that information.

The code for multi line comment should be something like this:

# I'd prefer 'multi_line_comment', but it appears that 
# state names cannot have underscore in them
states = (
    ('multiLineComment','exclusive'),
)

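# The name must not begin with the state name: as far as I can tell, PLY
# would register 't_multiLineComment_...' as a rule of that state, so the
# opening '/*' would never be seen in the INITIAL state.
def t_begin_multiLineComment(t):
    r'/\*'
    t.lexer.begin('multiLineComment')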

def t_multiLineComment_end(t):
    r'\*/'
    t.lexer.begin('INITIAL')

def t_multiLineComment_newline(t):
    r'\n'
    # keep line numbers accurate across multi-line comments
    t.lexer.lineno += 1

# catch (and ignore) anything that isn't end-of-comment:
# a run of characters other than '*' and newline, or a '*' not followed by '/'
def t_multiLineComment_content(t):
    r'[^*\n]+|\*(?!/)'
    pass

Of course, you'll have to have another rule, under the regular state, for // comments.
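
To make the two-state idea concrete without depending on PLY, here is a rough, library-free sketch of the same approach using only the re module. It is not PLY code: the rule tables, the state names and the function strip_comments are all hypothetical, and it assumes comments do not nest and ignores string literals.

```python
import re

# One rule table per state; within a state the first matching rule wins.
INITIAL_RULES = [
    ('line_comment', re.compile(r'//[^\n]*')),
    ('block_start', re.compile(r'/\*')),
    ('other', re.compile(r'[^/]+|/')),    # anything that is not a comment
]
COMMENT_RULES = [
    ('block_end', re.compile(r'\*/')),
    ('content', re.compile(r'[^*]+|\*')),  # ignored comment body
]

def strip_comments(text):
    """Return text with // and /* ... */ comments removed."""
    state = 'INITIAL'
    pos = 0
    kept = []
    while pos < len(text):
        rules = INITIAL_RULES if state == 'INITIAL' else COMMENT_RULES
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m:
                break
        # The last rule of each table matches any single character,
        # so m is always a match object at this point.
        if name == 'block_start':
            state = 'COMMENT'
        elif name == 'block_end':
            state = 'INITIAL'
        elif name == 'other':
            kept.append(m.group())
        pos = m.end()
    return ''.join(kept)

source = "a = 1; // set a\nb = 2; /* multi\nline */ c = a / b;"
print(repr(strip_comments(source)))
# 'a = 1; \nb = 2;  c = a / b;'
```

The '//' rule lives only in the INITIAL table, which is the counterpart of putting the single-line comment rule under the regular state.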

This is a small variation on Avinash's solution.

pat = re.compile(r'(?://.*?$)|(?:/\*.*?\*/)', re.M|re.S)
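
A quick check of this variation, assuming the intended escapes are single backslashes (\*) inside the raw string; the sample text is invented for illustration. With re.M, the lazy '.*?' in the first branch stops at the first end of line even though re.S lets '.' match newlines:

```python
import re

pat = re.compile(r'(?://.*?$)|(?:/\*.*?\*/)', re.M | re.S)

text = "x = 1  // first\ny = 2  /* second\n   still second */\nz = 3"

found = pat.findall(text)
print(found)
# ['// first', '/* second\n   still second */']

stripped = pat.sub('', text)
print(repr(stripped))
# 'x = 1  \ny = 2  \nz = 3'
```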

This might be useful:

 (/\*(.|\n)*?\*/)|(//.*)
