Match C++ Strings and String Literals using regex in Python

Question

I am trying to match strings (in both double and single quotes) and string literals in C++ source files. I am using the re library in Python.

I have reached the point where I can match double-quoted strings with r'"(.*?)"', but I am having trouble extending that regex to also match single-quoted strings (I am confused by the \\ and by how to escape the quotes in a Python regex).

Also, from here I want to be able to match each of these cases:

" (unescaped_character|escaped_character)* "
L " (unescaped_character|escaped_character)* "
u8 " (unescaped_character|escaped_character)* "
u " (unescaped_character|escaped_character)* "
U " (unescaped_character|escaped_character)* "
prefix(optional) R "delimiter( raw_characters )delimiter"

I am so confused by regexes, and everything I try fails. Any suggestions and example code would help me gain understanding and, hopefully, build all these regexes.

Answer 1

You can grab all the string literals with the following regex:

r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"'

See the regex demo
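The one-line pattern is dense, so here is a sketch of the same pattern laid out with re.VERBOSE. The name CPP_LITERAL and the sample string src are only illustrative. Using a triple-quoted raw string means neither quote character needs a backslash in front of it, which sidesteps the quoting confusion mentioned in the question.

```python
import re

# Same pattern as the one-liner above, spread out with re.VERBOSE.
# Whitespace and "#" comments outside character classes are ignored.
CPP_LITERAL = re.compile(r'''
    (?P<prefix>(?:\bu8|\b[LuU])?)           # group 1: optional encoding prefix
    (?:
        "(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"   # group 2: body of a double-quoted literal
      |
        '(?P<sngl>[^'\\]*(?:\\.[^'\\]*)*)'  # group 3: body of a single-quoted literal
    )
  |
    R"([^"(]*)                              # group 4: raw-string delimiter (may be empty)
    \((?P<raw>.*?)\)\4"                     # group 5: raw body, closed by ) + delimiter + quote
''', re.VERBOSE)

# Illustrative input: one literal of each kind.
src = 'x = u8"a\\"b"; c = \'\\n\'; r = R"d(raw "text")d";'
for m in CPP_LITERAL.finditer(src):
    print(m.group(0))
```

The group numbering is unchanged, so the \4 backreference still points at the raw-string delimiter.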

Explanation:

(?P<prefix>(?:\bu8|\b[LuU])?) - (group named "prefix") the optional prefix, either u8 or one of L, u, U, each required to start at a word boundary (\b)
(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)" - a double-quoted string literal, with the contents between the " characters captured into the group named "dbl". This part matches ", then 0+ characters other than \ and ", followed by any number (0+) of sequences consisting of an escape sequence (\\.) and 0+ characters other than \ and " (it is an unrolled version of (?:[^"\\]|\\.)*)
| - or
\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\') - a single-quoted literal, with the contents between the ' characters captured into the group named "sngl". It works the same way as the double-quoted part above.
| - or
R"([^"(]*)\((?P<raw>.*?)\)\4" - the raw string literal part, capturing the contents into a group named "raw". First, R is matched. Then comes " followed by 0+ characters other than " and (, with the delimiter captured into Group 4 (named groups also have numeric IDs, so the unnamed delimiter group is the fourth). The inside contents are then matched with a lazy construct (use re.S if the strings are multiline), up to the first ) followed by the contents of Group 4 (the raw string literal delimiter) and the final " character.
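To make two points from the explanation concrete, here is a small sketch that checks the unrolled body against the plain alternation on a few inputs, and shows the delimiter backreference at work. The names plain, unrolled, raw_literal and the sample inputs are made up for illustration. The raw pattern is isolated here, so its delimiter is group 1 rather than group 4.

```python
import re

# Two spellings of "a double-quoted literal": plain alternation vs. unrolled loop.
plain = re.compile(r'"((?:[^"\\]|\\.)*)"')
unrolled = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')

samples = [r'"abc"', r'"a\"b"', r'"ends with backslash \\"', r'""', r'"tab\tnewline\n"']
for s in samples:
    # Both patterns should accept the whole literal and capture the same body.
    same = plain.fullmatch(s).group(1) == unrolled.fullmatch(s).group(1)
    print(s, same)

# Raw literal on its own: the delimiter is captured, then required again after ")".
raw_literal = re.compile(r'R"([^"(]*)\((.*?)\)\1"', re.S)
m = raw_literal.search('auto s = R"xy(a )" b)xy";')
print(m.group(1))  # the delimiter
print(m.group(2))  # the raw body, which may itself contain )" sequences
```

The lazy .*? stops only at a ) that is followed by the same delimiter and a closing quote, which is why a )" inside the body does not end the match early.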

Sample Python demo:

import re

p = re.compile(r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"')
s = "\"text'\\\"here\"\nL'text\\'\"here'\nu8\"text'\\\"here\"\nu'text\\'\"here'\nU\"text'\\\"here\"\nR\"delimiter(text\"'\"here)delimiter\""
print(s)
print('--------- Regex works below ---------')
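for x in p.finditer(s):
    # Compare against None: an empty literal such as "" captures '', which is falsy
    if x.group("dbl") is not None:
        print(x.group("dbl"))
    elif x.group("sngl") is not None:
        print(x.group("sngl"))
    else:
        print(x.group("raw"))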
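One thing to be aware of: in the pattern as written, the prefix group belongs only to the first top-level alternative, so a prefixed raw literal such as u8R"x(...)x" matches from the R onward and its prefix is not captured. Below is a sketch of one possible regrouping that moves the raw branch inside the group following the prefix. This variant, the helper iter_literals, and the sample code string are my own assumptions rather than part of the pattern above.

```python
import re

# Variant (an assumption, not the pattern from the answer): the raw-string
# branch sits inside the group that follows the optional prefix.
LITERAL = re.compile(
    r'(?P<prefix>(?:\bu8|\b[LuU])?)'
    r'(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"'
    r"|'(?P<sngl>[^'\\]*(?:\\.[^'\\]*)*)'"
    r'|R"([^"(]*)\((?P<raw>.*?)\)\4")',
    re.S,  # let a raw literal span several lines
)

def iter_literals(source):
    """Yield (prefix, kind, body) for each literal found in source."""
    for m in LITERAL.finditer(source):
        for kind in ("dbl", "sngl", "raw"):
            body = m.group(kind)
            if body is not None:  # '' is a valid (empty) literal body
                yield m.group("prefix"), kind, body
                break

# Illustrative input: a prefixed multiline raw literal, an empty wide string,
# and a character literal holding an escaped single quote.
code = 'auto a = u8R"x(one\ntwo)x"; auto b = L""; char c = \'\\\'\';'
for item in iter_literals(code):
    print(item)
```

The delimiter is still the fourth capturing group, so the \4 backreference is unchanged. Like the original, this is a text-level match and does not know about comments or preprocessor directives.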

solution1
2 ACCEPTED 2016-04-13 14:56:21
