Match C++ Strings and String Literals using regex in Python

I am trying to match strings (both double-quoted and single-quoted) and string literals in C++ source files. I am using the re library in Python.

I have reached the point where I can match double-quoted strings with r'"(.*?)"', but I am having trouble extending that regex to also match single-quoted strings (I am confused by the \\ and by how to escape the quotes in a Python regex).

Also, from here I want to be able to match each of these cases:

  • " (unescaped_character|escaped_character)* "

  • L " (unescaped_character|escaped_character)* "

  • u8 " (unescaped_character|escaped_character)* "

  • u " (unescaped_character|escaped_character)* "

  • U " (unescaped_character|escaped_character)* "

  • prefix(optional) R "delimiter( raw_characters )delimiter"

I am so confused by regexes, and everything I try fails. Any suggestions and example code would help me understand and, hopefully, build all these regexes.

You can grab all the string literals with the following regex:

r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"'

See the regex demo

Explanation:

  • (?P<prefix>(?:\bu8|\b[LuU])?) - (Group named "prefix") the optional prefix, either u8 (whole word) or L, u, U (as whole words)
  • (?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)" - a double-quoted string literal, with the contents between the " characters captured into the group named "dbl". This part matches ", then 0+ characters other than \ and ", followed by any number (0+) of repetitions of an escape sequence (\\.) followed by 0+ characters other than \ and " (it is an unrolled version of (?:[^"\\]|\\.)*)
  • | - or
  • \'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\') - a single-quoted literal (a character literal in C++), with the contents between the ' characters captured into the group named "sngl". It works the same way as the double-quoted part above.
  • | - or
  • R"([^"(]*)\((?P<raw>.*?)\)\4" - the raw string literal part, capturing the contents into the group named "raw". First, R is matched. Then " is matched, followed by 0+ characters other than " and (, which are captured as the delimiter into Group 4 (named groups also get numeric IDs, so "prefix", "dbl" and "sngl" are Groups 1 to 3). Then ( is matched and the inside contents are matched with a lazy construct (use re.S if the strings are multiline), up to the first ) followed by the contents of Group 4 (the raw string literal delimiter), and then the final ".
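As a reading aid, the same pattern can be laid out with re.VERBOSE so that each part described above sits on its own line. This is only a sketch: the names COMPACT, VERBOSE and sample are made up for illustration. Writing the pattern as a triple-quoted raw string also sidesteps the quote-escaping problem from the question, since neither " nor ' needs a backslash for Python's sake.

```python
import re

# Compact pattern, copied from the answer above.
COMPACT = re.compile(
    r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"'
)

# Same pattern in verbose mode: whitespace outside character classes is
# ignored and '#' starts a comment.
VERBOSE = re.compile(r'''
    (?P<prefix> (?: \bu8 | \b[LuU] )? )             # group 1: optional prefix
    (?:
        " (?P<dbl>  [^"\\]* (?: \\. [^"\\]* )* ) "  # group 2: double-quoted contents
      |
        ' (?P<sngl> [^'\\]* (?: \\. [^'\\]* )* ) '  # group 3: single-quoted contents
    )
  |
    R" ( [^"(]* )                                   # group 4: raw-string delimiter
    \( (?P<raw> .*? ) \)                            # group 5: raw contents (lazy)
    \4 "                                            # same delimiter, closing quote
''', re.VERBOSE)

# Hypothetical sample input: u8"a\"b" L'c' R"x(raw "text")x"
sample = 'u8"a\\"b" L\'c\' R"x(raw "text")x"'

for m in VERBOSE.finditer(sample):
    print(m.group(0))
```

Both compiled patterns should yield the same groups for the same input, which is an easy way to check that nothing was lost when reformatting.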

Sample Python demo:

import re

p = re.compile(r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"')
s = "\"text'\\\"here\"\nL'text\\'\"here'\nu8\"text'\\\"here\"\nu'text\\'\"here'\nU\"text'\\\"here\"\nR\"delimiter(text\"'\"here)delimiter\""
print(s)
print('--------- Regex works below ---------')
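for x in p.finditer(s):
    # Compare against None: an empty literal such as "" captures '', which is falsy.
    if x.group("dbl") is not None:
        print(x.group("dbl"))
    elif x.group("sngl") is not None:
        print(x.group("sngl"))
    else:
        print(x.group("raw"))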
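In the pattern above, the raw-string alternative sits outside the prefix group, so for something like u8R"x(...)x" the prefix is not captured. A possible variant, offered as a sketch rather than a definitive solution, moves the raw branch inside the prefixed group, names the delimiter group instead of relying on \4, and enables re.DOTALL so raw strings may span lines. The names CPP_LITERAL, find_literals, delim and code are assumptions made for this example, and the pattern still knows nothing about comments or preprocessor lines.

```python
import re

# Variant of the answer's pattern (an assumption, not the original):
# the optional prefix now applies to all three literal kinds.
CPP_LITERAL = re.compile(r'''
    (?P<prefix> (?: \bu8 | \b[LuU] )? )
    (?:
        R" (?P<delim> [^"(]* ) \( (?P<raw> .*? ) \) (?P=delim) "
      |
        "  (?P<dbl>  [^"\\]* (?: \\. [^"\\]* )* ) "
      |
        '  (?P<sngl> [^'\\]* (?: \\. [^'\\]* )* ) '
    )
''', re.VERBOSE | re.DOTALL)


def find_literals(source):
    """Yield (prefix, kind, contents) for each literal found in source."""
    for m in CPP_LITERAL.finditer(source):
        for kind in ("raw", "dbl", "sngl"):
            # 'is not None' keeps empty literals such as "" from being skipped.
            if m.group(kind) is not None:
                yield m.group("prefix"), kind, m.group(kind)
                break


# Hypothetical C++ snippet: a prefixed multiline raw string, an empty
# string, and a wide character literal containing an escape sequence.
code = 'auto a = u8R"x(one\ntwo)x"; auto b = ""; auto c = L\'\\n\';'

for prefix, kind, contents in find_literals(code):
    print(repr(prefix), kind, repr(contents))
```

The raw branch is listed first so that, after a prefix, R" is tried before a plain quote; the three branches start with different characters, so the order does not change which literals are found.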
