I am trying to match strings (both double-quoted and single-quoted) and string literals in C++ source files. I am using the re
library in Python.
I have reached the point where I can match double-quoted strings with r'"(.*?)"'
but I am having trouble with the syntax for extending that regex to also match single-quoted strings (I am confused by the \
and by how to escape the quotes in a Python regex).
Also, from here I want to be able to match each of these cases:
" (unescaped_character|escaped_character)* "
L " (unescaped_character|escaped_character)* "
u8 " (unescaped_character|escaped_character)* "
u " (unescaped_character|escaped_character)* "
U " (unescaped_character|escaped_character)* "
prefix(optional) R "delimiter( raw_characters )delimiter"
I am so confused by regexes, and everything I try fails. Any suggestions and example code would be a great help for me to gain understanding and, hopefully, build all these regexes.
You can grab all the string literals with the following regex:
r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"'
See the regex demo
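On the quoting confusion: the \' in the pattern above is only needed because the pattern is written inside a single-quoted Python raw string. The backslash stays in the string, and the regex engine reads \' as a literal single quote. A minimal sketch of the same single-quote part written both ways (the names escaped, unescaped and sample are illustrative, not from the original post):

```python
import re

# Inside r'...' the backslash is kept, so the regex engine sees \' and
# treats it as a literal single quote.
escaped = r'\'([^\'\\]*(?:\\.[^\'\\]*)*)\''

# A triple-quoted raw string can hold both quote characters unescaped.
unescaped = r"""'([^'\\]*(?:\\.[^'\\]*)*)'"""

# Hypothetical C++ fragment with three character literals.
sample = r"""c = 'a'; d = '\''; e = '\n';"""

found_escaped = re.findall(escaped, sample)
found_unescaped = re.findall(unescaped, sample)
print(found_escaped)
print(found_escaped == found_unescaped)
```

Both spellings compile to equivalent patterns, so which one to use is a readability choice.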
Explanation:
(?P<prefix>(?:\bu8|\b[LuU])?)
- (Group named "prefix") the optional prefix, either u8 or L, u, U (each as a whole word)
(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"
- a double quoted string literal, with the contents between the " characters captured into the group named "dbl". This part matches ", then 0+ characters other than \ and ", followed with any number (0+) of sequences of an escape sequence (\\.) followed with 0+ characters other than \ and " (it is an unrolled version of (?:[^"\\]|\\.)*)
|
- or
\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')
- a single quoted string literal, with the contents between the ' characters captured into the group named "sngl". It works the same way as the double quoted part above.
|
- or
R"([^"(]*)\((?P<raw>.*?)\)\4"
- this is the raw string literal part, capturing the contents into a group named "raw". First, R is matched. Then " followed with 0+ characters other than " and (, capturing the delimiter value into Group 4 (as all named groups also have their numeric IDs), and then the inside contents are matched with a lazy construct (use re.S if the strings are multiline), up to the first ) followed with the contents of Group 4 (the raw string literal delimiter), and then the final ".
Sample Python demo:
import re
p = re.compile(r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"')
s = "\"text'\\\"here\"\nL'text\\'\"here'\nu8\"text'\\\"here\"\nu'text\\'\"here'\nU\"text'\\\"here\"\nR\"delimiter(text\"'\"here)delimiter\""
print(s)
print('--------- Regex works below ---------')
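for x in p.finditer(s):
    # Compare against None so that empty literals such as "" are still reported
    if x.group("dbl") is not None:
        print(x.group("dbl"))
    elif x.group("sngl") is not None:
        print(x.group("sngl"))
    else:
        print(x.group("raw"))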
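For readability, the same pattern can also be laid out with re.VERBOSE so that each alternative carries its own comment. The following is a sketch under a few assumptions: the names CPP_LITERAL, literals and source are illustrative, re.DOTALL (the long form of re.S) is added so raw string literals may span lines, and the input is a made-up C++ fragment.

```python
import re

# Same pattern as above, written in a triple-quoted raw string with re.VERBOSE.
# re.DOTALL (re.S) lets "." cross newlines inside raw string literals.
CPP_LITERAL = re.compile(r"""
    (?P<prefix>(?:\bu8|\b[LuU])?)           # optional encoding prefix
    (?:
        "(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"   # double-quoted literal
      |
        '(?P<sngl>[^'\\]*(?:\\.[^'\\]*)*)'  # single-quoted literal
    )
  |
    R"([^"(]*)\((?P<raw>.*?)\)\4"           # raw literal, group 4 is the delimiter
""", re.VERBOSE | re.DOTALL)


def literals(text):
    """Return a list of (kind, prefix, body) tuples for each literal found."""
    out = []
    for m in CPP_LITERAL.finditer(text):
        for kind in ("dbl", "sngl", "raw"):
            if m.group(kind) is not None:
                # The prefix group does not take part in a raw-literal match.
                out.append((kind, m.group("prefix") or "", m.group(kind)))
                break
    return out


# Hypothetical C++ input, including a raw literal that spans two lines.
source = r'''auto a = u8"hi \"there\"";
wchar_t c = L'x';
auto r = R"xy(line one
line "two")xy";
'''

found = literals(source)
for item in found:
    print(item)
```

Like the original pattern, this sketch does not capture a prefix written in front of R, and it does not skip quotes that appear inside C++ comments.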