
Match C++ Strings and String Literals using regex in Python

I am trying to match strings (between both double and single quotes) and string literals in C++ source files. I am using the re library in Python.

I have reached the point where I can match double-quoted strings with r'"(.*?)"', but I am having trouble extending that regex to also match single-quoted strings (I am confused by the \ and by how to escape the quotes in a Python regex).

Also, from here I want to be able to match each of these cases:

  • " (unescaped_character|escaped_character)* "

  • L " (unescaped_character|escaped_character)* "

  • u8 " (unescaped_character|escaped_character)* "

  • u " (unescaped_character|escaped_character)* "

  • U " (unescaped_character|escaped_character)* "

  • prefix(optional) R "delimiter( raw_characters )delimiter"

I am so confused by regexes, and everything I try fails. Any suggestions and example code would be a great help for me to gain understanding and, hopefully, build all these regexes.

You can grab all the string literals with the following regex:

r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"'

See the regex demo.

Explanation:

  • (?P<prefix>(?:\bu8|\b[LuU])?) - (group named "prefix") the optional prefix: either u8 or one of L, u, U, each matched as a whole word
  • (?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)" - a double-quoted string literal, with the contents between the " characters captured into the group named "dbl". This part matches ", then 0+ characters other than \ and ", followed by any number (0+) of repetitions of an escape sequence ( \\. ) followed by 0+ characters other than \ and " (it is an unrolled version of (?:[^"\\]|\\.)* ), and then the closing "
  • | - or
  • \'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\') - a single-quoted literal, with the contents between the ' characters captured into the group named "sngl". It works the same way as the double-quoted branch above; the \' is only needed because the Python string itself is delimited with single quotes
  • | - or
  • R"([^"(]*)\((?P<raw>.*?)\)\4" - a raw string literal, with the contents captured into the group named "raw". First, R is matched, then " followed by 0+ characters other than " and ( , which are captured as the delimiter into Group 4 (named groups also get numeric IDs, so prefix, dbl and sngl are Groups 1 to 3). Then ( is matched, and the inside contents are matched with a lazy construct (use re.S if the strings are multiline) up to the first ) that is followed by the contents of Group 4 (the raw string literal delimiter) and the final " . Note that this branch sits outside the prefix group, so a prefix in front of R is not captured.
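As a reading aid, the same pattern can be laid out with re.VERBOSE, which ignores whitespace in the pattern and allows comments. The sketch below is meant to be equivalent to the one-line regex above, not a replacement for it; the names STRING_LITERAL, simple, unrolled and sample are only illustrative. The last lines check, on a single sample, that the unrolled form and the plain alternation capture the same contents.

```python
import re

# Same branches and group numbers as the one-line pattern, spread over lines.
STRING_LITERAL = re.compile(r'''
    (?P<prefix>(?:\bu8|\b[LuU])?)           # group 1: optional encoding prefix
    (?:
        "(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"   # group 2: double-quoted contents
      |
        '(?P<sngl>[^'\\]*(?:\\.[^'\\]*)*)'  # group 3: single-quoted contents
    )
  |
    R"([^"(]*)                              # group 4: raw-string delimiter
    \((?P<raw>.*?)\)                        # group 5: raw contents, matched lazily
    \4"                                     # the same delimiter, then the closing quote
''', re.VERBOSE)

# Plain alternation versus its unrolled form, for double-quoted literals only.
simple = re.compile(r'"((?:[^"\\]|\\.)*)"')
unrolled = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')

sample = r'x = "a \"quoted\" word"; y = "";'
print(simple.findall(sample) == unrolled.findall(sample))
```

Inside the triple-quoted raw string the single quotes need no backslash, which is why \' from the one-line version appears here as a bare ' .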

Sample Python demo:

import re

p = re.compile(r'(?P<prefix>(?:\bu8|\b[LuU])?)(?:"(?P<dbl>[^"\\]*(?:\\.[^"\\]*)*)"|\'(?P<sngl>[^\'\\]*(?:\\.[^\'\\]*)*)\')|R"([^"(]*)\((?P<raw>.*?)\)\4"')
s = "\"text'\\\"here\"\nL'text\\'\"here'\nu8\"text'\\\"here\"\nu'text\\'\"here'\nU\"text'\\\"here\"\nR\"delimiter(text\"'\"here)delimiter\""
print(s)
print('--------- Regex works below ---------')
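for x in p.finditer(s):
    # Compare against None: an empty literal such as "" captures '' (which is falsy).
    if x.group("dbl") is not None:
        print(x.group("dbl"))
    elif x.group("sngl") is not None:
        print(x.group("sngl"))
    else:
        print(x.group("raw"))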
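
The pattern above keeps the prefix group out of the raw-string branch and, without re.S, will not match a raw literal that spans several lines. A possible variant for raw strings only, assuming you want the prefix captured and multiline bodies matched, might look like the sketch below. RAW_LITERAL, the delim group name and the sample source line are made up for illustration.

```python
import re

# Raw string literals only: optional prefix, R, delimiter, body, same delimiter.
RAW_LITERAL = re.compile(
    r'(?P<prefix>(?:\bu8|\b[LuU])?)'  # optional prefix, as in the main pattern
    r'R"(?P<delim>[^"(]*)'            # R, opening quote, delimiter before (
    r'\((?P<raw>.*?)\)'               # body, matched lazily
    r'(?P=delim)"',                   # the same delimiter, then the closing quote
    re.S,                             # let . match newlines inside the body
)

# Hypothetical C++ source line containing a prefixed, two-line raw literal.
src = 'auto q = u8R"sql(SELECT "a"\nFROM t)sql";'

m = RAW_LITERAL.search(src)
print(m.group("prefix"), m.group("delim"))
print(m.group("raw"))
```

The named backreference (?P=delim) plays the role of \4 in the original pattern, so the numbering no longer has to be counted by hand.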
