
Regex to parse dialogue in Python

I want to parse three types of lines from a file in Python:

"Name" "Something to say !"
"Just a descriptive sentence"
name "Something to say !"

I want to get the name and the sentence, and if there is no name, just the sentence. I read each line of the file and use re to see if the regex matches. It works pretty well except for this one:

"Name" "Something to say !"

It just returns the whole thing instead of two parts.

Here is my regex:

r"(\"[a-zA-z?]*\"|[a-zA-z]*)\s\"(.+)\""

You can capture an optional opening " in a group and use a backreference to that group, so the closing double quote is required only when the opening one was matched.

Then you can make the whole first part, including the whitespace char, optional, and match the second part between double quotes.

Note that [a-zA-z] matches more than [a-zA-Z]: the range A-z also covers the characters [ \ ] ^ _ and `. Also, the ? inside the character class matches a literal question mark.

The name is in group 1 and the sentence is in group 3.
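As a quick check of the character-class remark, the following sketch lists the ASCII characters accepted by [a-zA-z] but not by [a-zA-Z]. The names wide, strict and extra are only illustrative.

```python
import re

# Compare the accidental range A-z with the intended range A-Z (ASCII only).
wide = re.compile(r"[a-zA-z]")
strict = re.compile(r"[a-zA-Z]")

extra = [chr(c) for c in range(128)
         if wide.fullmatch(chr(c)) and not strict.fullmatch(chr(c))]
print(extra)  # ['[', '\\', ']', '^', '_', '`']

# Inside a character class, ? is a literal question mark, not a quantifier.
question_mark_is_literal = bool(re.fullmatch(r"[a-z?]", "?"))
print(question_mark_is_literal)  # True
```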

(?:(("?)[a-zA-Z]+\2)\s)?("[^"]+")
  • (?: Non-capture group
    • ( Capture group 1
      • ("?) Capture an optional " in group 2
      • [a-zA-Z]+ Match a char a-zA-Z 1+ times
      • \2 A backreference to group 2 to match exactly what was matched in that group
    • )\s Close group 1 and match a whitespace char
  • )? Close the non-capture group and make it optional
  • ("[^"]+") Capture group 3, match from " till "
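The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is only a sketch: the names NAME_AND_SENTENCE and split_line are made up for illustration, and fullmatch is used on the assumption that each line should fit the pattern entirely.

```python
import re

NAME_AND_SENTENCE = re.compile(r'''
    (?:                # optional leading part: name plus one whitespace char
        (              # group 1: the name, including its quotes if present
            ("?)       # group 2: an optional opening quote
            [a-zA-Z]+  # the name itself
            \2         # a closing quote only if group 2 matched one
        )
        \s
    )?
    ("[^"]+")          # group 3: the quoted sentence
''', re.VERBOSE)


def split_line(line):
    """Return (name, sentence), or None if the whole line does not fit."""
    m = NAME_AND_SENTENCE.fullmatch(line)
    return (m.group(1), m.group(3)) if m else None


print(split_line('"Name" "Hi there"'))  # ('"Name"', '"Hi there"')
print(split_line('name "Hi there"'))    # ('name', '"Hi there"')
print(split_line('"Hi there"'))         # (None, '"Hi there"')
print(split_line('"Name "Hi there"'))   # None: opening quote without its partner
```

The last call shows what the backreference buys: a name with an opening quote but no closing quote is rejected rather than silently accepted.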


Example using re.finditer to loop over the matches:

import re

regex = r'(?:(("?)[a-zA-Z]+\2)\s)?("[^"]+")'
s = ('"Name" "Something to say !"\n'
     '"Just a descriptive sentence"\n'
     'name "Something to say !"\n'
     '"Name" "Something to say !"')

# Group 1 is the optional name, group 3 is the quoted sentence.
for match in re.finditer(regex, s):
    print(f"Name: {match.group(1)} Sentence: {match.group(3)}")

Output

Name: "Name" Sentence: "Something to say !"
Name: None Sentence: "Just a descriptive sentence"
Name: name Sentence: "Something to say !"
Name: "Name" Sentence: "Something to say !"
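Since the question reads the file line by line, here is a minimal sketch of how the same pattern could be applied per line, with the surrounding quotes stripped from the results. The function name parse_dialogue is hypothetical, and io.StringIO only stands in for an open file.

```python
import io
import re

LINE_RE = re.compile(r'(?:(("?)[a-zA-Z]+\2)\s)?("[^"]+")')


def parse_dialogue(lines):
    """Yield (name, sentence) pairs without quotes; name is None if absent."""
    for line in lines:
        m = LINE_RE.fullmatch(line.strip())
        if m is None:
            continue  # skip blank or malformed lines
        name = m.group(1).strip('"') if m.group(1) else None
        yield name, m.group(3)[1:-1]


# io.StringIO stands in for a real file object here.
script = io.StringIO('"Name" "Something to say !"\n'
                     '"Just a descriptive sentence"\n'
                     'name "Something to say !"\n')
parsed = list(parse_dialogue(script))
print(parsed)
# [('Name', 'Something to say !'), (None, 'Just a descriptive sentence'), ('name', 'Something to say !')]
```

With a real file, the same generator could be fed the file object from open(), since iterating a file yields its lines.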

Solution

Your best option, in my view, is to use named capture groups. Here's how:

import re

lines = [
    '"Name" "Something to say !"',
    '"Just a descriptive sentence"',
    'name "Something to say !"'
    ]

p = re.compile(r"(\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\")")

for line in lines:
    m = p.search(line)
    print(m["part1"])
    print(m["part2"])

The output will be:

Name
Something to say !
None
Just a descriptive sentence
name
Something to say !

Explanation

The regex (\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\") consists of two main parts. I'll go through the first one, (\"?(?P<part1>.+?)\"? )?. The second one is very similar.

  • An outer group (...)? with the "zero or one" quantifier ?, which makes the whole name part optional. So in your second case, only the 'part2' capturing group will be active.
  • Inside this group, the quotes are also marked with the "zero or one" quantifier to cover your third case: \"?
  • The part (?P<part1>.+?) matches the text between the quotes and assigns the name "part1" for easy access.
    • . matches any character except a newline
    • +? matches one or more of the previous lazily (as few characters as possible, as many as needed). This is needed to exclude the second quote from the match.
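To illustrate the lazy quantifier in isolation, here is a small comparison of .+ and .+? on the first sample line. The patterns are simplified stand-ins for illustration, not the full regex from the answer.

```python
import re

line = '"Name" "Something to say !"'

# Greedy: .+ runs to the end of the line and backs off to the LAST quote.
greedy = re.match(r'"(.+)"', line).group(1)
# Lazy: .+? stops at the FIRST quote that lets the match succeed.
lazy = re.match(r'"(.+?)"', line).group(1)

print(greedy)  # Name" "Something to say !
print(lazy)    # Name
```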

With this regex, you can access the content of the named capturing groups via square-bracket syntax, as shown in the code above.

Capturing the quotes
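For completeness, a short sketch of the other ways to read named groups. Square-bracket access on a match object requires Python 3.6 or later; m.group("name") and m.groupdict() also work on older versions. The "(narration)" fallback is just an illustrative placeholder for lines without a name.

```python
import re

p = re.compile(r"(\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\")")

m = p.search('name "Something to say !"')
print(m["part1"])        # name
print(m.group("part2"))  # Something to say !
print(m.groupdict())     # {'part1': 'name', 'part2': 'Something to say !'}

# A group that did not take part in the match is None, so a fallback is easy.
m2 = p.search('"Just a descriptive sentence"')
speaker = m2["part1"] or "(narration)"
print(speaker)           # (narration)
```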

If you want to capture not only the text in quotes but also the quotes themselves, move the \" inside the named capturing groups, like so: ((?P<part1>\"?.+?\"?) )?((?P<part2>\".+\")). Both quotes inside part1 must stay optional (\"?), otherwise the unquoted name in your third case is no longer captured.
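A quick check of a quote-capturing variant against the three sample lines. This sketch assumes both quotes around the name are optional ("?), so that an unquoted name is still captured; the variable names are illustrative.

```python
import re

# Quotes now sit inside the named groups, so they are part of the captures.
p = re.compile(r'((?P<part1>"?.+?"?) )?(?P<part2>".+")')

lines = [
    '"Name" "Something to say !"',
    '"Just a descriptive sentence"',
    'name "Something to say !"',
]

results = [(m["part1"], m["part2"]) for m in map(p.search, lines)]
for name, sentence in results:
    print(name, sentence)
# "Name" "Something to say !"
# None "Just a descriptive sentence"
# name "Something to say !"
```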
