Regex to parse dialogue in Python
I want to parse three types of lines from a file in Python:
"Name" "Something to say !"
"Just a descriptive sentence"
name "Something to say !"
I want to get the name and the sentence, or just the sentence if there is no name. I read each line of the file and use re to see whether the regex matches. It works pretty well except for this one:
"Name" "Something to say !"
It just returns the whole thing instead of two parts.
Here is my regex:
r"(\"[a-zA-z?]*\"|[a-zA-z]*)\s\"(.+)\""
You could use a capture group for an optional " together with a backreference to it, so that the name is matched either with both surrounding double quotes or with none.
Then you can make the whole first part, including the whitespace character, optional, and match the second part between double quotes.
Note that [a-zA-z] matches more than [a-zA-Z]: the range A-z also covers the ASCII characters between Z and a, namely [ \ ] ^ _ and `. Also, the ? inside the character class matches a literal question mark.
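To make the character class note concrete, here is a small sketch that lists the ASCII characters accepted by [a-zA-z] but not by [a-zA-Z]. The variable names (wide, strict, extras) are only illustrative and not part of the original answer.

```python
import re

# The ASCII range 'A' (65) to 'z' (122) also covers the six characters
# that sit between 'Z' (90) and 'a' (97).
wide = re.compile(r"[a-zA-z]")
strict = re.compile(r"[a-zA-Z]")

# Collect every ASCII character matched by the wide class only
extras = [chr(c) for c in range(128)
          if wide.fullmatch(chr(c)) and not strict.fullmatch(chr(c))]
print(extras)

# Inside a character class, ? is a literal question mark, not a quantifier
question_mark_is_literal = bool(re.fullmatch(r"[a-z?]+", "why?"))
print(question_mark_is_literal)
```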
With the pattern below, the name is in group 1 and the sentence is in group 3.
(?:(("?)[a-zA-Z]+\2)\s)?("[^"]+")
(?:            Non-capturing group
  (            Capture group 1
    ("?)       Capture an optional " in group 2
    [a-zA-Z]+  Match one or more characters in the ranges a-z and A-Z
    \2         Backreference to group 2, matching exactly what that group matched
  )\s          Close group 1 and match a whitespace character
)?             Close the non-capturing group and make it optional
("[^"]+")      Capture group 3: match from " up to the next "
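Since the question reads the file line by line, the pattern above can also be applied per line. The following is a minimal sketch, assuming each line should match in full and that the surrounding quotes should be dropped from the result; parse_line and LINE_RE are hypothetical names, not part of the original answer.

```python
import re

# Same pattern as above: optional name (quoted or bare), then a quoted sentence
LINE_RE = re.compile(r'(?:(("?)[a-zA-Z]+\2)\s)?("[^"]+")')

def parse_line(line):
    """Return (name, sentence) without surrounding quotes, or None if the line does not match."""
    m = LINE_RE.fullmatch(line.strip())
    if m is None:
        return None
    name = m.group(1)
    if name is not None:
        name = name.strip('"')   # drop the quotes around a quoted name
    sentence = m.group(3)[1:-1]  # drop the quotes around the sentence
    return name, sentence

for line in ['"Name" "Something to say !"',
             '"Just a descriptive sentence"',
             'name "Something to say !"']:
    print(parse_line(line))
```

Because of the backreference, a name with only one quote (for example "Name without the closing quote) is not accepted as a name, and with fullmatch such a line is rejected rather than partially matched.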
Example using re.finditer to loop over the matches:
import re

regex = r"(?:((\"?)[a-zA-Z]+\2)\s)?(\"[^\"]+\")"
s = ("\"Name\" \"Something to say !\"\n"
     "\"Just a descriptive sentence\"\n"
     "name \"Something to say !\"\n"
     "\"Name\" \"Something to say !\"")

# Group 1 holds the optional name, group 3 the quoted sentence
for match in re.finditer(regex, s):
    print(f"Name: {match.group(1)} Sentence: {match.group(3)}")
Output
Name: "Name" Sentence: "Something to say !"
Name: None Sentence: "Just a descriptive sentence"
Name: name Sentence: "Something to say !"
Name: "Name" Sentence: "Something to say !"
Your best option in my view is to use named capture groups. Here's how:
import re

lines = [
    '"Name" "Something to say !"',
    '"Just a descriptive sentence"',
    'name "Something to say !"'
]

p = re.compile(r"(\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\")")

for line in lines:
    m = p.search(line)
    print(m["part1"])  # the name, or None if the line has no name
    print(m["part2"])  # the sentence without its quotes
The output will be:
Name
Something to say !
None
Just a descriptive sentence
name
Something to say !
The regex (\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\") consists of two main parts. I'll go through the first one, (\"?(?P<part1>.+?)\"? )?. The second one is very similar.
(...)? is a group with the "zero or one" quantifier ?, which makes the whole name part optional.
\"? matches an optional double quote.
(?P<part1>.+?) matches the text between the quotes and assigns it the name "part1" for easy access.
. matches any character except a newline.
+? matches one or more of the previous token lazily (as few characters as possible, expanding only as needed).
With this regex, you can access the content of the named capturing groups via square-bracket syntax, as shown in the code above.
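As a short illustration of accessing named groups, the sketch below shows the square-bracket syntax next to two equivalent calls from the re module, Match.group and Match.groupdict. The variable names are only illustrative.

```python
import re

p = re.compile(r"(\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\")")
m = p.search('"Name" "Something to say !"')

by_index = m["part1"]         # square-bracket access (Python 3.6+)
by_method = m.group("part2")  # equivalent method call
as_dict = m.groupdict()       # all named groups as a dict
print(by_index, by_method, as_dict)

# A named group that did not take part in the match yields None;
# groupdict accepts a default to substitute instead.
m2 = p.search('"Just a descriptive sentence"')
no_name = m2.groupdict(default="")
print(no_name)
```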
If you want to capture not only the text in quotes but also the quotes themselves, move the \" inside the named capturing groups, keeping the quotes around the name optional so that an unquoted name still matches: ((?P<part1>\"?.+?\"?) )?(?P<part2>\".+\")
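A minimal sketch of the quote-preserving variant follows. The exact pattern used here is an assumption of this example: it leaves both quotes around the name optional, so that an unquoted name such as name is still captured in part1.

```python
import re

lines = [
    '"Name" "Something to say !"',
    '"Just a descriptive sentence"',
    'name "Something to say !"',
]

# Assumed variant: the quotes sit inside the named groups,
# and the quotes around the name are optional
p = re.compile(r"((?P<part1>\"?.+?\"?) )?(?P<part2>\".+\")")

results = []
for line in lines:
    m = p.search(line)
    results.append((m["part1"], m["part2"]))

for name, sentence in results:
    print(name, sentence)
```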