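Regular Expression, using non-greedy to catch optional string
I am parsing the contents of a PDF with PDFMiner. Sometimes a particular line is present and sometimes it is not. I am trying to express that optional line in a regular expression, without any success. Here is a piece of code that shows the problem: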
#!/usr/bin/python3
# coding=UTF8
import re
# Simulate reading text of a PDF file with PDFMiner.
pdfContent = """
Blah blah.
Date: 2022-01-31
Optional line here which sometimes does not show
Amount: 123.45
2: Blah blah.
"""
RE = re.compile(
r".*?"
r"Date:\s+(\S+).*?"
r"(Optional line here which sometimes does not show){0,1}.*?"
r"Amount:\s+(?P<amount>\S+)\n.*?"
, re.MULTILINE | re.DOTALL)
matches = RE.match(pdfContent)
date = matches.group(1)
optional = matches.group(2)
amount = matches.group("amount")
print(f"date = {date}")
print(f"optional = {optional}")
print(f"amount = {amount}")
The output is:
date = 2022-01-31
optional = None
amount = 123.45
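Why is optional None? Note that if I replace {0,1} with {1}, it works, but then the line is no longer optional. I do use the non-greedy .*? everywhere...
This may be a duplicate, but I searched and searched and could not find my answer, hence this question.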
You can use re.search (instead of re.match) with this pattern:
Date:\s+(\S+)(?:.*?(Optional line here which sometimes does not show))?.*?Amount:\s+(?P<amount>\S+)
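See the regex demo.
In this pattern, .*?(Optional line here which sometimes does not show)? (where {0,1} is the same as ?) is wrapped in an optional non-capturing group, (?:...)?, which must be tried at least once because ? is a greedy quantifier. In the original pattern, the lazy .*? before the optional group first matches nothing, the optional group then fails at the line break and settles for zero repetitions, and the following .*? consumes the optional line on the way to Amount, so group 2 stays None.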
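The difference between the two shapes can be seen side by side in a minimal sketch. The shortened text and the names `loose` and `wrapped` are made up for this illustration and are not part of the original question or answer; the expected values in the comments follow from the matching behavior described above.

```python
import re

# Shortened stand-in for the PDF text (hypothetical sample data).
text = "Date: 2022-01-31\nOptional line\nAmount: 123.45\n"

# Original shape: a lazy gap followed by an optional capture group.
# The lazy gap matches nothing first, the optional group fails at the
# newline and accepts zero repetitions, so it never captures.
loose = re.compile(
    r"Date:\s+(\S+).*?(Optional line)?.*?Amount:\s+(\S+)", re.DOTALL
)

# Wrapped shape: the lazy gap lives inside the optional group, so the
# greedy ? makes the engine try "gap + optional line" before giving up.
wrapped = re.compile(
    r"Date:\s+(\S+)(?:.*?(Optional line))?.*?Amount:\s+(\S+)", re.DOTALL
)

loose_groups = loose.search(text).groups()
wrapped_groups = wrapped.search(text).groups()

print(loose_groups)    # ('2022-01-31', None, '123.45')
print(wrapped_groups)  # ('2022-01-31', 'Optional line', '123.45')
```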
In your code, you can use it as
RE = re.compile(
r"Date:\s+(\S+)(?:.*?"
r"(Optional line here which sometimes does not show))?.*?"
r"Amount:\s+(?P<amount>\S+)",
re.DOTALL)
matches = RE.search(pdfContent)
See the Python demo:
import re
pdfContent = "\n\nBlah blah.\n\nDate: 2022-01-31\n\nOptional line here which sometimes does not show\n\nAmount: 123.45\n\n2: Blah blah.\n"
RE = re.compile(
r"Date:\s+(\S+)(?:.*?"
r"(Optional line here which sometimes does not show))?.*?"
r"Amount:\s+(?P<amount>\S+)",
re.DOTALL)
matches = RE.search(pdfContent)
date = matches.group(1)
optional = matches.group(2)
amount = matches.group("amount")
print(f"date = {date}")
print(f"optional = {optional}")
print(f"amount = {amount}")
Output:
date = 2022-01-31
optional = Optional line here which sometimes does not show
amount = 123.45
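Since the line is optional and re.search returns None when nothing matches, it may be worth checking both cases before calling group(). The following is only a sketch under assumptions: the helper name `parse_record`, the constant `RECORD_RE`, the named groups `date` and `optional`, and the sample strings are all hypothetical and not part of the original answer.

```python
import re
from typing import Optional

# Same pattern as in the answer, with named groups added for readability
# (the names "date" and "optional" are assumptions of this sketch).
RECORD_RE = re.compile(
    r"Date:\s+(?P<date>\S+)"
    r"(?:.*?(?P<optional>Optional line here which sometimes does not show))?"
    r".*?Amount:\s+(?P<amount>\S+)",
    re.DOTALL,
)


def parse_record(text: str) -> Optional[dict]:
    """Return the captured fields, or None when the text does not match."""
    match = RECORD_RE.search(text)
    if match is None:
        return None
    # groupdict() maps each named group to its text, or None if it did not take part.
    return match.groupdict()


with_line = (
    "Date: 2022-01-31\n"
    "Optional line here which sometimes does not show\n"
    "Amount: 123.45\n"
)
without_line = "Date: 2022-01-31\nAmount: 123.45\n"

print(parse_record(with_line))
print(parse_record(without_line))
print(parse_record("no fields here"))
```

One caveat with this pattern: because of re.DOTALL, the lazy gap inside the optional group can reach past the first Amount if the optional line only appears later in the text, so input that contains several records would probably need to be split into records first.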