Regular Expression, using non-greedy to catch optional string
I am parsing the content of a PDF with PDFMiner, and one particular line is sometimes present and sometimes not. I am trying to express that optional line in a regular expression, without any success. Here is a piece of code that shows the problem:
#!/usr/bin/python3
# coding=UTF8
import re
# Simulate reading text of a PDF file with PDFMiner.
pdfContent = """
Blah blah.
Date: 2022-01-31
Optional line here which sometimes does not show
Amount: 123.45
2: Blah blah.
"""
RE = re.compile(
r".*?"
r"Date:\s+(\S+).*?"
r"(Optional line here which sometimes does not show){0,1}.*?"
r"Amount:\s+(?P<amount>\S+)\n.*?"
, re.MULTILINE | re.DOTALL)
matches = RE.match(pdfContent)
date = matches.group(1)
optional = matches.group(2)
amount = matches.group("amount")
print(f"date = {date}")
print(f"optional = {optional}")
print(f"amount = {amount}")
The output is:
date = 2022-01-31
optional = None
amount = 123.45
Why is optional None? Notice that if I replace the {0,1} with {1}, it works, but then the line is not optional anymore. I do use the non-greedy .*? everywhere...
This is probably a duplicate, but I searched and searched and did not find my answer, thus this question.
You can use re.search (instead of re.match) with the following pattern:
Date:\s+(\S+)(?:.*?(Optional line here which sometimes does not show))?.*?Amount:\s+(?P<amount>\S+)
See the regex demo.
In this pattern, .*?(Optional line here which sometimes does not show) is wrapped in an optional non-capturing group, (?:...)? (note that {0,1} = ?). Since ? is a greedy quantifier, the group must be tried at least once before the regex engine is allowed to skip it. In the original pattern, the lazy .*? in front of the optional group first matches nothing, the optional group fails at that position and is skipped, and the following .*? then consumes everything up to Amount:, so group 2 stays None.
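To make the difference concrete, here is a minimal sketch that runs both patterns against the same text. The names text, OPTIONAL, original and fixed are chosen here for illustration and are not from the question; the original pattern is also slightly trimmed (no leading or trailing .*?, search instead of match), which should not change what group 2 captures.

```python
import re

# Shortened stand-in for the PDF text (an assumption for this sketch).
text = (
    "Blah blah.\n"
    "Date: 2022-01-31\n"
    "Optional line here which sometimes does not show\n"
    "Amount: 123.45\n"
)

OPTIONAL = "Optional line here which sometimes does not show"

# Question's approach: the optional group sits between two lazy .*? parts,
# so it is only ever tried right after the date, where it cannot match.
original = re.compile(
    r"Date:\s+(\S+).*?(" + OPTIONAL + r")?.*?Amount:\s+(\S+)",
    re.DOTALL)

# Answer's approach: the lazy .*? is moved inside an optional
# non-capturing group, so the engine scans forward for the line first.
fixed = re.compile(
    r"Date:\s+(\S+)(?:.*?(" + OPTIONAL + r"))?.*?Amount:\s+(\S+)",
    re.DOTALL)

original_optional = original.search(text).group(2)
fixed_optional = fixed.search(text).group(2)

print(f"original: {original_optional}")
print(f"fixed:    {fixed_optional}")
```

With the question's layout the optional group is skipped even though the line is present; with the answer's layout it is captured.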
In your code, you can use it as
RE = re.compile(
r"Date:\s+(\S+)(?:.*?"
r"(Optional line here which sometimes does not show))?.*?"
r"Amount:\s+(?P<amount>\S+)",
re.DOTALL)
matches = RE.search(pdfContent)
See the Python demo:
import re
pdfContent = "\n\nBlah blah.\n\nDate: 2022-01-31\n\nOptional line here which sometimes does not show\n\nAmount: 123.45\n\n2: Blah blah.\n"
RE = re.compile(
r"Date:\s+(\S+)(?:.*?"
r"(Optional line here which sometimes does not show))?.*?"
r"Amount:\s+(?P<amount>\S+)",
re.DOTALL)
matches = RE.search(pdfContent)
date = matches.group(1)
optional = matches.group(2)
amount = matches.group("amount")
print(f"date = {date}")
print(f"optional = {optional}")
print(f"amount = {amount}")
Output:
date = 2022-01-31
optional = Optional line here which sometimes does not show
amount = 123.45
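It may also be worth checking that the line really is still optional. The sketch below wraps the same pattern in a helper and runs it on text with and without the line. The helper name parse_fields, the group names date and optional, and the sample strings are assumptions made for this example. It also guards against search returning None, which the demo above does not do.

```python
import re

# Same pattern as in the answer, with named groups added for readability
# (the names "date" and "optional" are introduced for this sketch).
PATTERN = re.compile(
    r"Date:\s+(?P<date>\S+)(?:.*?"
    r"(?P<optional>Optional line here which sometimes does not show))?.*?"
    r"Amount:\s+(?P<amount>\S+)",
    re.DOTALL)


def parse_fields(text):
    """Return the captured fields as a dict, or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return m.groupdict()


with_line = (
    "Date: 2022-01-31\n"
    "Optional line here which sometimes does not show\n"
    "Amount: 123.45\n"
)
without_line = "Date: 2022-01-31\nAmount: 123.45\n"

print(parse_fields(with_line))
print(parse_fields(without_line))
print(parse_fields("no fields here"))
```

When the line is absent, the optional group is tried, fails, and is skipped, so the optional entry is None while date and amount are still captured.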