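Regular Expression, using non-greedy to catch optional string
I am parsing the content of a PDF with PDFMiner. Sometimes a certain line is present and sometimes it is not. I have tried to express the optional line in a regex, without any success. Here is a piece of code that shows the problem: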
#!/usr/bin/python3
# coding=UTF8
import re
# Simulate reading text of a PDF file with PDFMiner.
pdfContent = """
Blah blah.
Date: 2022-01-31
Optional line here which sometimes does not show
Amount: 123.45
2: Blah blah.
"""
RE = re.compile(
r".*?"
r"Date:\s+(\S+).*?"
r"(Optional line here which sometimes does not show){0,1}.*?"
r"Amount:\s+(?P<amount>\S+)\n.*?"
, re.MULTILINE | re.DOTALL)
matches = RE.match(pdfContent)
date = matches.group(1)
optional = matches.group(2)
amount = matches.group("amount")
print(f"date = {date}")
print(f"optional = {optional}")
print(f"amount = {amount}")
The output is:
date = 2022-01-31
optional = None
amount = 123.45
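Why is optional None? Note that if I replace {0,1} with {1}, it works, but then the line is no longer optional. I do use the non-greedy .*? everywhere...
This is probably a duplicate, but I searched and searched and did not find my answer, hence this question.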
You can use re.search (instead of re.match) with the following pattern:
Date:\s+(\S+)(?:.*?(Optional line here which sometimes does not show))?.*?Amount:\s+(?P<amount>\S+)
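See the regex demo.
In this pattern, .*?(Optional line here which sometimes does not show)? (where {0,1} is the same as ?) is wrapped in an optional non-capturing group, (?:...)?, which must be tried at least once because ? is a greedy quantifier. In the original pattern, the lazy .*? before the optional group first matches nothing, the optional group fails at that position and is skipped, and the following .*? then consumes everything up to Amount, so group 2 stays None.
In your code, you can use it as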
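To make the difference concrete, here is a minimal sketch that compares the two pattern shapes on a shortened input. The names `text`, `loose` and `wrapped`, and the abbreviated "Optional line" literal, are illustrative assumptions and do not come from the original code.

```python
import re

# Shortened stand-in for the PDF text (assumption: the real text is longer).
text = "Date: 2022-01-31\nOptional line\nAmount: 123.45\n"

# Original shape: a lazy .*? followed by an optional capturing group.
# The lazy .*? matches nothing, the group fails right after the date and is
# skipped, and the second .*? then runs up to "Amount:".
loose = re.compile(
    r"Date:\s+(\S+).*?(Optional line)?.*?Amount:\s+(\S+)", re.DOTALL)

# Fixed shape: the lazy .*? sits inside the optional non-capturing group, so
# the engine expands it looking for the optional text before giving up.
wrapped = re.compile(
    r"Date:\s+(\S+)(?:.*?(Optional line))?.*?Amount:\s+(\S+)", re.DOTALL)

print(loose.search(text).group(2))    # None
print(wrapped.search(text).group(2))  # Optional line
```

Both patterns still match the date and the amount; only the capture of the optional line differs.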
RE = re.compile(
r"Date:\s+(\S+)(?:.*?"
r"(Optional line here which sometimes does not show))?.*?"
r"Amount:\s+(?P<amount>\S+)",
re.DOTALL)
matches = RE.search(pdfContent)
See the Python demo:
import re
pdfContent = "\n\nBlah blah.\n\nDate: 2022-01-31\n\nOptional line here which sometimes does not show\n\nAmount: 123.45\n\n2: Blah blah.\n"
RE = re.compile(
r"Date:\s+(\S+)(?:.*?"
r"(Optional line here which sometimes does not show))?.*?"
r"Amount:\s+(?P<amount>\S+)",
re.DOTALL)
matches = RE.search(pdfContent)
date = matches.group(1)
optional = matches.group(2)
amount = matches.group("amount")
print(f"date = {date}")
print(f"optional = {optional}")
print(f"amount = {amount}")
Output:
date = 2022-01-31
optional = Optional line here which sometimes does not show
amount = 123.45
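The demo above only exercises the case where the optional line is present. The following sketch, under stated assumptions, checks that the same pattern still matches when the line is missing, and guards against re.search returning None. The helper `parse_record`, the named groups `date` and `optional`, and the sample strings are hypothetical additions for illustration.

```python
import re

RE = re.compile(
    r"Date:\s+(?P<date>\S+)(?:.*?"
    r"(?P<optional>Optional line here which sometimes does not show))?.*?"
    r"Amount:\s+(?P<amount>\S+)",
    re.DOTALL)

def parse_record(text):
    """Hypothetical helper: return the captured fields, or None if no match."""
    m = RE.search(text)
    if m is None:
        return None
    # groupdict() maps each named group to its text, or None if it did not match.
    return m.groupdict()

with_line = ("Date: 2022-01-31\n"
             "Optional line here which sometimes does not show\n"
             "Amount: 123.45\n")
without_line = "Date: 2022-01-31\nAmount: 123.45\n"

print(parse_record(with_line)["optional"])
print(parse_record(without_line)["optional"])  # None
print(parse_record("no fields here"))          # None
```

Checking for None before calling group or groupdict avoids an AttributeError on pages where neither field is found.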