Regex: capturing an optional string with non-greedy matching

Question

I am parsing the content of a PDF with PDFMiner, and there is a line that is sometimes present and sometimes not. I am trying to express the optional line in a regex, without success. Here is a piece of code that shows the problem:

#!/usr/bin/python3
# coding=UTF8

import re

# Simulate reading text of a PDF file with PDFMiner.
pdfContent = """

Blah blah.

Date:  2022-01-31

Optional line here which sometimes does not show

Amount:  123.45

2: Blah blah.

"""

RE = re.compile(
    r".*?"
    r"Date:\s+(\S+).*?"
    r"(Optional line here which sometimes does not show){0,1}.*?"
    r"Amount:\s+(?P<amount>\S+)\n.*?"
    , re.MULTILINE | re.DOTALL)

matches = RE.match(pdfContent)

date     = matches.group(1)
optional = matches.group(2)
amount   = matches.group("amount")

print(f"date     = {date}")
print(f"optional = {optional}")
print(f"amount   = {amount}")

The output is:

date     = 2022-01-31
optional = None
amount   = 123.45

Why is optional None? Notice that if I replace the {0,1} with {1}, it works, but then the line is not optional anymore. I do use the non-greedy .*? everywhere...

This is probably a duplicate, but I searched and searched and did not find my answer, hence this question.

Answer 1

You can use re.search (instead of re.match) with

Date:\s+(\S+)(?:.*?(Optional line here which sometimes does not show))?.*?Amount:\s+(?P<amount>\S+)

See the regex demo.

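In this pattern, .*?(Optional line here which sometimes does not show) is wrapped in an optional non-capturing group, (?:...)? (note that {0,1} is the same as ?). Since ? is a greedy quantifier, the group must be tried at least once before the engine falls back to skipping it. In the original pattern, the lazy .*? in front of the optional group first matches zero characters, the optional group fails at that position and matches nothing, and the following .*? then consumes everything up to Amount:, so group 2 stays None.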
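The difference can be reproduced on a toy string. This is only a minimal sketch: the input "a b c" and the names loose and wrapped are illustrative and do not come from the original post.

```python
import re

text = "a b c"

# Optional group preceded by a lazy .*?: the lazy part first matches zero
# characters, (b)? fails at that position and, being optional, matches
# nothing; the later .*? then consumes " b " on the way to "c".
loose = re.search(r"a.*?(b)?.*?c", text)

# Lazy .*? moved inside an optional non-capturing group: the greedy ? makes
# the engine try the whole group first, so .*? expands until (b) matches.
wrapped = re.search(r"a(?:.*?(b))?.*?c", text)

print(loose.group(1))    # None
print(wrapped.group(1))  # b
```

Both patterns match the whole string; only the second one forces the engine to look for the optional text before giving up on it.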

In your code, you can use it as

RE = re.compile(
    r"Date:\s+(\S+)(?:.*?"
    r"(Optional line here which sometimes does not show))?.*?"
    r"Amount:\s+(?P<amount>\S+)",
    re.DOTALL)

matches = RE.search(pdfContent)

See the Python demo:

import re
 
pdfContent = "\n\nBlah blah.\n\nDate:  2022-01-31\n\nOptional line here which sometimes does not show\n\nAmount:  123.45\n\n2: Blah blah.\n"
 
RE = re.compile(
    r"Date:\s+(\S+)(?:.*?"
    r"(Optional line here which sometimes does not show))?.*?"
    r"Amount:\s+(?P<amount>\S+)",
    re.DOTALL)
 
matches = RE.search(pdfContent)
date     = matches.group(1)
optional = matches.group(2)
amount   = matches.group("amount")
 
print(f"date     = {date}")
print(f"optional = {optional}")
print(f"amount   = {amount}")

Output:

date     = 2022-01-31
optional = Optional line here which sometimes does not show
amount   = 123.45
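To check that the line really is optional, the same pattern can be run against text with and without it. The following is a sketch under assumptions: the helper parse_record, the constant OPTIONAL_LINE, the named groups date and optional, and the two sample strings are all hypothetical and not part of the original answer.

```python
import re

# Hypothetical constant holding the optional line's literal text.
OPTIONAL_LINE = "Optional line here which sometimes does not show"

# Same structure as the pattern in the answer, with named groups added.
RECORD_RE = re.compile(
    r"Date:\s+(?P<date>\S+)"
    r"(?:.*?(?P<optional>" + re.escape(OPTIONAL_LINE) + r"))?.*?"
    r"Amount:\s+(?P<amount>\S+)",
    re.DOTALL,
)


def parse_record(text):
    """Return a dict of the captured fields, or None if nothing matches."""
    m = RECORD_RE.search(text)
    return m.groupdict() if m else None


with_line = "Date:  2022-01-31\n\n" + OPTIONAL_LINE + "\n\nAmount:  123.45\n"
without_line = "Date:  2022-01-31\n\nAmount:  123.45\n"

print(parse_record(with_line)["optional"])     # the optional line's text
print(parse_record(without_line)["optional"])  # None
```

One caveat with this approach: under re.DOTALL the lazy .*? inside the optional group may scan past the nearest Amount: to a later occurrence of the optional line, so text containing several records would probably need to be split into records first.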

Solution 1 (accepted), 2022-08-20 17:36:57
