Python regex to match a multi-line string

Question

my_str:

PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'

My code:

regex = re.compile(r'(Applicants:)( )?(.*)', re.MULTILINE)
print(regex.findall(text))

My output:

[('Applicants:', ' ', 'Silixa Ltd.')]

What I need is the string between 'Applicants:' and '\nInventors:':

'Silixa Ltd.' & 'Chevron U.S.A. Inc. (Incorporated
in USA - California)'

Thanks in advance for your help.

Answer 1

Try using re.DOTALL instead:

import re

text='''PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'''

regex = re.compile(r'Applicants:(.*?)Inventors:', re.DOTALL)
print(regex.findall(text))

gives me

$ python test.py
[' Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n\n']

The reason this works is that MULTILINE doesn't let the dot (.) match newlines, whereas DOTALL does. MULTILINE only changes where ^ and $ match.
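The captured block above still carries a leading space and blank lines. A minimal sketch of turning it into the two applicant names the question asks for, assuming (as in this sample text) that applicants are separated by blank lines; the names `block` and `applicants` are only illustrative:

```python
import re

text = '''PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'''

# Capture everything between the two labels; DOTALL lets '.' cross newlines.
match = re.search(r'Applicants:(.*?)Inventors:', text, re.DOTALL)
block = match.group(1) if match else ''

# Assumption: a blank line separates one applicant from the next.
applicants = [part.strip() for part in re.split(r'\n\s*\n', block) if part.strip()]
print(applicants)
```

A line break inside a single applicant (the wrapped Chevron entry) is kept, because only blank lines are treated as separators.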

Answer 2

If what you want is the contents between Applicants: and \nInventors:, your regex should reflect that:

>>> regex = re.compile(r'Applicants: (.*)Inventors:', re.S)
>>> print(regex.findall(s))
['Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n']

re.S is the "dot matches all" option, so our (.*) will also match newlines. Note that this is different from re.MULTILINE, because re.MULTILINE only makes ^ and $ match at the start and end of every line; it doesn't change the fact that . will not match newlines. If . doesn't match newlines, a match like (.*) will still stop at newlines, not achieving the multiline effect you want.
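The difference between the two flags can be checked on a small string. This is only a sketch; `sample` and the variable names are made up for illustration:

```python
import re

sample = "Applicants: A\nB\nInventors: C"

# With MULTILINE alone, '.' still stops at '\n', so the pattern never reaches "Inventors:".
with_multiline = re.findall(r'Applicants:(.*?)Inventors:', sample, re.MULTILINE)
print(with_multiline)   # []

# With DOTALL (re.S), '.' also matches '\n'.
with_dotall = re.findall(r'Applicants:(.*?)Inventors:', sample, re.DOTALL)
print(with_dotall)      # [' A\nB\n']

# MULTILINE changes the anchors: '^' matches at the start of every line.
labels_default = re.findall(r'^\w+:', sample)
labels_multiline = re.findall(r'^\w+:', sample, re.MULTILINE)
print(labels_default)    # ['Applicants:']
print(labels_multiline)  # ['Applicants:', 'Inventors:']
```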

Also note that if you are not interested in Applicants: or Inventors: themselves, you may not want to put them between (), as in (Inventors:), in your regex, because every pair of parentheses creates a capturing group. That's the reason you got 3 elements in your output instead of just 1.
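A short sketch of how the number of capturing groups shapes what findall returns; the single-line `line` value is just a cut-down version of the question's data:

```python
import re

line = "Applicants: Silixa Ltd."

# Three capturing groups: findall returns one 3-tuple per match.
three_groups = re.findall(r'(Applicants:)( )?(.*)', line)
print(three_groups)   # [('Applicants:', ' ', 'Silixa Ltd.')]

# One capturing group: findall returns plain strings.
one_group = re.findall(r'Applicants: ?(.*)', line)
print(one_group)      # ['Silixa Ltd.']

# If parentheses are needed only for grouping, (?:...) does not capture.
non_capturing = re.findall(r'(?:Applicants:) ?(.*)', line)
print(non_capturing)  # ['Silixa Ltd.']
```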

Answer 3

If you want to match all the text between \nApplicants: and \nInventors:, you could also get the match without using re.DOTALL, which prevents unnecessary backtracking.

Match Applicants: and capture in group 1 the rest of that same line and all lines that follow that do not start with Inventors:

Then match Inventors:

^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:

^ Start of a line when re.MULTILINE is set (or use \b if it does not have to be at the start)
Applicants: Match Applicants: and a space literally
( Capture group 1
- .* Match the rest of the line
- (?:\r?\n(?!Inventors:).*)* Match all following lines that do not start with Inventors:
) Close group
\r?\nInventors: Match a newline and Inventors:
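The same breakdown can be kept next to the pattern itself with re.VERBOSE. This is a sketch of one way to write it, not the answer's original code; in verbose mode whitespace in the pattern is ignored, so the literal space is written as `[ ]`, and `sample` is a made-up short input:

```python
import re

pattern = re.compile(r"""
    ^Applicants:[ ]         # label at the start of a line, then one space
    (                       # group 1: the applicant block
        .*                  #   rest of the label line
        (?:                 #   then zero or more further lines,
            \r?\n           #   each introduced by a newline
            (?!Inventors:)  #   and not beginning with the next label
            .*
        )*
    )
    \r?\nInventors:         # newline and label that end the block
""", re.MULTILINE | re.VERBOSE)

sample = "Applicants: A\nB\nInventors: C"
block = pattern.findall(sample)
print(block)  # ['A\nB']
```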


Example code

import re
text = ("PCT Filing Date: 2 December 2015\n"
    "Applicants: Silixa Ltd.\n"
    "Chevron U.S.A. Inc. (Incorporated\n"
    "in USA - California)\n"
    "Inventors: Farhadiroushan,\n"
    "Mahmoud\n"
    "Gillies, Arran\n"
    "Parker, Tom'")
regex = re.compile(r'^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:', re.MULTILINE)
print(regex.findall(text))

Output

['Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)']

Answer 4

Here is a more general approach that parses a string like that into a dict of all the keys and values in it (i.e., any string at the start of a line followed by a : is a key, and the text following that key is its data):

import re 

txt="""\
PCT Filing Date: 2 December 2015
Applicants: Silixa Ltd.
Chevron U.S.A. Inc. (Incorporated
in USA - California)
Inventors: Farhadiroushan,
Mahmoud
Gillies, Arran
Parker, Tom'"""

pat=re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data={m.group(1):m.group(2) for m in pat.finditer(txt)}

Result:

>>> data
{'PCT Filing Date': '2 December 2015\n', 'Applicants': 'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n', 'Inventors': "Farhadiroushan,\nMahmoud\nGillies, Arran\nParker, Tom'"}

>>> data['Applicants']
'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n'
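The values still contain the original line breaks and a trailing newline. If single-line values are preferred, a possible follow-up step (an assumption about what the caller wants, not part of the answer) is to collapse all whitespace runs; `clean` is an illustrative name:

```python
import re

txt = """\
PCT Filing Date: 2 December 2015
Applicants: Silixa Ltd.
Chevron U.S.A. Inc. (Incorporated
in USA - California)
Inventors: Farhadiroushan,
Mahmoud
Gillies, Arran
Parker, Tom'"""

pat = re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data = {m.group(1): m.group(2) for m in pat.finditer(txt)}

# str.split() with no arguments splits on any whitespace run, including newlines,
# so joining with a single space flattens each value onto one line.
clean = {key: " ".join(value.split()) for key, value in data.items()}
print(clean['Applicants'])
```

Flattening also removes the boundary between separate applicants, so this only suits cases where the field is treated as one piece of text.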


4 answers

Answer 1: score 2, accepted, 2020-06-29 15:42:12
Answer 2: score 1, 2020-06-29 15:44:19
Answer 3: score 1, 2020-06-29 16:03:56
Answer 4: score 0, 2020-06-29 16:13:29
