Python regex matching multiline string

my_str:

PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'

My code:

import re

regex = re.compile(r'(Applicants:)( )?(.*)', re.MULTILINE)
print(regex.findall(text))

My output:

[('Applicants:', ' ', 'Silixa Ltd.')]

What I need is to get the string between 'Applicants:' and '\nInventors:':

'Silixa Ltd.' & 'Chevron U.S.A. Inc. (Incorporated
in USA - California)'

Thanks in advance for your help.

Try using re.DOTALL instead:

import re

text='''PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'''

regex = re.compile(r'Applicants:(.*?)Inventors:', re.DOTALL)
print(regex.findall(text))

gives me:

$ python test.py
[' Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n\n']

The reason this works is that MULTILINE doesn't let the dot (.) match newlines, whereas DOTALL does.
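
The difference between the two flags can be checked on a small input. The snippet below is a minimal sketch; the `sample` string and the variable names are made up for illustration and are not part of the original question.

```python
import re

# Hypothetical three-line sample, not the original data.
sample = "Applicants: A\nB\nInventors: C"
pattern = r'Applicants:(.*?)Inventors:'

# No flags: '.' stops at the first newline, so 'Inventors:' is never reached.
default_hits = re.findall(pattern, sample)

# re.MULTILINE only changes ^ and $; '.' still stops at newlines.
multiline_hits = re.findall(pattern, sample, re.MULTILINE)

# re.DOTALL lets '.' match newlines, so the lazy group can span lines.
dotall_hits = re.findall(pattern, sample, re.DOTALL)

# What re.MULTILINE does change: ^ matches at the start of every line.
anchors_default = re.findall(r'^\w+:', sample)
anchors_multiline = re.findall(r'^\w+:', sample, re.MULTILINE)

print(default_hits, multiline_hits, dotall_hits)
print(anchors_default, anchors_multiline)
```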

If what you want is the contents between Applicants: and \nInventors:, your regex should reflect that:

>>> regex = re.compile(r'Applicants: (.*)Inventors:', re.S)
>>> print(regex.findall(s))
['Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n']

re.S is the "dot matches all" option, so our (.*) will also match newlines. Note that this is different from re.MULTILINE: re.MULTILINE only changes ^ and $ so that they match at the start and end of every line, and it doesn't change the fact that . will not match newlines. If . doesn't match newlines, a match like (.*) will still stop at the first newline, not achieving the multiline effect you want.

Also note that if you are not interested in Applicants: or Inventors: themselves, you may not want to put them between (), as in (Inventors:), because every pair of parentheses creates a capturing group. That's the reason you got 3 elements in your output instead of just 1.
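
As a small illustration of the point about groups, here is a sketch using a one-line sample (the variable names are only for this example). `findall` returns one tuple entry per capturing group, so dropping the parentheses or making them non-capturing with `(?:...)` leaves just the text of interest.

```python
import re

line = "Applicants: Silixa Ltd."

# Three capturing groups: findall returns a tuple with three elements.
with_groups = re.findall(r'(Applicants:)( )?(.*)', line)

# Non-capturing groups (?:...) still group, but are not returned.
non_capturing = re.findall(r'(?:Applicants:)(?: )?(.*)', line)

# Simplest form: no parentheses around the parts you do not need.
plain = re.findall(r'Applicants: ?(.*)', line)

print(with_groups)
print(non_capturing)
print(plain)
```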

If you want to match all the text between \nApplicants: and \nInventors:, you could also get the match without using re.DOTALL, which prevents unnecessary backtracking.

Match Applicants: and capture in group 1 the rest of that same line and all following lines that do not start with Inventors:

Then match Inventors:

^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:
  • ^ Start of string, or start of a line when re.MULTILINE is set (or use \b if it does not have to be at the start)
  • Applicants: Match literally, followed by a space
  • ( Capture group 1
    • .* Match the rest of the line
    • (?:\r?\n(?!Inventors:).*)* Match all following lines that do not start with Inventors:
  • ) Close group
  • \r?\nInventors: Match a newline and Inventors:
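
The breakdown above can also be written directly into the pattern with re.VERBOSE. This is only a sketch of the same regex with comments attached; note that in verbose mode whitespace in the pattern is ignored, so the literal space after the label is written as `[ ]`. The variable names are chosen for this example.

```python
import re

text = ("PCT Filing Date: 2 December 2015\n"
        "Applicants: Silixa Ltd.\n"
        "Chevron U.S.A. Inc. (Incorporated\n"
        "in USA - California)\n"
        "Inventors: Farhadiroushan,\n"
        "Mahmoud\n")

pattern = re.compile(r"""
    ^Applicants:[ ]          # label at the start of a line, then one space
    (                        # group 1: the applicants block
        .*                   # rest of the label line
        (?:                  # zero or more further lines
            \r?\n            # a line break
            (?!Inventors:)   # the next line must not start with the end label
            .*               # rest of that line
        )*
    )                        # close group 1
    \r?\nInventors:          # line break followed by the end label
""", re.MULTILINE | re.VERBOSE)

applicants = pattern.findall(text)
print(applicants)
```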

Example code

import re
text = ("PCT Filing Date: 2 December 2015\n"
    "Applicants: Silixa Ltd.\n"
    "Chevron U.S.A. Inc. (Incorporated\n"
    "in USA - California)\n"
    "Inventors: Farhadiroushan,\n"
    "Mahmoud\n"
    "Gillies, Arran\n"
    "Parker, Tom'")
regex = re.compile(r'^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:', re.MULTILINE)
print(regex.findall(text))

Output

['Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)']

Here is a more general approach that parses a string like that into a dict of all the keys and values in it (i.e., any string at the start of a line followed by a : is a key, and the text following that key is its data):

import re 

txt="""\
PCT Filing Date: 2 December 2015
Applicants: Silixa Ltd.
Chevron U.S.A. Inc. (Incorporated
in USA - California)
Inventors: Farhadiroushan,
Mahmoud
Gillies, Arran
Parker, Tom'"""

pat=re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data={m.group(1):m.group(2) for m in pat.finditer(txt)}

Result:

>>> data
{'PCT Filing Date': '2 December 2015\n', 'Applicants': 'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n', 'Inventors': "Farhadiroushan,\nMahmoud\nGillies, Arran\nParker, Tom'"}

>>> data['Applicants']
'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n'
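
The values above keep their trailing newlines. If that is unwanted, one option is to strip keys and values while building the dict. The sketch below wraps the same pattern in a hypothetical helper, `parse_fields`, and runs it on a small made-up input; neither is part of the original answer.

```python
import re

# Same pattern as above; parse_fields is a hypothetical helper name.
PAT = re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)

def parse_fields(txt):
    """Return {key: value} with surrounding whitespace stripped."""
    return {m.group(1).strip(): m.group(2).strip() for m in PAT.finditer(txt)}

# Made-up sample with one wrapped value.
sample = "A: 1\nB: x\ny\nC: z"
fields = parse_fields(sample)
print(fields)
```

One caveat with this approach: any continuation line that itself contains a colon is treated as the start of a new key, so it fits data where colons only appear after labels.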
