Match multiple patterns in a multiline string
I have some data that looks like this:
PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.
PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
      innere2.longen@asklepios.com
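I want to write a regex that matches the text following PMID, TI and AB.
Is it possible to get all of these in a single regex?
I have spent nearly the whole day trying to work one out, and the closest I could get is:
reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M):
    print(i.groupdict())
This returns a match only for the second record in the data, not for all of them.
Any ideas? Thank you!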
How about:
import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print(i.groupdict())
Output:
{'pmid': '19587274', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': None}
{'pmid': '19583148', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}
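Edit
Here it is as a verbose RE to make it easier to understand (I think verbose REs should be used for anything but the simplest of expressions, but that's just my opinion!):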
#!/usr/bin/python
import re
reg4 = re.compile(r'''
^ # Start of a line (due to re.MULTILINE, this may match at the start of any line)
(?: # Non capturing group with multiple options, first option:
PMID-\s # Literal "PMID-" followed by a space
(?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid'
| # Next option:
TI\s{2}-\s # "TI", two spaces, a hyphen and a space
(?P<title>.*?) # The title, a non greedy match that will capture everything up to...
^PG # The characters PG at the start of a line
| # Next option
AB\s{2}-\s # "AB - "
(?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
^AD # "AD" at the start of a line
)
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())
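Note that you could replace ^PG and ^AD with ^\S to make it more general (you want to match everything up to the first non-whitespace character at the start of a line).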
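A minimal sketch of that generalisation, assuming (as in MEDLINE-style exports) that continuation lines are indented and field tags start in column 0. The `sample` string is an abbreviated stand-in for the question's data, and the `generic` and `found` names are only illustrative:

```python
import re

# Abbreviated sample; continuation lines are indented, tags start in column 0.
sample = (
    "PMID- 19583148\n"
    "OWN - NLM\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "PG  - 482-6\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases\n"
    "      characterized by extracellular accumulation of fibrillar proteins\n"
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.\n"
)

generic = re.compile(r'''
    ^
    (?:
        PMID-\s (?P<pmid>[0-9]+)
      |
        TI\s{2}-\s (?P<title>.*?) ^\S      # stop at the next line that is not indented
      |
        AB\s{2}-\s (?P<abstract>.*?) ^\S   # same terminator instead of ^AD
    )
''', re.MULTILINE | re.DOTALL | re.VERBOSE)

# Each match fills exactly one group, so merge the non-empty ones.
found = {}
for m in generic.finditer(sample):
    for name, value in m.groupdict().items():
        if value is not None:
            found[name] = value

print(found)
```

One thing to watch: `^\S` consumes the first character of the following tag, so a wanted field that comes directly after another wanted field would be skipped. Wrapping the terminator in a lookahead, `(?=^\S)`, should avoid that.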
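Edit 2
If you want to grab the whole thing in one regular expression, get rid of the opening (?: and the closing ), and change the | characters to .*?: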
#!/usr/bin/python
import re
reg4 = re.compile(r'''
^ # Start of a line (due to re.MULTILINE, this may match at the start of any line)
PMID-\s # Literal "PMID-" followed by a space
(?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid'
.*? # Next part:
TI\s{2}-\s # "TI", two spaces, a hyphen and a space
(?P<title>.*?) # The title, a non greedy match that will capture everything up to...
^PG # The characters PG at the start of a line
.*? # Next option
AB\s{2}-\s # "AB - "
(?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
^AD # "AD" at the start of a line
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())
This gives:
{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}
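How about not using regular expressions for this task at all, and instead writing ordinary code that splits the text on newlines and checks the prefix codes with .startswith() and the like? The code would be longer that way, but everyone could understand it without having to come to Stack Overflow for help.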
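A rough sketch of that line-by-line approach, assuming every record opens with its PMID line and that continuation lines are indented. The function name `parse_records` and the abbreviated `sample` are made up for illustration:

```python
def parse_records(text, wanted=("PMID", "TI", "AB")):
    """Return one dict per record, keeping only the wanted field tags."""
    records = []
    current = None   # dict for the record being built
    field = None     # tag of the most recent field line
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        if line.startswith("PMID-"):
            current = {}                  # assumption: PMID opens each record
            records.append(current)
        if current is None:
            continue                      # ignore anything before the first PMID
        if line.startswith((" ", "\t")):
            # Indented line: continuation of the previous field, if we kept it.
            if field in current:
                current[field] += " " + line.strip()
            continue
        tag, _, value = line.partition("-")
        field = tag.strip()
        if field in wanted:
            current[field] = value.strip()
    return records


# Abbreviated sample in the question's layout.
sample = "\n".join([
    "PMID- 19587274",
    "OWN - NLM",
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.",
    "PG  - 8675-87",
    "AB  - To successfully interact with objects in the environment, sensory evidence must",
    "      be continuously acquired, interpreted, and used to guide appropriate motor",
    "AD  - Perception and Cognition Laboratory, Department of Psychology, University of",
    "      California, San Diego, La Jolla, California 92093, USA.",
    "PMID- 19583148",
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic",
    "      amyloidosis.",
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by",
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.",
])

records = parse_records(sample)
for record in records:
    print(record)
```

Unlike the regex versions, this joins continuation lines with a single space, so the values come back without embedded newlines or indentation.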
The problem is the greedy quantifiers. Here is a regex that is more specific and non-greedy:
#!/usr/bin/python
import re
from pprint import pprint
with open("testdata.txt") as f:
    data = f.read()
reg4 = r'''
^PMID # Start matching at the string PMID
\s*?- # As little whitespace as possible up to the next '-'
\s*? # As little whitespace as possible
(?P<pmid>[0-9]+) # Capture the field "pmid", accepting only numeric characters
.*?TI # Next, match any character up to the first occurrence of 'TI'
\s*?- # As little whitespace as possible up to the next '-'
\s*? # As little whitespace as possible
(?P<title>.*?)PG # Capture the field <title>, accepting any character up to the next occurrence of 'PG'
.*?AB # Match any character up to the following occurrence of 'AB'
\s*?- # As little whitespace as possible up to the next '-'
\s*? # As little whitespace as possible
(?P<abstract>.*?)AD # Capture the field <abstract>, accepting any character up to the next occurrence of 'AD'
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
    print(78 * "-")
    pprint(i.groupdict())
Output:
------------------------------------------------------------------------------
{'abstract': ' To successfully interact with objects in the environment,
sensory evidence must\n      be continuously acquired, interpreted, and
used to guide appropriate motor\n      responses. For example, when
driving, a red \n',
'pmid': '19587274',
'title': ' Domain general mechanisms of perceptual decision making in
human cortex.\n'}
------------------------------------------------------------------------------
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different
diseases characterized by\n      extracellular accumulation of pathologic
fibrillar proteins in various tissues\n',
'pmid': '19583148',
'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients
with hepatic\n      amyloidosis.\n'}
You may want to strip the whitespace from each field after the scan.
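One possible way to do that clean-up, sketched with a hypothetical helper called `clean`. It goes a little further than a plain strip() by also collapsing the embedded newlines and indentation into single spaces, which may or may not be what you want. The `raw` dict imitates one groupdict() result:

```python
import re

def clean(fields):
    """Collapse runs of whitespace (including newlines) and trim both ends."""
    return {
        name: re.sub(r"\s+", " ", value).strip() if value is not None else None
        for name, value in fields.items()
    }

# Stand-in for one i.groupdict() result from the scan.
raw = {
    "pmid": "19583148",
    "title": " Ursodeoxycholic acid for treatment of cholestasis in patients"
             " with hepatic\n      amyloidosis.\n",
    "abstract": None,
}

cleaned = clean(raw)
print(cleaned)
```

Groups that did not take part in a match are None, so the helper passes those through unchanged.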
Another regex:
reg4 = r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?(?<=AB  - )(?P<abstract>.*?)(?=AD  - )'
If the order of the lines can change, you could use this pattern:
reg4 = re.compile(r"""
^
(?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
| TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
| AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
| .+\n
)+
""", re.MULTILINE | re.VERBOSE)
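It matches consecutive non-empty lines and captures the PMID, TI and AB items.
Item values can span multiple lines, as long as every line after the first starts with a whitespace character.
"[^\S\n]" matches any whitespace character (\s) except a newline (\n).
".* (?:\n[^\S\n].*)*" matches the rest of the line plus any following lines that start with a whitespace character.
".+\n" matches any other non-empty line.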
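A small sketch of how this pattern behaves, under the assumption that records are separated by a blank line (the pattern treats each run of non-empty lines as one record). The `sample` text is invented, with the fields deliberately in a different order in each record:

```python
import re

reg4 = re.compile(r"""
^
(?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
  | TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
  | AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
  | .+\n
)+
""", re.MULTILINE | re.VERBOSE)

# Invented sample: two blank-line-separated records with shuffled field order.
sample = (
    "TI  - First title\n"
    "      continued on a second line\n"
    "PMID- 111\n"
    "AB  - First abstract\n"
    "\n"
    "PMID- 222\n"
    "OWN - NLM\n"
    "AB  - Second abstract\n"
    "TI  - Second title\n"
)

results = [m.groupdict() for m in reg4.finditer(sample)]
for fields in results:
    print(fields)
```

Two caveats. If the records are not separated by blank lines, the whole file becomes a single match and each group keeps only the value from the last line that set it. The last line also needs a trailing newline, because every alternative ends in \n.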