Matching multiple patterns in a multi-line string

Question

I have some data that looks like this:

PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red 
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.

PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
      innere2.longen@asklepios.com

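I want to write a regex that matches the PMID and the sentences that follow TI and AB.

Is it possible to get all of these in a single regex?

I have spent a whole day trying to come up with a regex, and the closest I have got is: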

reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M):
    print(i.groupdict())

This returns a match only in the second "set" of data, rather than in all of them.

Any ideas? Thanks!

Answer 1

How about:

import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print(i.groupdict())

Output:

{'pmid': '19587274', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': None}
{'pmid': '19583148', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}

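Edit

As a verbose RE, to make it more understandable (I think verbose REs should be used for anything but the simplest expressions, but that's just my opinion!):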

#!/usr/bin/python
import re
reg4 = re.compile(r'''
        ^                     # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        (?:                   # Non capturing group with multiple options, first option:
            PMID-\s           # Literal "PMID-" followed by a space
            (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        |                     # Next option:
            TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
            (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
            ^PG               # The characters PG at the start of a line
        |                     # Next option
            AB\s{2}-\s        # "AB  - "
            (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
            ^AD               # "AD" at the start of a line
        )
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

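Note that you could replace ^PG and ^AD with ^\S to make it more general (you want to match everything up to the first non-whitespace character at the start of a line).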
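A minimal sketch of that generalisation, assuming the data keeps the layout shown in the question (tags at the start of a line, continuation lines indented). The names sample, general and found_fields are made up for this illustration and are not part of the answer:

```python
import re

# Shortened copy of the second record from the question.
sample = (
    "PMID- 19583148\n"
    "OWN - NLM\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "PG  - 482-6\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n"
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.\n"
)

general = re.compile(r'''
    ^
    (?:
        PMID-\s (?P<pmid>[0-9]+)
    |
        TI\s{2}-\s (?P<title>.*?) ^\S      # stop at the next line that starts a new tag
    |
        AB\s{2}-\s (?P<abstract>.*?) ^\S   # same terminator, whatever the tag is
    )
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)


def found_fields(text):
    """Return (field name, value) pairs in the order they appear in text."""
    pairs = []
    for match in general.finditer(text):
        for name, value in match.groupdict().items():
            if value is not None:
                pairs.append((name, value))
    return pairs


for name, value in found_fields(sample):
    print(name, repr(value))
```

One thing to watch: ^\S consumes the first character of the following tag, so a wanted field that immediately follows another wanted field would be skipped. Using a lookahead, ^(?=\S), should avoid that, since it checks for the character without consuming it.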

Edit 2

If you want to catch the whole thing in one regex, get rid of the opening (?: and the closing ) and change each | character to .*? :

#!/usr/bin/python
import re
reg4 = re.compile(r'''
        ^                 # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        PMID-\s           # Literal "PMID-" followed by a space
        (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        .*?               # Next part:
        TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
        (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
        ^PG               # The characters PG at the start of a line
        .*?               # Next part:
        AB\s{2}-\s        # "AB  - "
        (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
        ^AD               # "AD" at the start of a line
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

This gives:

{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}

Answer 2

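How about not using a regex for this task at all, and instead writing code that splits on newlines and then checks the prefix codes with .startswith() and the like? The code would be longer that way, but everyone would be able to understand it without having to come to stackoverflow for help.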
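A rough sketch of what that could look like, assuming records are separated by blank lines and continuation lines start with a space, as in the question's data. The PREFIXES table and the parse_records function are hypothetical names chosen for this example:

```python
# Map each wanted line prefix to the key used in the result dict.
PREFIXES = {"PMID- ": "pmid", "TI  - ": "title", "AB  - ": "abstract"}


def parse_records(text):
    """Return one dict per record holding the pmid, title and abstract fields."""
    records = []
    current = {}   # fields collected for the record being read
    field = None   # key of the wanted field the last tagged line opened, if any
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, field = {}, None
        elif line.startswith(" "):
            # Continuation line: append it only if we are inside a wanted field.
            if field is not None:
                current[field] += " " + line.strip()
        else:
            # A new tagged line: remember it only if the tag is one we want.
            field = None
            for prefix, key in PREFIXES.items():
                if line.startswith(prefix):
                    field = key
                    current[key] = line[len(prefix):].strip()
                    break
    if current:
        records.append(current)
    return records


# Abridged copy of the question's data.
sample = """PMID- 19587274
OWN - NLM
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.

PMID- 19583148
OWN - NLM
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
"""

for record in parse_records(sample):
    print(record)
```

Unlike the regex versions, this joins wrapped lines with a single space, so the values come out without embedded newlines or indentation.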

Answer 3

The problem is the greedy quantifiers. Here is a more specific, non-greedy regex:

#!/usr/bin/python
import re
from pprint import pprint
data = open("testdata.txt").read()

reg4 = r'''
   ^PMID               # Start matching at the string PMID
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespace as possible
   (?P<pmid>[0-9]+)    # Capture the field "pmid", accepting only numeric characters
   .*?TI               # Next, match any character up to the first occurrence of 'TI'
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespace as possible
   (?P<title>.*?)PG    # Capture the field "title", accepting any character up to the next occurrence of 'PG'
   .*?AB               # Match any character up to the following occurrence of 'AB'
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespace as possible
   (?P<abstract>.*?)AD # Capture the field "abstract", accepting any character up to the next occurrence of 'AD'
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
   print(78 * "-")
   pprint(i.groupdict())

Output:

------------------------------------------------------------------------------
{'abstract': ' To successfully interact with objects in the environment,
   sensory evidence must\n      be continuously acquired, interpreted, and
   used to guide appropriate motor\n      responses. For example, when
   driving, a red \n',
 'pmid': '19587274',
 'title': ' Domain general mechanisms of perceptual decision making in
    human cortex.\n'}
------------------------------------------------------------------------------
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different
   diseases characterized by\n      extracellular accumulation of pathologic
   fibrillar proteins in various tissues\n',
 'pmid': '19583148',
 'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients
    with hepatic\n      amyloidosis.\n'}

You may want to strip the whitespace from each field after scanning.
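One possible way to do that clean-up, sketched under the assumption that the wrapped lines should also be joined back into one line (a plain .strip() would only trim the ends and leave the embedded newlines and indentation in place). clean_fields is a hypothetical helper name, not something from the answer:

```python
def clean_fields(fields):
    """Trim each value and collapse every run of whitespace to a single space."""
    return {
        name: " ".join(value.split()) if value is not None else None
        for name, value in fields.items()
    }


# A groupdict() as printed in the output above.
raw = {
    "pmid": "19583148",
    "title": " Ursodeoxycholic acid for treatment of cholestasis in patients"
             " with hepatic\n      amyloidosis.\n",
}

print(clean_fields(raw))
```

str.split() with no arguments splits on any whitespace, including newlines, which is why joining the pieces with a space removes both the line breaks and the indentation.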

Answer 4

Another regex:

reg4 = r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?(?<=AB  - )(?P<abstract>.*?)(?=AD  - )'
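The answer does not say how to apply this pattern, so the following is only a sketch of one way to use it. It assumes re.DOTALL is passed, since the .*? parts have to run across line breaks. The sample string is an abridged copy of the question's data, and records is a name made up here:

```python
import re

reg4 = r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?(?<=AB  - )(?P<abstract>.*?)(?=AD  - )'

sample = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "DP  - 2009 Jul 8\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory evidence must\n"
    "      be continuously acquired, interpreted, and used to guide appropriate motor\n"
    "AD  - Perception and Cognition Laboratory, Department of Psychology, University of\n"
    "\n"
    "PMID- 19583148\n"
    "OWN - NLM\n"
    "DP  - 2009 Jun\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "PG  - 482-6\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n"
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.\n"
)

# re.DOTALL lets '.' match newlines, so each '.*?' can span several lines.
records = [m.groupdict() for m in re.finditer(reg4, sample, re.DOTALL)]

for record in records:
    # Each value still carries its trailing newline, so strip before printing.
    print({name: value.strip() for name, value in record.items()})
```

Note that this pattern relies on OWN following PMID and PG following TI, so it is tied to the field order shown in the question.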

Answer 5

If the order of the lines can change, you could use this pattern:

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
     |  TI   \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
     |  AB   \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
     |  .+\n
     )+
""", re.MULTILINE | re.VERBOSE)

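It will match consecutive non-empty lines, and capture the PMID, TI and AB items.

The item values can span multiple lines, as long as the lines after the first begin with a whitespace character.

"[^\S\n]" matches any whitespace character (\s) except newline (\n).
".* (?:\n[^\S\n].*)*" matches consecutive lines that begin with a whitespace character.
".+\n" matches any other non-empty line.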
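To see the order independence in action, here is a small sketch that runs the pattern on lines taken from the question's data but deliberately shuffled, with the AB field left out of the second record. The sample string and the results name are assumptions made for this illustration:

```python
import re

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
     |  TI   \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
     |  AB   \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
     |  .+\n
     )+
""", re.MULTILINE | re.VERBOSE)

# Lines from the question's records, reordered on purpose.
sample = (
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n"
    "      extracellular accumulation of pathologic fibrillar proteins in various tissues\n"
    "OWN - NLM\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "PMID- 19583148\n"
    "\n"
    "PMID- 19587274\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
)

# Each match covers one block of consecutive non-empty lines, i.e. one record.
results = [m.groupdict() for m in reg4.finditer(sample)]

for record in results:
    print(record)
```

A field that is missing from a record comes back as None, and every line has to end with a newline for the pattern to pick it up, so it may be worth appending "\n" to the data if the last line lacks one.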

5 answers

Answer 1
2 accepted 2009-09-01 09:14:15

Answer 2
2 2009-09-01 10:02:20

Answer 3
0 2009-09-01 09:22:03

Answer 4
0 2009-09-01 09:33:01

Answer 5
0 2009-09-01 10:23:10
