Python multiline regex with Org-mode files

Question

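Using regular expressions, I want to extract certain parts from Emacs Org-mode files, which are plain text files. Entries in these Org files start with *, and some entries have properties. A short example is below: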

import re

orgfiletest = """
* headline 0
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline 2
* headline 3
  :PROPERTIES:
  :KEY: lblb
  :END:
"""

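I want to extract all entries that have properties; the extracted entries should include those properties. So I expect to get the following text: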

* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:

and

* headline 3
  :PROPERTIES:
  :KEY: lblb
  :END:

I started with something like this:

re.findall(r"\*.*\s:END:", orgfiletest, re.DOTALL)

But this also includes headline 0 and headline 2, which have no properties. My next attempt was to use lookarounds, but to no avail. Any help is much appreciated!
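To make the over-matching concrete, here is a small check of the first attempt against the sample above. It is only a sketch; the variable name `matches` is introduced here for illustration.

```python
import re

orgfiletest = """
* headline 0
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline 2
* headline 3
  :PROPERTIES:
  :KEY: lblb
  :END:
"""

# With re.DOTALL the dot also matches line breaks, and `.*` is greedy,
# so the match runs from the first `*` to the last `:END:` in the text.
matches = re.findall(r"\*.*\s:END:", orgfiletest, re.DOTALL)

print(len(matches))                            # one single match
print(matches[0].startswith("* headline 0"))   # it starts at the first headline
print("* headline 2" in matches[0])            # and swallows entries without properties
```

The pattern has nothing that stops it at the next headline, which is what the answers below address.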

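Update / solution that works for me:

Thanks to everyone who helped me find a solution! For future reference, here is an updated MWE and the regex that works for me: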

import re
orgfiletest = """
* headline 0
  more text 
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline foo 2
** bar 3
  :PROPERTIES:
  :KEY: lblb
  :FOOBAR: lblb
  :END:
* new headline
  more text
"""

re.findall(r"^\*+ .+[\r\n](?:(?!\*)\s*:.+[\r\n]?)+", orgfiletest, re.MULTILINE)
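For readers who want to see how that pattern is put together, the following sketch spells out the same expression in re.VERBOSE form and runs it on the updated sample. The names `entry_rx` and `entries` are made up for this illustration, and the literal space is written as `[ ]` because verbose mode ignores bare whitespace.

```python
import re

orgfiletest = """
* headline 0
  more text
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline foo 2
** bar 3
  :PROPERTIES:
  :KEY: lblb
  :FOOBAR: lblb
  :END:
* new headline
  more text
"""

# Hypothetical name; same pattern as the one-liner, spread out with comments.
entry_rx = re.compile(r"""
    ^\*+[ ].+[\r\n]      # headline: one or more stars, a space, text, line break
    (?:                  # one property line, repeated at least once:
        (?!\*)           #   it must not begin with a star (a new headline)
        \s*:.+[\r\n]?    #   optional indentation, a colon, the rest of the line
    )+
""", re.VERBOSE | re.MULTILINE)

entries = entry_rx.findall(orgfiletest)
for entry in entries:
    print(entry.splitlines()[0])   # first line of each match is its headline
```

Note that the inner group accepts any line whose first non-blank character is a colon, so the pattern keys on drawer-style lines in general rather than on :PROPERTIES: specifically.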

Answer 1

There are two possibilities, including a non-regex solution.
Since you asked specifically for a regex:

^\*\ headline\ \d+[\r\n] # look for "* headline", digit(s) and a line break
(?:(?!\*).+[\r\n]?)+     # followed by a line that does NOT start with an asterisk
                         # ... anything else on that line, plus its line break
                         # ... at least once

See a demo on regex101.com (mind the x and m modifiers!)

In Python this would be:

import re

rx = re.compile(r'''
    ^\*\ headline\ \d+[\r\n]
    (?:(?!\*).+[\r\n]?)+
    ''', re.VERBOSE | re.MULTILINE)

print(rx.findall(orgfiletest))

A non-regex way could be (using itertools):

from itertools import groupby

result = {}
key = None
for k, v in groupby(orgfiletest.split("\n"),
                    lambda line: line.startswith('* headline')):
    if k:
        # a run of consecutive headline lines: remember the last one
        item = list(v)
        key = item[-1]
    elif key is not None:
        # the non-headline lines that follow belong to that headline
        result[key] = list(v)

print(result)
# {'* headline 1': ['  :PROPERTIES:', '  :KEY: lala', '  :END:'], '* headline 3': ['  :PROPERTIES:', '  :KEY: lblb', '  :END:', '']}

The downside is that lines starting with * headline abc or * headliner*** would be picked up as well. Honestly, I would go with the regex solution here.

Answer 2

I think you can do it this way. It matches only records that contain properties:

(?ms)^\*(?:(?!^\*).)*?PROPERTIES(?:(?!^\*).)*

https://regex101.com/r/oZcos0/1

Explanation

 (?ms)                 # Inline modifiers:  Multi-line, Dot-all
 ^ \*                  # Start record: BOL plus *
 (?:                   # Minimal matching
      (?! ^ \* )            # Not a new record
      . 
 )*?
 PROPERTIES            # Up to prop
 (?:                   # Max matching up to begin new record
      (?! ^ \* )            # Not a new record
      . 
 )*
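As a rough illustration of how the pattern explained above could be used from Python, the sketch below applies it to the sample from the question. The names `record_rx` and `records` are hypothetical; because the pattern carries its flags inline as `(?ms)`, no flags argument is passed.

```python
import re

orgfiletest = """
* headline 0
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline 2
* headline 3
  :PROPERTIES:
  :KEY: lblb
  :END:
"""

# Hypothetical name. A record starts at a line-initial `*` and may not
# cross another line-initial `*`; it must contain the word PROPERTIES.
record_rx = re.compile(r"(?ms)^\*(?:(?!^\*).)*?PROPERTIES(?:(?!^\*).)*")

records = record_rx.findall(orgfiletest)
for record in records:
    print(record.rstrip())   # each match keeps its trailing line break
    print("---")
```

One thing to keep in mind: the pattern looks for the bare word PROPERTIES, so an entry whose body text merely mentions that word would match as well.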

Answer 3

An attempt at a readable regex:

^\*\sheadline(?:(?!^\*\sheadline).)*:END:$

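^\*\sheadline -> an entry is known to start like this.

(?:(?!^\*\sheadline).)* -> matches anything, as long as it does not include what we know to be the start of a new entry.

:END:$ -> it contains the known end statement at the end of a line.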

Working demo.
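The answer does not say which flags the pattern needs. A minimal sketch in Python, assuming both re.MULTILINE (so that ^ and $ work per line) and re.DOTALL (so that the dot can cross line breaks), might look like this; `item_rx` and `items` are illustrative names only.

```python
import re

orgfiletest = """
* headline 0
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline 2
* headline 3
  :PROPERTIES:
  :KEY: lblb
  :END:
"""

# Assumption: MULTILINE and DOTALL are both required; the answer above
# gives only the pattern itself.
item_rx = re.compile(
    r"^\*\sheadline(?:(?!^\*\sheadline).)*:END:$",
    re.MULTILINE | re.DOTALL,
)

items = item_rx.findall(orgfiletest)
for item in items:
    print(item)
    print("---")
```

Because the pattern ends in `:END:$`, each match stops at the closing drawer line and carries no trailing line break.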

3 answers

Answer 1
2 accepted 2017-08-11 18:50:28

Answer 2
1

Answer 3
1 2017-08-11 19:38:29
