Python regex: matching paragraphs that begin with a label

Question

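I am trying to match one or more paragraphs that begin with a letter label. I have been experimenting with DOTALL, lookaheads, MULTILINE and so on, but I cannot get it to work. The string I want to match looks like this: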

      A-B:  Object, procedure:
      - Somethings.
      - More things, might run over several lines like this where the sentence just keeps on going and going and going and sometimes isn't even a sentence.
      - Another line, sometimes not ending with period
      - Variable amount of white space at the beginning of new lines

       Comment (A-B): sometimes, there are comments which are separated by two \n\n characters like this.*

      C.  Second object, other procedure:
      - More lines.
      - Can have various leads (including no ' - ' leading.
      - Variable number of lines.

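The closest I have come is '(.+?\n\n|.+?$)' with DOTALL (I realize this is sloppy), but even that does not work: it misses the cases where a comment or paragraph is separated by more blank lines but still belongs under the heading ([A-Z]?-?[A-Z]).

Ideally, I would like to capture the heading, (A-B:) or (C.), in match.group(1) and the rest of the paragraph up to the next heading in match.group(2), but I would be happy just to capture everything. I tried a lookahead to capture everything between headings, but that misses the last section, which has no heading after it.

I am new to this, so I apologize if this has already been answered or is unclear (I have been searching for the past two days without success). Thanks!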

Answer 1

Here is the solution I would suggest :)

import re

header = []
nonheader = []
with open('./samplestring.txt') as f:
    yourString = f.read()

# Classify each line: after stripping leading whitespace,
# a header starts with a label such as "A-B:" or "C."
for line in yourString.splitlines():
    if re.match(r'(^[A-Z]?-?[A-Z]:)|(^[A-Z]\.)', line.lstrip()):
        header.append(line)
    else:
        nonheader.append(line)

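The two flat lists above lose the link between a header and the lines under it. As a rough sketch of one way to keep that link, using the same header pattern, the lines could be grouped under the most recent header. The names HEADER, group_by_header and SAMPLE are hypothetical, and SAMPLE is a shortened stand-in for the question's text.

```python
import re

# Same header test as in the answer: "A-B:" style or "C." style labels.
HEADER = re.compile(r"[A-Z]?-?[A-Z]:|[A-Z]\.")

# Shortened, hypothetical stand-in for the string in the question.
SAMPLE = (
    "A-B:  Object, procedure:\n"
    "- Somethings.\n"
    "- More things, might run over several lines.\n"
    "  - Variable amount of white space at the beginning of new lines\n"
    "\n"
    " Comment (A-B): sometimes, there are comments.*\n"
    "\n"
    "C.  Second object, other procedure:\n"
    "- More lines.\n"
    "- Variable number of lines.\n"
)


def group_by_header(text):
    """Return a list of (header_line, body_lines) tuples."""
    sections = []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADER.match(stripped):
            # A new header opens a new section.
            sections.append((stripped, []))
        elif stripped and sections:
            # Non-empty, non-header lines (including comments)
            # belong to the most recent header.
            sections[-1][1].append(stripped)
    return sections


sections = group_by_header(SAMPLE)
for header_line, body in sections:
    print(header_line, len(body))
```

Because this works line by line, the last section needs no special handling, and comments stay attached to the header they follow. A body line that happens to start with a capital letter and a period would be treated as a header, so the pattern may need tightening for real data.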
Answer 2

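In the end I gave up on capturing the comments and everything after them. I used the following pattern to capture the letter(s) of each header (group(1)), the header text (group(2)), and the paragraph text without the comments (group(3)).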

([A-Z]{1,2}|[A-Z]-[A-Z])(?::|\.) +(\w.+)\n+((\s*(- *.+))+)

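([A-Z]{1,2}|[A-Z]-[A-Z])(?::|\.) + captures the letter(s) (group 1), then matches the colon or period and the spaces that follow

(\w.+)\n+ captures the header text and matches the line break(s) after it

((\s*(- *.+))+) captures the run of lines that begin with whitespace, a dash, whitespace, and then text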
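As a rough illustration of how the three groups come out, here is a minimal sketch. It assumes the pattern reads as `([A-Z]{1,2}|[A-Z]-[A-Z])(?::|\.) +(\w.+)\n+((\s*(- *.+))+)`; the names HEADER_AND_ITEMS, parse_sections and SAMPLE are hypothetical, and SAMPLE is a shortened stand-in for the question's text.

```python
import re

# Assumed reading of the pattern from this answer.
HEADER_AND_ITEMS = re.compile(
    r"([A-Z]{1,2}|[A-Z]-[A-Z])(?::|\.) +(\w.+)\n+((\s*(- *.+))+)"
)

# Shortened, hypothetical stand-in for the string in the question.
SAMPLE = (
    "A-B:  Object, procedure:\n"
    "- Somethings.\n"
    "- More things, might run over several lines.\n"
    "  - Variable amount of white space at the beginning of new lines\n"
    "\n"
    " Comment (A-B): sometimes, there are comments.*\n"
    "\n"
    "C.  Second object, other procedure:\n"
    "- More lines.\n"
    "- Variable number of lines.\n"
)


def parse_sections(text):
    """Return (label, header_text, item_lines) for every match."""
    sections = []
    for m in HEADER_AND_ITEMS.finditer(text):
        # group(3) holds the whole run of dash lines; split it up.
        items = [ln.strip() for ln in m.group(3).splitlines() if ln.strip()]
        sections.append((m.group(1), m.group(2), items))
    return sections


sections = parse_sections(SAMPLE)
for label, header_text, items in sections:
    print(label, "|", header_text, "|", len(items), "items")
```

Since every body line must start with a dash, the comment line is left out, which matches the stated goal of this answer. The same requirement means body lines without a leading dash would end the match early.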

Thanks for the help! :)

Answer 3

You can use

(^[^\n]+)(?:\n *-.+(?:\n.+)*|\n\n.+\n)+

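(^[^\n]+) - matches the heading line, followed by either
\n *-.+(?:\n.+)* - a non-comment line: optional leading spaces, then a -, possibly running over several lines
\n\n.+\n - or a comment line

(without the DOTALL flag)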

https://regex101.com/r/6kle0u/2

This relies on comment lines always being preceded by \n\n.
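A minimal sketch of how this pattern might be applied in Python. It assumes re.MULTILINE is enabled so that ^ can match at the start of every heading line (the answer itself only says that DOTALL is off). PATTERN and SAMPLE are hypothetical names, and SAMPLE is a shortened stand-in for the question's text.

```python
import re

# MULTILINE is an assumption: without it, ^ only matches at the very
# start of the string and only the first section would be found.
PATTERN = re.compile(
    r"(^[^\n]+)(?:\n *-.+(?:\n.+)*|\n\n.+\n)+",
    re.MULTILINE,
)

# Shortened, hypothetical stand-in for the string in the question.
SAMPLE = (
    "A-B:  Object, procedure:\n"
    "- Somethings.\n"
    "- More things, might run over several lines.\n"
    "  - Variable amount of white space at the beginning of new lines\n"
    "\n"
    " Comment (A-B): sometimes, there are comments.*\n"
    "\n"
    "C.  Second object, other procedure:\n"
    "- More lines.\n"
    "- Variable number of lines.\n"
)

matches = list(PATTERN.finditer(SAMPLE))
headings = [m.group(1) for m in matches]   # heading line of each section
blocks = [m.group(0) for m in matches]     # heading plus everything matched under it

for heading in headings:
    print(heading)
```

Here group(1) is the heading line and group(0) is the whole section, including a comment when it is preceded by a blank line. The last section is matched without any lookahead because the pattern consumes lines forward from each heading instead of searching for the next one.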

3 answers

Answer 1: 0, 2019-01-31 23:04:57
Answer 2: 0, 2019-02-01 20:42:19
Answer 3: -1, accepted, 2019-01-31 22:52:44
