Regex in Python: splitting on the whitespace between two words that start with a capital letter

Question

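In my NLP pipeline, I need to separate titles from body text. A title always consists of a sequence of capitalized words without any punctuation. The title is separated from the body text by two whitespace characters (\n\n).

For example: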

This Is A Title

This is where the body starts.

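I want to use a regex in Python to split the title and the body text at that whitespace, so that the result is: This Is A Title, This is where the body starts.

Can someone help me write the right regex? I tried the following: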

r'(?<=[A-Z][a-z]+)\n\n(?=[A-Z])'

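But then I get an error saying that look-behinds only work with fixed-length strings (even though they are supposed to be allowed to be variable-length).

Thank you very much for helping me!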

Answer 1

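You can match the title followed by 2 newlines and, for the body, match all lines that do not fit the title pattern, using 2 capture groups instead of splitting.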

^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)

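^ Start of string
( Capture group 1
- [A-Z][a-z]* Match an uppercase char and optional lowercase chars (so a single letter such as A also matches)
- (?:[^\S\n]+[A-Z][a-z]*)* Optionally repeat 1+ whitespace chars (excluding newlines) followed by the same pattern as before
) Close group 1
\n\n Match 2 newlines
( Capture group 2
- (?: Non-capture group
  - (?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$) Negative lookahead, assert that the line is not a title pattern
  - .* If the previous assertion is true, match the whole line
  - (?:\n|$) Match either a newline or the end of the string
- )+ Close the non-capture group and repeat it 1 or more times
) Close group 2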
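
To make the breakdown above concrete, here is a minimal sketch that tests the title sub-pattern from capture group 1 on its own with re.fullmatch. The name TITLE and the sample strings are only illustrative, they are not part of the pattern itself.

```python
import re

# The title sub-pattern from capture group 1 (the name TITLE is just for illustration).
TITLE = r"[A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*"

samples = [
    "This Is A Title",                 # every word is capitalized
    "A",                               # a single uppercase letter is enough
    "This is where the body starts.",  # lowercase words and punctuation
]

# fullmatch succeeds only if the whole string fits the pattern.
results = {s: bool(re.fullmatch(TITLE, s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

Only the first two samples fit, which is why the body line is never mistaken for a title.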

See a regex demo and a Python demo.

import re

pattern = r"^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)"

s = ("This Is A Title\n\n"
    "This is where the body starts.\n\n"
    "And this is more body.")
    
print(re.findall(pattern, s))

Output

[('This Is A Title', 'This is where the body starts.\n\nAnd this is more body.')]
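
As for the error in the question: Python's built-in re module only accepts fixed-width look-behinds, so the + inside (?<=...) is rejected when the pattern is compiled. The sketch below reproduces that and then shows one possible workaround, assuming it is enough for your data to check a single letter before the two newlines. The one-character look-behind is my substitution, not the pattern from this answer, and it does not verify that the whole first line is a title.

```python
import re

s = "This Is A Title\n\nThis is where the body starts."

# The pattern from the question: the look-behind contains `+`, so its width varies.
try:
    re.compile(r'(?<=[A-Z][a-z]+)\n\n(?=[A-Z])')
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)

# Assumed workaround: a one-character (fixed-width) look-behind.
# It only checks that a letter precedes the two newlines.
parts = re.split(r'(?<=[A-Za-z])\n\n(?=[A-Z])', s, maxsplit=1)
print(parts)
```

If you really need a variable-length look-behind, the third-party regex module supports them, but for this task the capture-group approach above avoids the issue altogether.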

Answer 2

Suppose you have the following text:

txt='''\
This Is A Title

This is where the body starts.
more body

Not a title -- body!

This Is Another Title

This is where the body starts.

The End
'''

You can use this regex to separate the titles (as you defined them) from the body text:

import re
pat=r"((?=^(?:[A-Z][a-z]*[ \t]*)+$).*(?:\n\n|\n?\Z))|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

>>> re.findall(pat, txt, flags=re.M)
[('This Is A Title\n\n', ''), ('', 'This is where the body starts.\nmore body\n\nNot a title -- body!\n\n'), ('This Is Another Title\n\n', ''), ('', 'This is where the body starts.\n\n'), ('The End\n', '')]
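
Since findall returns one tuple per match with only one of the two groups filled, you may want to fold the result into title/body pairs. A minimal sketch of that post-processing, assuming a list of [title, body] pairs is the shape you want (the sections variable and its layout are hypothetical):

```python
import re

pat = r"((?=^(?:[A-Z][a-z]*[ \t]*)+$).*(?:\n\n|\n?\Z))|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

txt = (
    "This Is A Title\n\n"
    "This is where the body starts.\nmore body\n\n"
    "Not a title -- body!\n\n"
    "This Is Another Title\n\n"
    "This is where the body starts.\n\n"
    "The End\n"
)

sections = []  # list of [title, body] pairs (hypothetical layout)
for title, body in re.findall(pat, txt, flags=re.M):
    if title:
        # Group 1 matched: start a new section with an empty body.
        sections.append([title.strip(), ""])
    elif sections:
        # Group 2 matched: attach the body to the most recent title.
        # Body text that appears before any title is skipped.
        sections[-1][1] = body.strip()

for title, body in sections:
    print(repr(title), "->", repr(body))
```

Note that the second alternative only matches body text that is followed by another title line, so as far as I can tell a body at the very end of the input would not be captured by this pattern.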

As The fourth bird helpfully pointed out in the comments, the first lookahead can be eliminated:

(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))
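
A quick sketch of what changes with this variant, using a shortened sample text of my own: because the trailing newlines now sit outside group 1, the captured titles come back without them.

```python
import re

pat = r"(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

# Shortened sample text (an assumption, not the text from the answer).
txt = "This Is A Title\n\nThis is where the body starts.\n\nThe End\n"

result = re.findall(pat, txt, flags=re.M)
print(result)
```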

Demo

Answer 1
2 2021-12-10 14:35:45

Answer 2
1 2021-12-10 15:22:43
