Regex in Python: splitting on whitespace between two words that start with a capital letter

Question

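In my NLP pipeline, I need to separate titles from body text. A title always consists of a sequence of capitalized words without any punctuation. The title is separated from the body by two whitespace characters, \n\n.

For example: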

This Is A Title

This is where the body starts.

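I would like to use a regex in Python to split the title and the body text at that whitespace, so that the result is: This Is A Title, This is where the body starts.

Can anybody help me write the correct regex? I tried the following: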

r'(?<=[A-Z][a-z]+)\n\n(?=[A-Z])'

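But then I got an error saying that lookbehinds only work with fixed-length patterns (I had thought they were allowed to be variable-length).

Thanks a lot for helping me out!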
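The error described above can be reproduced with a short sketch. Python's built-in re module rejects a lookbehind whose width can vary, which is the case for [a-z]+ here. The one-character lookbehind in fixed_width is an assumed workaround for illustration, not something from the original post, and it only checks the single character before the newlines rather than the whole title.

```python
import re

# The pattern from the question: the lookbehind contains `+`, so its width varies.
variable_width = r'(?<=[A-Z][a-z]+)\n\n(?=[A-Z])'

try:
    re.compile(variable_width)
    compile_error = None
except re.error as exc:
    # The stdlib `re` module refuses variable-width lookbehinds at compile time.
    compile_error = str(exc)

# Assumed workaround: a lookbehind of exactly one character has a fixed width.
fixed_width = r'(?<=[A-Za-z])\n\n(?=[A-Z])'

text = "This Is A Title\n\nThis is where the body starts."
parts = re.split(fixed_width, text, maxsplit=1)

print(compile_error)
print(parts)  # ['This Is A Title', 'This is where the body starts.']
```

Because this check is much weaker than validating the full title, it would also split between two ordinary body paragraphs; maxsplit=1 limits the damage to the first separator.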

Answer 1

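You could match the title followed by 2 newlines and, for the body, match all lines that do not follow the title pattern, using 2 capture groups instead of splitting.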

^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)

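^ Start of string
( Capture group 1
- [A-Z][a-z]* Match an uppercase char followed by optional lowercase chars, which also matches a single letter such as A
- (?:[^\S\n]+[A-Z][a-z]*)* Optionally repeat 1+ whitespace chars (excluding newlines) followed by the same pattern as before
) Close group 1
\n\n Match 2 newlines
( Capture group 2
- (?: Non-capture group
  - (?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$) Negative lookahead, assert that the line is not a title pattern
  - .* If the previous assertion is true, match the whole line
  - (?:\n|$) Match either a newline or the end of the string
- )+ Close the non-capture group and repeat it 1 or more times
) Close group 2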
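The breakdown above can also be written into the pattern itself with re.VERBOSE, which ignores whitespace outside character classes and allows # comments. This is only a sketch of the same expression; the names title_body, title and body are chosen here for illustration.

```python
import re

# Same expression as above, laid out with re.VERBOSE so each part is annotated.
title_body = re.compile(r"""
    ^                               # start of string
    (                               # group 1: the title
      [A-Z][a-z]*                   #   one capitalized word (a lone capital also matches)
      (?:[^\S\n]+[A-Z][a-z]*)*      #   further capitalized words after non-newline whitespace
    )
    \n\n                            # the two newlines between title and body
    (                               # group 2: the body
      (?:
        (?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$)  # the line must not look like a title
        .*                          #   the whole line
        (?:\n|$)                    #   a newline or the end of the string
      )+
    )
""", re.VERBOSE)

m = title_body.match("This Is A Title\n\nThis is where the body starts.")
title, body = m.groups()
print(title)  # This Is A Title
print(body)   # This is where the body starts.
```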

See a regex demo and a Python demo.

import re

pattern = r"^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)"

s = ("This Is A Title\n\n"
    "This is where the body starts.\n\n"
    "And this is more body.")
    
print(re.findall(pattern, s))

Output

[('This Is A Title', 'This is where the body starts.\n\nAnd this is more body.')]

Answer 2

Suppose you have the following text:

txt='''\
This Is A Title

This is where the body starts.
more body

Not a title -- body!

This Is Another Title

This is where the body starts.

The End
'''

You can use this regex to separate the titles (as you have defined them) from the body:

import re
pat=r"((?=^(?:[A-Z][a-z]*[ \t]*)+$).*(?:\n\n|\n?\Z))|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

>>> re.findall(pat, txt, flags=re.M)
[('This Is A Title\n\n', ''), ('', 'This is where the body starts.\nmore body\n\nNot a title -- body!\n\n'), ('This Is Another Title\n\n', ''), ('', 'This is where the body starts.\n\n'), ('The End\n', '')]
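Each tuple returned by findall has one filled slot and one empty one, so some post-processing is usually needed. A minimal sketch, assuming a shorter sample text and an illustrative sections list (neither is from the original answer), uses finditer to label every piece and strip the trailing newlines:

```python
import re

pat = r"((?=^(?:[A-Z][a-z]*[ \t]*)+$).*(?:\n\n|\n?\Z))|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

# Shortened sample text, assumed here for illustration.
sample = "This Is A Title\n\nThis is where the body starts.\n\nThe End\n"

sections = []
for m in re.finditer(pat, sample, flags=re.M):
    title, body = m.groups()  # the group that did not take part in the match is None
    if title is not None:
        sections.append(("title", title.strip()))
    elif body:
        sections.append(("body", body.strip()))

print(sections)
# [('title', 'This Is A Title'), ('body', 'This is where the body starts.'), ('title', 'The End')]
```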

As The fourth bird helpfully pointed out in the comments, the first lookahead can be eliminated:

(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))
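A quick way to try the lookahead-free version is sketched below. The shortened sample text and the variable names are assumptions made for illustration. One visible difference is that group 1 now ends at the end of the title line, so the captured titles should no longer carry their trailing newlines.

```python
import re

simplified = r"(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

# Shortened sample text, assumed here for illustration.
sample = "This Is A Title\n\nThis is where the body starts.\n\nThe End\n"

# re.M is still required so that ^ and $ anchor at line boundaries.
result = re.findall(simplified, sample, flags=re.M)
print(result)
# [('This Is A Title', ''), ('', 'This is where the body starts.\n\n'), ('The End', '')]
```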

Demo

Answer 1
2 2021-12-10 14:35:45

Answer 2
1 2021-12-10 15:22:43