Regex in Python: splitting on whitespace character in between two words that start with a capital letter

In my NLP pipeline, I need to split titles from body text. Titles always consist of a sequence of capitalized words without any punctuation. The titles are separated from the body text by two newline characters (\n\n).

For example:

This Is A Title

This is where the body starts.

I want to split the title and body text on the whitespace using a regex in Python, such that the result is: This Is A Title, This is where the body starts.

Can anybody help me write the right regex? I tried the following:

r'(?<=[A-Z][a-z]+)\n\n(?=[A-Z])'

but then I got the error that lookbehinds only work with fixed-width patterns (and I need this one to be variable-width).

Many thanks for helping me out!

You can match the title followed by 2 newlines, and for the body match all lines that do not match the title pattern, using 2 capture groups instead of splitting.

^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)
  • ^ Start of string
  • ( Capture group 1
    • [A-Z][a-z]* Match an uppercase char and optional lowercase chars, so that a single letter such as A also matches
    • (?:[^\S\n]+[A-Z][a-z]*)* Optionally repeat 1+ whitespace chars other than a newline, followed by the same pattern as before
  • ) Close group
  • \n\n Match 2 newlines
  • ( Capture group 2
    • (?: Non capture group
      • (?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$) Negative lookahead, assert that the line is not a title pattern
      • .* If the previous assertion is true, match the whole line
      • (?:\n|$) Match either a newline or the end of the string
    • )+ Close the non capture group and repeat 1 or more times
  • ) Close group 2
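
The breakdown above can also be written into the pattern itself with re.VERBOSE, which ignores whitespace outside character classes and allows # comments. This is only a sketch of the same pattern laid out differently; the name TITLE_BODY and the sample string are assumptions made for illustration.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# TITLE_BODY is a hypothetical name chosen for this sketch.
TITLE_BODY = re.compile(
    r"""
    ^                                   # start of string
    (                                   # group 1: the title
      [A-Z][a-z]*                       #   a capitalized word, or a lone capital such as A
      (?:[^\S\n]+[A-Z][a-z]*)*          #   further capitalized words on the same line
    )
    \n\n                                # two newlines between title and body
    (                                   # group 2: the body
      (?:
        (?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$)  # the line must not look like a title
        .*                              #   match the whole line
        (?:\n|$)                        #   then a newline or the end of the string
      )+
    )
    """,
    re.VERBOSE,
)

m = TITLE_BODY.match("This Is A Title\n\nThis is where the body starts.")
title, body = m.groups()
print(title)
print(body)
```

Because the whitespace class [^\S\n] contains no literal space, nothing in the pattern needs escaping for verbose mode.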

See a regex demo and a Python demo.

import re

pattern = r"^([A-Z][a-z]*(?:[^\S\n]+[A-Z][a-z]*)*)\n\n((?:(?![A-Z][a-z]+(?:[^\S\n]+[A-Z][a-z]*)*$).*(?:\n|$))+)"

s = ("This Is A Title\n\n"
    "This is where the body starts.\n\n"
    "And this is more body.")
    
print(re.findall(pattern, s))

Output

[('This Is A Title', 'This is where the body starts.\n\nAnd this is more body.')]
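
For context on why capture groups are used here rather than a split: Python's built-in re module only accepts lookbehinds of fixed width, and the lookbehind in the question contains a + quantifier. A minimal sketch that reproduces the error (the variable names are assumptions for illustration):

```python
import re

# The pattern from the question: the lookbehind contains `+`, so its width varies.
question_pattern = r"(?<=[A-Z][a-z]+)\n\n(?=[A-Z])"

try:
    re.compile(question_pattern)
    compiled = True
except re.error as exc:
    # The built-in re module rejects variable-width lookbehinds at compile time.
    compiled = False
    print("re.error:", exc)

print("compiled:", compiled)
```

Matching the title and the body as two groups avoids the lookbehind entirely, so the restriction never comes into play.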

Suppose you have this text:

txt='''\
This Is A Title

This is where the body starts.
more body

Not a title -- body!

This Is Another Title

This is where the body starts.

The End
'''

You can use this regex to separate titles (as you have defined them) from the body:

import re
pat=r"((?=^(?:[A-Z][a-z]*[ \t]*)+$).*(?:\n\n|\n?\Z))|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

>>> re.findall(pat, txt, flags=re.M)
[('This Is A Title\n\n', ''), ('', 'This is where the body starts.\nmore body\n\nNot a title -- body!\n\n'), ('This Is Another Title\n\n', ''), ('', 'This is where the body starts.\n\n'), ('The End\n', '')]
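
Each tuple in that result has exactly one non-empty element, so some post-processing is needed to get (title, body) pairs. One possible sketch, assuming the same txt and pat as above; the helper name pair_titles_with_bodies is hypothetical and not part of the answer:

```python
import re

txt = '''\
This Is A Title

This is where the body starts.
more body

Not a title -- body!

This Is Another Title

This is where the body starts.

The End
'''

pat = r"((?=^(?:[A-Z][a-z]*[ \t]*)+$).*(?:\n\n|\n?\Z))|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

def pair_titles_with_bodies(text):
    """Attach each body match to the title match that precedes it."""
    sections = []
    for title, body in re.findall(pat, text, flags=re.M):
        if title:
            # A title starts a new section with an empty body.
            sections.append([title.strip(), ""])
        elif body and sections:
            # A body belongs to the most recent title.
            sections[-1][1] = body.strip()
    return [tuple(section) for section in sections]

sections = pair_titles_with_bodies(txt)
for title, body in sections:
    print(repr(title), repr(body))
```

Note that the body alternative only matches text that is followed by a title line, so with this pattern a body at the very end of the input, with no title after it, would not be captured.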

As The fourth bird helpfully points out in the comments, the first lookahead can be eliminated:

(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))
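
A side effect of this version is that group 1 now ends at the end of the title line, so the trailing newlines are matched but no longer captured. A small sketch, assuming a shortened sample text and the hypothetical name simplified:

```python
import re

# The simplified pattern from above; `simplified` is a name chosen for this sketch.
simplified = r"(^(?:[A-Z][a-z]*[ \t]*)+$)(?:\n\n|\n*\Z)|([\s\S]*?(?=^(?:[A-Z][a-z]*[ \t]*)+$))"

sample = (
    "This Is A Title\n\n"
    "This is where the body starts.\n\n"
    "This Is Another Title\n\n"
    "This is where the body starts.\n\n"
    "The End\n"
)

# Group 1 holds a title whenever the first alternative matched.
titles = [m.group(1) for m in re.finditer(simplified, sample, flags=re.M) if m.group(1)]
print(titles)
```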
