
Regex extract header and text separately from a paragraph

I want to separate each header from its corresponding text in a paragraph, where the two are delimited by a colon.

Example paragraph: "INCIDENTS: Quick fox ran over. A plane drove the head. RESULT AND CONCLUSION: I got headache, and fever"

Output I expect: [('INCIDENTS', 'Quick fox ran over. A plane drove the head'), ('RESULT AND CONCLUSION', 'I got headache, and fever')]

I am using Python and tried re.findall(r'([A-Z]+:)(.*?)\.', <paragraph>), but I did not get the expected output.

Any help is appreciated.

You can use

re.findall(r'\b([A-Z]+(?:\s+[A-Z]+)*):\s*(.*?)(?=\s*\b(?:[A-Z]+(?:\s+[A-Z]+)*):|$)', text)

See the regex demo.

Details
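As a rough, runnable sketch of the call above, the snippet below applies it to the sample paragraph from the question and compares it with the original attempt. The names `text`, `HEADER`, `pattern`, `attempt`, `pairs`, and `cleaned` are illustrative only and do not come from the question or the answer.

```python
import re

# Sample paragraph from the question.
text = ("INCIDENTS: Quick fox ran over. A plane drove the head. "
        "RESULT AND CONCLUSION: I got headache, and fever")

# The original attempt: headers may not contain spaces, and the body
# stops at the first period, so only part of the first section is found.
attempt = re.findall(r"([A-Z]+:)(.*?)\.", text)
print(attempt)  # [('INCIDENTS:', ' Quick fox ran over')]

# The suggested pattern, with the header sub-pattern factored out
# (HEADER is a helper name introduced here for readability).
HEADER = r"[A-Z]+(?:\s+[A-Z]+)*"
pattern = re.compile(rf"\b({HEADER}):\s*(.*?)(?=\s*\b(?:{HEADER}):|$)")

pairs = pattern.findall(text)
print(pairs)
# [('INCIDENTS', 'Quick fox ran over. A plane drove the head.'),
#  ('RESULT AND CONCLUSION', 'I got headache, and fever')]

# The body keeps its final period; stripping it is an optional extra step
# if the output should match the question's expected result exactly.
cleaned = [(header, body.rstrip(".")) for header, body in pairs]
print(cleaned)
```

Note that the first body still ends with a period, because the lazy group runs up to the whitespace before the next header. The `rstrip(".")` step is one possible way to drop it, not part of the suggested pattern.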

  • \b - a word boundary
  • ([A-Z]+(?:\s+[A-Z]+)*) - Group 1: an uppercase word followed by zero or more whitespace-separated uppercase words
  • : - a colon
  • \s* - zero or more whitespace characters
  • (.*?) - Group 2: any zero or more characters other than line breaks, as few as possible
  • (?=\s*\b(?:[A-Z]+(?:\s+[A-Z]+)*):|$) - a positive lookahead that stops the match right before either zero or more whitespace characters, a word boundary, an uppercase word followed by zero or more whitespace-separated uppercase words and a colon (that is, the next header), or the end of the string.
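The breakdown above can also be written directly into the pattern with `re.VERBOSE`, which ignores whitespace in the pattern and allows inline comments. This is only a sketch of the same regex in a different layout; `SECTION_RE` and `split_sections` are hypothetical names chosen for this example.

```python
import re

# Same pattern as in the answer, laid out with re.VERBOSE so that each
# part carries its own comment.
SECTION_RE = re.compile(r"""
    \b                               # word boundary
    ( [A-Z]+ (?: \s+ [A-Z]+ )* )     # group 1: one or more uppercase words
    :                                # the colon delimiter
    \s*                              # optional whitespace after the colon
    ( .*? )                          # group 2: body text, as short as possible
    (?=                              # lookahead: stop right before ...
        \s* \b (?: [A-Z]+ (?: \s+ [A-Z]+ )* ) :   # ... the next header
      | $                                          # ... or the end of the string
    )
""", re.VERBOSE)


def split_sections(paragraph):
    """Return a list of (header, body) tuples found in the paragraph."""
    return SECTION_RE.findall(paragraph)


sample = ("INCIDENTS: Quick fox ran over. A plane drove the head. "
          "RESULT AND CONCLUSION: I got headache, and fever")
for header, body in split_sections(sample):
    print(f"{header!r} -> {body!r}")
```

If section bodies may span several lines, passing `re.VERBOSE | re.DOTALL` is one option to consider, since `.` does not match line breaks by default. That variant goes beyond what the answer covers.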


 