Regex extract header and text separately from a paragraph
I want to separate each header from its corresponding text in a paragraph, where the two are delimited by a colon.
Example paragraph: "INCIDENTS: Quick fox ran over. A plane drove the head. RESULT AND CONCLUSION: I got headache, and fever"
Expected output: [('INCIDENTS', 'Quick fox ran over. A plane drove the head'), ('RESULT AND CONCLUSION', 'I got headache, and fever')]
I am using Python and tried re.findall(r'([A-Z]+:)(.*?)\.', <paragraph>), but I did not get the expected output.
Any help is appreciated.
You can use
re.findall(r'\b([A-Z]+(?:\s+[A-Z]+)*):\s*(.*?)(?=\s*\b(?:[A-Z]+(?:\s+[A-Z]+)*):|$)', text)
See the regex demo.
Details
- \b - a word boundary
- ([A-Z]+(?:\s+[A-Z]+)*) - Group 1: an uppercase word followed by zero or more whitespace-separated uppercase words
- : - a colon
- \s* - zero or more whitespace characters
- (.*?) - Group 2: zero or more characters other than line breaks, as few as possible
- (?=\s*\b(?:[A-Z]+(?:\s+[A-Z]+)*):|$) - a positive lookahead that stops Group 2 right before zero or more whitespace characters, a word boundary, an uppercase word, zero or more whitespace-separated uppercase words and a colon (that is, the next header), or at the end of the string.
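To see how the pieces described above fit together, here is a minimal sketch that runs the pattern on the example paragraph from the question. The names HEADER_RE and split_sections are hypothetical and only used for illustration. Note that Group 2 keeps the final period of the first section, because the lookahead only stops before the whitespace that precedes the next header; the helper assumes you want that trailing period removed, as in the expected output.

```python
import re

# Pattern from the answer, compiled once so it can be reused.
HEADER_RE = re.compile(
    r'\b([A-Z]+(?:\s+[A-Z]+)*):\s*(.*?)(?=\s*\b(?:[A-Z]+(?:\s+[A-Z]+)*):|$)'
)

def split_sections(text):
    """Return (header, body) pairs; hypothetical helper, not part of `re`."""
    # rstrip('.') drops the sentence-final period that Group 2 keeps.
    return [(header, body.rstrip('.')) for header, body in HEADER_RE.findall(text)]

text = ("INCIDENTS: Quick fox ran over. A plane drove the head. "
        "RESULT AND CONCLUSION: I got headache, and fever")

# Raw matches: the first body still ends with a period.
print(HEADER_RE.findall(text))
# Cleaned matches: trailing periods removed.
print(split_sections(text))
```

If the paragraph spans several lines, passing re.DOTALL when compiling is probably needed, since . does not match line breaks by default.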