
Python regex - Extract all the matching text between two patterns

I want to extract all the text in the bullet points numbered 1.1, 1.2, 1.3, etc. Sometimes the bullet numbers contain spaces, like 1. 1, 1. 2, 1.3, 1. 4.

Sample text:

    text = "some text before pattern 1.1 text_1_here  1.2 text_2_here  1 . 3 text_3_here  1. 4 text_4_here  1 .5 text_5_here 1.10 last_text_here 1.23 text after pattern"

For the text above, the output should be [' text_1_here ', ' text_2_here ', ' text_3_here ', ' text_4_here ', ' text_5_here ', ' last_text_here '].

I tried re.findall but I'm not getting the required output. It identifies and extracts the text between 1.1 & 1.2 and then between 1.3 & 1.4, but it skips the text between 1.2 & 1.3.

    import re
    re.findall(r'[0-9].\s?[0-9]+(.*?)[0-9].\s?[0-9]+', text)
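
One likely reason for the skipped items: `re.findall` returns non-overlapping matches, and the pattern above consumes the closing bullet number, so that bullet can no longer open the next item. The unescaped `.` also matches any character, and the pattern does not allow a space before the dot (as in `1 . 3`). A minimal sketch of a `findall` variant that avoids both issues by placing the closing bullet in a lookahead; the names `bullet`, `pattern`, and `items` are illustrative and not part of the original post:

```python
import re

text = "some text before pattern 1.1 text_1_here  1.2 text_2_here  1 . 3 text_3_here  1. 4 text_4_here  1 .5 text_5_here 1.10 last_text_here 1.23 text after pattern"

# A bullet number: digits, a literal dot, digits, with optional spaces around the dot.
# (Assumed shape, inferred from the sample text.)
bullet = r'\d+\s*\.\s*\d+'

# Capture the text after one bullet, up to (but not including) the next bullet.
# The lookahead (?=...) checks for the next bullet without consuming it,
# so that bullet is still available to start the following match.
pattern = rf'{bullet}\s+(.*?)\s+(?={bullet})'

items = re.findall(pattern, text)
print(items)
```

Because the final bullet (1.23) is not followed by another bullet, the text after it is never captured, which matches the requirement to leave out the trailing text.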

I'm unsure about the exact rule for why you'd want to exclude the last bit of text, but based on your comments it seems we could also just split the entire text on the bullets and simply exclude the first and last elements from the resulting list:

    re.split(r'\s+\d(?:\s*\.\s*\d+)+\s+', text)[1:-1]

Which would output:

    ['text_1_here', 'text_2_here', 'text_3_here', 'text_4_here', 'text_5_here', 'last_text_here']
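
For readability, the same split can be written with `re.VERBOSE` so each part of the pattern carries a comment. This is only a sketch: the names `BULLET_SPLIT` and `extract_items` are hypothetical, and the `[1:-1]` slice assumes the input always has some text before the first bullet and after the last one, as in the sample.

```python
import re

# Same pattern as the one-liner above, spelled out piece by piece.
BULLET_SPLIT = re.compile(r"""
    \s+                 # whitespace before the bullet number
    \d                  # leading digit of the bullet number
    (?:\s*\.\s*\d+)+    # one or more ".N" parts, with optional spaces around the dot
    \s+                 # whitespace after the bullet number
""", re.VERBOSE)

def extract_items(text):
    """Split on bullet numbers and drop the text before the first
    bullet and after the last one (assumes both are present)."""
    return BULLET_SPLIT.split(text)[1:-1]

text = "some text before pattern 1.1 text_1_here  1.2 text_2_here  1 . 3 text_3_here  1. 4 text_4_here  1 .5 text_5_here 1.10 last_text_here 1.23 text after pattern"
print(extract_items(text))
```

If the input could start or end directly with a bullet, the slice would need adjusting, since the first or last element would then be an empty string or a wanted item rather than surrounding text.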
