
Extract matching groups from string python regex

I am trying to extract matching groups from a string in Python, but I am running into problems.

The string looks like this.

1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc

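I need anything that starts with a number followed by capital letters to be treated as a title, and I want to extract the content that belongs to that title.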

This is the output I expect.

1. TITLE ABC Contents of title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc

I tried the following regex

(\d\.\s[A-Z\s]*\s)

and got the following.

1. TITLE ABC 
2. TITLE BCD 
3. TITLE CDC

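If I add .* to the end of the regex, the matching groups are affected. I think I am missing something simple here. I have tried everything I know but could not solve it.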

Thanks for your help.
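To make the symptom described in the question concrete, here is a minimal sketch of what happens when `.*` or `.*?` is appended to the original pattern. The single-line `text` and the variable names are illustrative assumptions, not part of the original post.

```python
import re

# Single-line version of the sample text from the question (assumed for this sketch).
text = ("1. TITLE ABC Contents of title ABC and some other text "
        "2. TITLE BCD This would have contents on title BCD and maybe something else "
        "3. TITLE CDC Contents of title cdc")

# Original attempt: stops after the run of capital letters and whitespace.
titles_only = re.findall(r'(\d\.\s[A-Z\s]*\s)', text)

# Greedy .* runs to the end of the line, so the first match swallows every later section.
greedy = re.findall(r'(\d\.\s[A-Z\s]*\s.*)', text)

# Lazy .*? is satisfied by matching nothing, so it adds no content at all.
lazy = re.findall(r'(\d\.\s[A-Z\s]*\s.*?)', text)

print(titles_only)          # the three bare titles
print(len(greedy))          # a single match covering the whole string
print(lazy == titles_only)  # the lazy version changes nothing
```

Neither variant knows where a section should end, which is why the answers rely on a lookahead to mark the stopping point.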

Use (\d+\.[\da-z]* [A-Z]+[\S\s]*?(?=\d+\.|$))

Here is the relevant code:

import re

text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

# Adjacent raw string literals are concatenated into one pattern.
result = re.findall(r'('
                    r'\d+\.'   # Match a number and a '.' character
                    r'[\da-z]*' # If present include any additional numbers/letters
                    r'(?:\.[\da-z])*' # Match additional subpoints.
                                      # Each of these subpoints must start with a '.'
                                      # And then have any combination of numbers/letters
                    r' '   # Match a space. This is how we know to stop looking for subpoints,
                           # and to start looking for capital letters
                    r'[A-Z]+'  # Match at least one capital letter.
                               # Use [A-Z]{2,} to match 2 or more capital letters
                    r'[\S\s]*?'  # Match everything including newlines.
                                 # Use .*? if you don't care about matching newlines
                    r'(?=\d+\.|$)'  # Stop matching at a number and a '.' character,
                                    # or stop matching at the end of the string,
                                    # and don't include this match in the results.
                    r')'
                    , text)
print(result)
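The commented pattern above can also be written as one verbose-mode regex. The following is only a sketch of that alternative: the use of `re.VERBOSE`, the `[ ]` spelling for the literal space (plain whitespace is ignored in verbose mode), and the names `pattern` and `sections` are choices made here, not part of the original answer.

```python
import re

text = ("1. TITLE ABC Contents of title ABC and some other text "
        "2. TITLE BCD This would have contents on\n"
        "title BCD and maybe something else "
        "3. TITLE CDC Contents of title cdc")

pattern = re.compile(r"""
    (
        \d+\.            # a number followed by a '.'
        [\da-z]*         # optional extra digits/letters
        (?:\.[\da-z])*   # optional subpoints, each starting with a '.'
        [ ]              # a literal space (bare spaces are ignored in verbose mode)
        [A-Z]+           # at least one capital letter
        [\S\s]*?         # everything, including newlines, as little as possible
        (?=\d+\.|$)      # stop before the next "number." or at the end of the string
    )
""", re.VERBOSE)

sections = pattern.findall(text)
for section in sections:
    print(repr(section))
```

Verbose mode keeps the comments inside the pattern itself, at the cost of having to write literal spaces explicitly.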


Your regex is missing lowercase letters in the character class, so it only matches the uppercase words.

You can simply use this:

(\d\.[\s\S]+?)(?=\d+\.|$)


Sample code

import re
text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""
result = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)
print(result)

Output

['1. TITLE ABC Contents of 14 title ABC and some other text ', '2. TITLE BCD This would have contents on \ntitle BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
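One caveat worth illustrating: the lookahead `(?=\d+\.|$)` stops at any digit followed by a dot, so content such as "3.5" or "2.0" would cut a section short. The sketch below assumes that a real heading is always a number, a dot, whitespace, and at least two capital letters. That assumption, the sample `text`, and the names `loose` and `strict` are hypothetical and not taken from the original answer.

```python
import re

# Hypothetical sample whose content contains decimal numbers.
text = ("1. TITLE ABC Costs 3.5 units "
        "2. TITLE BCD Uses version 2.0 only "
        "3. TITLE CDC End")

# Pattern from the answer: stops at any "digit(s)." sequence.
loose = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)

# Stricter variant (an assumption): a section boundary must look like
# "number. CAPITALS", both where a match starts and in the lookahead.
strict = re.findall(r'(\d+\.\s[A-Z]{2,}[\s\S]*?)(?=\d+\.\s[A-Z]{2,}|$)', text)

print(len(loose))   # more than three fragments
print(strict)       # three complete sections
```

If the content can never contain a number followed by a dot, the shorter pattern is sufficient.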


Note: you can even replace [\s\S]+? with .*? if you use the single-line flag (re.DOTALL in Python), since . will then match newlines as well.
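A short sketch of the flag mentioned in the note, comparing the character-class version with the `.*?` version. The sample `text` and variable names are illustrative.

```python
import re

text = ("1. TITLE ABC Contents of title ABC and some other text "
        "2. TITLE BCD This would have contents on \n"
        "title BCD and maybe something else "
        "3. TITLE CDC Contents of title cdc")

# [\s\S] matches any character, newline included, with no flag needed.
with_class = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)

# With re.DOTALL, '.' also matches newlines, so .*? behaves the same way here.
with_flag = re.findall(r'(\d\..*?)(?=\d+\.|$)', text, flags=re.DOTALL)

# Without the flag, '.' stops at the newline and the second section is lost.
without_flag = re.findall(r'(\d\..*?)(?=\d+\.|$)', text)

print(with_class == with_flag)  # both give the same three sections
print(len(without_flag))        # fewer sections without the flag
```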

You can use re.findall together with re.split:

import re
s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"
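# t holds the headings ("1. TITLE", ...); c holds the text that follows each heading
t, c = re.findall(r'\d+\.\s[A-Z]+', s), list(filter(None, re.split(r'\d+\.\s[A-Z]+', s)))
result = [f'{a}{b}' for a, b in zip(t, c)]
print(result)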

Output:

['1. TITLE ABC Contents of title ABC and some other text ', '2. TITLE BCD This would have contents on title BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
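As a possible simplification of the findall-plus-split idea, `re.split` can split on a zero-width lookahead so that each heading stays attached to its content. This sketch assumes Python 3.7 or later, where `re.split` splits on empty matches. The `(?<!\d)` guard and the name `parts` are additions made here, not part of the original answer.

```python
import re

s = ("1. TITLE ABC Contents of title ABC and some other text "
     "2. TITLE BCD This would have contents on title BCD and maybe something else "
     "3. TITLE CDC Contents of title cdc")

# Split just before every "number. CAPITAL" heading without consuming it.
# (?<!\d) prevents a split in the middle of a multi-digit number such as "12.".
# Requires Python 3.7+ (splitting on empty matches).
parts = [p for p in re.split(r'(?<!\d)(?=\d+\.\s[A-Z])', s) if p]

for part in parts:
    print(repr(part))
```

The list comprehension drops the empty string that appears before the first heading.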
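
Alternatively, add lowercase letters and digits to the character class of your original regex:

import re

a = r'1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'
# (?:\s|$) lets the last section end at the end of the string;
# a mandatory trailing \s would cut off its final word.
res = re.findall(r'(\d\.\s[A-Za-z0-9\s]*(?:\s|$))', a)
for e in res:
    print(e)

Output

1. TITLE ABC Contents of 2title ABC and some other text 
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc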
