Extracting matching groups from a string with a Python regular expression

Question

I am trying to extract matching groups from a string in Python, but I am running into problems.

The string looks like this.

1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc

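I need to treat anything that starts with a number followed by capital letters as a title, and extract the contents that belong to that title.

This is the output I expect.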

1. TITLE ABC Contents of title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc

I tried the following regular expression

(\d\.\s[A-Z\s]*\s)

and got the following.

1. TITLE ABC 
2. TITLE BCD 
3. TITLE CDC

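If I add .* to the end of the regular expression, the match groups are affected. I think I am missing something simple here. I have tried everything I know but could not solve it.

Thanks for your help.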

Answer 1

Use (\d+\.[\da-z]* [A-Z]+[\S\s]*?(?=\d+\.|$))

Here is the relevant code

import re
text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

# Adjacent raw string literals are concatenated into a single pattern.
result = re.findall(r'('
                    r'\d+\.'   # Match a number and a '.' character
                    r'[\da-z]*' # If present include any additional numbers/letters
                    r'(?:\.[\da-z]+)*' # Match additional subpoints.
                                       # Each of these subpoints must start with a '.'
                                       # And then have one or more numbers/letters
                    r' '   # Match a space. This is how we know to stop looking for subpoints,
                           # and to start looking for capital letters
                    r'[A-Z]+'  # Match at least one capital letter.
                               # Use [A-Z]{2,} to match 2 or more capital letters
                    r'[\S\s]*?'  # Lazily match everything, including newlines.
                                 # Use .*? if you don't care about matching newlines
                    r'(?=\d+\.|$)'  # Stop matching at a number and a '.' character,
                                    # or stop matching at the end of the string,
                                    # and don't include this match in the results.
                    r')'
                    , text)
print(result)

The inline comments above explain each part of the regular expression in more detail.
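For readability, the same idea can also be written as a single pattern with re.VERBOSE, which allows whitespace and comments inside the regex. This is only a sketch: the names SECTION_RE and sections are made up for illustration, the subpoint part is written with + as an assumption, and the whitespace normalisation at the end is an extra step to get the one-line-per-title output the question asks for.

```python
import re

text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

# SECTION_RE is a hypothetical name used only for this example.
SECTION_RE = re.compile(r"""
    \d+\.             # a number followed by a dot
    [\da-z]*          # optional extra digits or lowercase letters
    (?:\.[\da-z]+)*   # optional subpoints such as .2 or .3a
    [ ]               # a literal space (plain spaces are ignored in VERBOSE mode)
    [A-Z]+            # at least one capital letter
    [\S\s]*?          # anything, including newlines, matched lazily
    (?=\d+\.|$)       # stop before the next numbered item or at the end
""", re.VERBOSE)

# Collapse runs of whitespace (including the newline) and trim each section.
sections = [" ".join(m.split()) for m in SECTION_RE.findall(text)]

for section in sections:
    print(section)
```

Because the pattern has no capturing group, findall returns the whole match for each section.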

Answer 2

In your regular expression, the character class is missing lowercase letters, so it only matches the uppercase words.

You can simply use this

(\d\.[\s\S]+?)(?=\d+\.|$)

Sample code

import re
text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""
# Raw string so the backslashes reach the regex engine unchanged.
result = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)
print(result)

Output

['1. TITLE ABC Contents of 14 title ABC and some other text ', '2. TITLE BCD This would have contents on \ntitle BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']


Note: you can even replace [\s\S]+? with .*? if you use the single-line (DOTALL) flag, which makes . match newline characters as well.
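A small sketch of that note, using a shortened made-up input string. It uses .+? as the direct counterpart of [\s\S]+? and compares the result with and without re.DOTALL; the variable names are illustrative only.

```python
import re

# Shortened, made-up input with a newline inside the first item.
text = "1. TITLE ABC first line\ncontinues here 2. TITLE BCD second item"

pattern_class = r'(\d\.[\s\S]+?)(?=\d+\.|$)'  # [\s\S] matches any character
pattern_dot = r'(\d\..+?)(?=\d+\.|$)'         # . needs DOTALL to match '\n'

with_class = re.findall(pattern_class, text)
with_dotall = re.findall(pattern_dot, text, flags=re.DOTALL)
without_flag = re.findall(pattern_dot, text)

print(with_class)
print(with_dotall)
print(without_flag)
```

Without the flag, the first item cannot be matched because . stops at the newline, so only the second item is returned.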

Answer 3

You can use re.findall together with re.split:

import re
s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"
pattern = r'\d+\.\s[A-Z]+'
t = re.findall(pattern, s)  # the "N. WORD" prefixes
c = list(filter(None, re.split(pattern, s)))  # text after each prefix, empty strings dropped
result = [f'{a}{b}' for a, b in zip(t, c)]
print(result)

Output:

['1. TITLE ABC Contents of title ABC and some other text ', '2. TITLE BCD This would have contents on title BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
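A related variant, shown here only as a sketch and not as part of the answer above: splitting on the whitespace that precedes each numbered title, using a lookahead so the title itself stays in the result. It assumes every title starts with a number, a dot, a whitespace character and a capital letter.

```python
import re

s = ("1. TITLE ABC Contents of title ABC and some other text "
     "2. TITLE BCD This would have contents on title BCD and maybe something else "
     "3. TITLE CDC Contents of title cdc")

# Split on whitespace only when it is followed by "<number>. <capital letter>".
# The lookahead is not consumed, so each part keeps its own title.
parts = re.split(r'\s+(?=\d+\.\s[A-Z])', s)

for part in parts:
    print(part)
```

Since the separating whitespace is consumed by the split, the parts come back without trailing spaces.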

Answer 4

import re
a = '1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'
# The character class allows letters, digits and whitespace, but no punctuation.
res = re.findall(r'(\d\.\s[A-Za-z0-9\s]*\s)', a)
for e in res:
    print(e)

Output

1. TITLE ABC Contents of 2title ABC and some other text 
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title
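Note that the last line of the output stops at "title": the pattern requires a whitespace character after the matched text, and nothing follows "cdc". One possible tweak, sketched here under the assumption that the contents contain only letters, digits and whitespace, is to also accept the end of the string.

```python
import re

a = ('1. TITLE ABC Contents of 2title ABC and some other text '
     '2. TITLE BCD This would have contents on title BCD and maybe something else '
     '3. TITLE CDC Contents of title cdc')

# (?:\s|$) accepts either a trailing whitespace character or the end of the
# string, so the final word of the last item is no longer dropped.
res = [m.strip() for m in re.findall(r'\d\.\s[A-Za-z0-9\s]*(?:\s|$)', a)]

for e in res:
    print(e)
```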

4 answers

Answer 1
Score 2, accepted, 2019-09-17 01:59:12

Answer 2
Score 1, 2019-09-17 01:55:45

Answer 3
Score 0, 2019-09-17 01:51:00

Answer 4
Score 0, 2019-09-17 02:09:32
