Extract matching groups from string python regex
I am trying to extract matching groups from a string in Python, but I am running into problems.
The string looks like this.
1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc
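I need anything that starts with a number followed by capital letters to be treated as a title, and I want to extract the contents that belong to that title.
This is the output I expect.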
1. TITLE ABC Contents of title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else
3. TITLE CDC Contents of title cdc
I tried the following regular expression
(\d\.\s[A-Z\s]*\s)
and got the following.
1. TITLE ABC
2. TITLE BCD
3. TITLE CDC
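If I try to add .* to the end of the regex, the matching groups are affected. I think I am missing something simple here. I have tried everything I know but could not solve it.
Thanks for your help.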
Use (\d+\.[\da-z]* [A-Z]+[\S\s]*?(?=\d+\.|$))
Below is the relevant code
import re

text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

# Raw strings keep the backslashes intact; adjacent literals are concatenated.
result = re.findall(
    r'('
    r'\d+\.'           # Match a number and a '.' character
    r'[\da-z]*'        # If present, include any additional numbers/letters
    r'(?:\.[\da-z])*'  # Match additional subpoints.
                       # Each subpoint starts with a '.'
                       # followed by one digit or lowercase letter
    r' '               # Match a space. This is how we know to stop looking for subpoints,
                       # and to start looking for capital letters
    r'[A-Z]+'          # Match at least one capital letter.
                       # Use [A-Z]{2,} to match 2 or more capital letters
    r'[\S\s]*?'        # Match everything including newlines.
                       # Use .*? if you don't care about matching newlines
    r'(?=\d+\.|$)'     # Stop matching at a number and a '.' character,
                       # or stop matching at the end of the string,
                       # and don't include this match in the results.
    r')',
    text)
print(result)
The comments in the code above give a more detailed explanation of each part of the regex.
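The same commented pattern can also be laid out with the re.VERBOSE flag instead of concatenated string literals. This is only a sketch of an alternative layout, not part of the original answer; the names pattern and sections are chosen here for illustration. Under re.VERBOSE, unescaped whitespace in the pattern is ignored, so the literal space is written as [ ].

```python
import re

text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

# Same regex as above, written as one verbose pattern (illustrative layout).
pattern = re.compile(r"""
    (
    \d+\.            # a number followed by a dot
    [\da-z]*         # optional extra digits or lowercase letters
    (?:\.[\da-z])*   # optional subpoints such as .1 or .a
    [ ]              # a literal space (a bare space is ignored in verbose mode)
    [A-Z]+           # at least one capital letter
    [\S\s]*?         # everything, including newlines, as few characters as possible
    (?=\d+\.|$)      # stop before the next number and dot, or at the end of the string
    )
""", re.VERBOSE)

sections = pattern.findall(text)
for section in sections:
    print(repr(section))
```

Compiling the pattern once also makes it easy to reuse on several strings.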
In your regex, the character class is missing lowercase letters, so it only matches the uppercase words.
You can simply use this
(\d\.[\s\S]+?)(?=\d+\.|$)
Sample code
import re

text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""
# Raw string so that \d and \s reach the regex engine unchanged
result = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)
print(result)
Output
['1. TITLE ABC Contents of 14 title ABC and some other text ', '2. TITLE BCD This would have contents on \ntitle BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
Note: you can even replace [\s\S]+? with .*? if you are using the single-line flag (re.DOTALL in Python), so that . also matches newlines.
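A minimal sketch of that note, assuming Python's re.DOTALL as the single-line flag. The names pattern_class, pattern_dotall, sections_class and sections_dotall are illustrative only. On this input both patterns should return the same three sections.

```python
import re

text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

# Variant 1: [\s\S] matches any character, newlines included, without a flag.
pattern_class = re.compile(r'(\d\.[\s\S]+?)(?=\d+\.|$)')

# Variant 2: with re.DOTALL, '.' also matches newlines.
pattern_dotall = re.compile(r'(\d\..*?)(?=\d+\.|$)', re.DOTALL)

sections_class = pattern_class.findall(text)
sections_dotall = pattern_dotall.findall(text)

print(sections_class == sections_dotall)
```

Without re.DOTALL, the second pattern would stop at the line break inside the second section.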
You can use re.findall together with re.split:
import re
s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"
pattern = r'\d+\.\s[A-Z]+'
t = re.findall(pattern, s)                    # title prefixes such as '1. TITLE'
c = list(filter(None, re.split(pattern, s)))  # text between the prefixes, empty strings dropped
result = [f'{a}{b}' for a, b in zip(t, c)]    # glue each prefix back onto its text
print(result)
Output:
['1. TITLE ABC Contents of title ABC and some other text ', '2. TITLE BCD This would have contents on title BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
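The same split can probably be done in one call by splitting on a zero-width lookahead, so the title prefix stays attached to its section. This is a sketch rather than part of the original answer, and it assumes Python 3.7 or later, where re.split accepts patterns that match the empty string. The name sections is illustrative.

```python
import re

s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"

# Split just before every "<number>. <capital letter>" without consuming any text.
parts = re.split(r'(?=\d+\.\s[A-Z])', s)

# The split at position 0 leaves an empty first element; drop it.
sections = [p for p in parts if p]

for section in sections:
    print(section)
```

Because the lookahead consumes nothing, there is no need to zip titles and contents back together afterwards.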
import re

a = r'1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'
# The character class now includes lowercase letters and digits
res = re.findall(r'(\d\.\s[A-Za-z0-9\s]*\s)', a)
for e in res:
    print(e)
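Output (the final word, cdc, is dropped because the pattern requires a whitespace character at the end of each match)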
1. TITLE ABC Contents of 2title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else
3. TITLE CDC Contents of title
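One possible way to keep the last word is to let each match end either at a whitespace character or at the end of the string. This is a sketch of a variation, not the original answer's code, and it shares the same limitation: the content may only contain letters, digits and whitespace.

```python
import re

a = r'1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'

# (?:\s|$) accepts trailing whitespace or the end of the string,
# so the final section is no longer cut short.
res = re.findall(r'\d\.\s[A-Za-z0-9\s]*(?:\s|$)', a)

for e in res:
    print(e.strip())
```

If the content can contain punctuation, the lookahead-based patterns from the other answers are likely a better fit.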