Capture all occurrences of a substring after specific text with a Python regex

Question

I have a long document in which the lines I am interested in start with Categories:. I want to find all the words separated by , after Categories:. Here is an example line:

Categories : Turbo Prop , Very Light , Light , Mid Size

I want to find the start index and end index of each of Turbo Prop, Very Light, Light, and Mid Size.

I am using the following code:

regex_pattern = r"(?<=Categories : )([A-Za-z ]+(?:,)?)+"

matched_text = regex.search(regex_pattern,doc_tex)

But matched_text.groups() only gives Mid Size. In short, I want to find all occurrences of group 1 after Categories.

Answer 1

Do it in two steps. First split the line on :, then split the second part on ,.

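line = "Categories : Turbo Prop , Very Light , Light , Mid Size"
# Split once on the colon, then split the remainder on commas and trim whitespace.
category_string = line.split(':', 1)[1]
categories = [c.strip() for c in category_string.split(',')]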
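
The question also asks for start and end indices, which a plain split does not return. The following is a minimal sketch of one way to keep the two-step idea and still recover the offsets, using only the standard library. The helper name category_spans is hypothetical, and the sketch assumes the line contains a single label followed by a colon.

```python
import re

def category_spans(line):
    """Return (text, start, end) for each comma-separated item after the first colon."""
    head, sep, tail = line.partition(":")
    if not sep:
        return []  # no colon, so nothing to split
    offset = len(head) + 1  # index in `line` of the first character after the colon
    spans = []
    for m in re.finditer(r"[^,]+", tail):
        raw = m.group()
        stripped = raw.strip()
        if not stripped:
            continue  # skip items that are only whitespace
        # Shift the start past any leading whitespace in the raw item.
        start = offset + m.start() + (len(raw) - len(raw.lstrip()))
        spans.append((stripped, start, start + len(stripped)))
    return spans

line = "Categories : Turbo Prop , Very Light , Light , Mid Size"
print(category_spans(line))
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]
```

The indices are relative to the start of the line, so for a whole document you would add the offset of the line itself.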

Answer 2

It looks like the comments answered the OP's question, but for completeness I thought I'd post the answer they discuss. Python's re module does not store all instances of a repeated capture group; see issue 7132. The regex package, however, adds methods to handle repeated capture groups, including:

captures - Returns a list of the strings matched in a group or groups.
starts - Returns a list of the start positions.
ends - Returns a list of the end positions.
spans - Returns a list of the spans. Compare with matchobject.span([group]).

Hence, using the regex package with the matchobject.starts and matchobject.ends methods should work.
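
To illustrate the limitation of the standard re module described in this answer, here is a small sketch. The pattern and sample text are chosen for illustration; the point is only that a repeated group reports its last repetition.

```python
import re

text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
# Group 1 sits inside a repeated non-capturing group, so it matches four times.
pattern = r"Categories\s*:(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+"

m = re.search(pattern, text)
# The standard library keeps only the final repetition of group 1.
print(m.group(1))  # => Mid Size
print(m.span(1))   # => (47, 55)
```

The earlier repetitions are matched but not retained, which is why the per-repetition methods of the regex package are needed to list them all from a single match.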

Answer 3

As you are using the PyPI regex module, you can get all captures per group, together with their start and end indices, using

import regex
text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
regex_pattern = r"Categories\s*:(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+"
m = regex.search(regex_pattern, text)
result = list(zip(m.captures(1),m.starts(1),m.ends(1)))
print(result) 
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]

See the Python demo.

More details from the PyPI regex documentation:

A match object has additional methods which return information on all the successful matches of a repeated capture group. These methods are:

matchobject.captures([group1, ...])

Returns a list of the strings matched in a group or groups. Compare with matchobject.group([group1, ...]).

matchobject.starts([group])

Returns a list of the start positions. Compare with matchobject.start([group]).

matchobject.ends([group])

Returns a list of the end positions. Compare with matchobject.end([group]).

matchobject.spans([group])

Returns a list of the spans. Compare with matchobject.span([group]).

Note that I had to revamp your regex a bit:

Categories\s*: - matches Categories, zero or more whitespace chars, and a :
(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+ - one or more repetitions of
- \s* - zero or more whitespace chars
- ([A-Za-z ]+) - one or more ASCII letters or spaces
- \b - a word boundary (so the Group 1 value will end with a letter)
- (?:\s*,)? - an optional sequence of zero or more whitespace chars and a comma.
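
The breakdown above can also be written into the pattern itself with verbose mode. This is a sketch using the standard re module, so only the last repetition of Group 1 is retained; the constant name CATEGORIES is illustrative. As far as I know the regex package accepts the same flag, in which case captures(1) would still list every item.

```python
import re

# In verbose mode, whitespace outside character classes is ignored,
# so the space inside [A-Za-z ] is still matched literally.
CATEGORIES = re.compile(r"""
    Categories\s*:        # the label, optional whitespace, then a colon
    (?:                   # one item per repetition
        \s*               # leading whitespace
        ([A-Za-z ]+)\b    # Group 1: letters and spaces, ending on a letter
        (?:\s*,)?         # optional whitespace and comma separator
    )+
""", re.VERBOSE)

text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
m = CATEGORIES.search(text)
print(m.group(0) == text)  # => True, the whole line is consumed
print(m.group(1))          # => Mid Size
```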

Answer 1
1 2021-09-25 21:43:54

Answer 2
0 2021-09-26 04:01:29

Answer 3
0 2021-10-09 22:16:02