Capture all occurrences of a substring after specific text with regex in Python
I have a long document in which the line of interest starts with Categories: and I want to find all the words that follow it, separated by commas. Here's an example line:
Categories : Turbo Prop , Very Light , Light , Mid Size
I want to find the start index and end index of Turbo Prop, Very Light, Light, and Mid Size. I am using the following code:
regex_pattern = r"(?<=Categories : )([A-Za-z ]+(?:,)?)+"
matched_text = regex.search(regex_pattern,doc_tex)
But matched_text.groups() only gives Mid Size. In short, I want to find all occurrences of group 1 after Categories.
Do it in two steps. First split the line on the colon (:), then split the second part on the comma (,).
category_string = line.split(':')[1]
categories = category_string.split(',')
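Note that the two splits above return the category strings but not their positions, and the pieces keep their surrounding spaces. If the start and end indices are needed as well, the same idea can be extended by tracking offsets by hand. This is only a minimal sketch: the helper name category_spans is made up for illustration, and it assumes that everything after the first colon belongs to the category list.

```python
def category_spans(line):
    """Return (text, start, end) for each comma-separated item after the first colon."""
    head, sep, tail = line.partition(':')
    if not sep:
        return []  # no colon, so nothing to split
    offset = len(head) + len(sep)  # index in `line` where `tail` begins
    spans = []
    for part in tail.split(','):
        stripped = part.strip()
        if stripped:
            # leading whitespace inside this piece shifts the start index
            start = offset + (len(part) - len(part.lstrip()))
            spans.append((stripped, start, start + len(stripped)))
        offset += len(part) + 1  # +1 for the comma consumed by split
    return spans


line = "Categories : Turbo Prop , Very Light , Light , Mid Size"
print(category_spans(line))
```

The end index is exclusive, so line[start:end] gives back the category text.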
It looks like the comments answered the OP's question, but for completeness I thought I'd post the answer they discuss. Python's re module does not store all instances of a repeated capture group; see issue 7132. The regex package, however, adds additional methods to handle repeated capture groups. Hence, using the regex package with the matchedobject.starts and matchedobject.ends methods should work.
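To see the limitation in the standard library, here is a small sketch using only re. The pattern is an illustrative one written for this example (not the OP's pattern): it repeats a capture group once per category, and after the match only the last repetition is available.

```python
import re

text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
# Illustrative pattern: group 1 is repeated once per comma-separated category
pattern = r"Categories\s*:(?:\s*([A-Za-z]+(?: [A-Za-z]+)*)\s*,?)+"

m = re.search(pattern, text)
# re keeps only the final iteration of a repeated group
print(m.group(1))   # Mid Size
print(m.span(1))    # (47, 55)
print(m.groups())   # ('Mid Size',)
```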
As you are using the PyPi regex module, you can get all captures per group, together with their start and end indices, using
import regex
text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
regex_pattern = r"Categories\s*:(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+"
m = regex.search(regex_pattern, text)
result = list(zip(m.captures(1),m.starts(1),m.ends(1)))
print(result)
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]
See the Python demo.
More details from the PyPi regex documentation:
A match object has additional methods which return information on all the successful matches of a repeated capture group. These methods are:
matchobject.captures([group1, ...]) - Returns a list of the strings matched in a group or groups. Compare with matchobject.group([group1, ...]).
matchobject.starts([group]) - Returns a list of the start positions. Compare with matchobject.start([group]).
matchobject.ends([group]) - Returns a list of the end positions. Compare with matchobject.end([group]).
matchobject.spans([group]) - Returns a list of the spans. Compare with matchobject.span([group]).
Note I had to revamp your regex a bit:
Categories\s*: - matches Categories, zero or more whitespaces, and :
(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+ - one or more repetitions of
  \s* - zero or more whitespace chars
  ([A-Za-z ]+) - Group 1: one or more ASCII letters or spaces
  \b - a word boundary (so the Group 1 value will end with a letter)
  (?:\s*,)? - an optional sequence of zero or more whitespace chars and a comma.
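If installing the third-party regex module is not an option, a similar result can be obtained with the standard re module by matching in two passes instead of relying on repeated captures. This is a different technique from the one above, sketched under the assumption that each category consists of ASCII letters separated by spaces; the variable names are illustrative.

```python
import re

text = "Categories : Turbo Prop , Very Light , Light , Mid Size"

# Pass 1: locate the text that follows "Categories :" on its line
header = re.search(r"^Categories\s*:(.*)$", text, flags=re.MULTILINE)

result = []
if header:
    base = header.start(1)  # offset of the captured tail within `text`
    # Pass 2: each item is a run of letters, optionally with inner spaces
    for item in re.finditer(r"[A-Za-z]+(?: +[A-Za-z]+)*", header.group(1)):
        result.append((item.group(), base + item.start(), base + item.end()))

print(result)
```

Because the second pass runs on the captured tail, its positions are shifted by base so that they refer to the original text.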