
Capture all occurrences of substring after specific text regex python

I have a long document in which the line I am interested in starts with Categories:. I want to find all the words separated by , after Categories:. Here is an example line:

Categories : Turbo Prop , Very Light , Light , Mid Size

I want to find the start index and end index of each of Turbo Prop, Very Light, Light and Mid Size.

I am using the following code:

regex_pattern = r"(?<=Categories : )([A-Za-z ]+(?:,)?)+"

matched_text = regex.search(regex_pattern,doc_tex)

But matched_text.groups() only gives Mid Size. In short, I want to find all occurrences of group 1 after Categories.

Do it in two steps. First split the line on :, then split the second part on ,.

category_string = line.split(':')[1]
categories = category_string.split(',')
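The two-step split returns the category strings but not their positions, and each piece keeps its surrounding whitespace. The following is a minimal sketch of how the same idea could be extended to recover start and end indices within the line. The helper name category_spans is hypothetical, and the sketch assumes the first colon on the line is the one after Categories.

```python
def category_spans(line):
    """Return (name, start, end) for each comma-separated item after the first colon."""
    head, sep, tail = line.partition(":")
    if not sep:
        return []  # no colon on this line, so nothing to split
    spans = []
    pos = len(head) + len(sep)  # index in `line` where `tail` begins
    for piece in tail.split(","):
        name = piece.strip()
        if name:
            # piece.index(name) is the number of leading whitespace characters
            start = pos + piece.index(name)
            spans.append((name, start, start + len(name)))
        pos += len(piece) + 1  # +1 for the comma consumed by split
    return spans


line = "Categories : Turbo Prop , Very Light , Light , Mid Size"
print(category_spans(line))
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]
```

The indices are relative to the line, so for a long document the offset of the line within the document would have to be added.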

It looks like the comments answered the OP's question, but for completeness I thought I'd post the answer they discuss. Python's re module does not store all instances of a repeated capture group; see issue 7132. The regex package, however, adds methods to handle repeated capture groups, including:

  • captures - Returns a list of the strings matched in a group or groups.
  • starts - Returns a list of the start positions.
  • ends - Returns a list of the end positions.
  • spans - Returns a list of the spans. Compare with matchobject.span([group]).

Hence, using the regex package with the matchobject.starts and matchobject.ends methods should work.
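To see the limitation described above using only the standard library, here is a small sketch that runs the pattern from the question through re. The variable names are illustrative. It shows that a repeated capture group keeps only its final iteration.

```python
import re

text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
pattern = r"(?<=Categories : )([A-Za-z ]+(?:,)?)+"

m = re.search(pattern, text)
last_only = m.groups()

# Group 1 is overwritten on every repetition, so only the last one survives.
print(last_only)  # => (' Mid Size',)
print(m.span(1))  # => (46, 55)
```

Note that the surviving value also carries a leading space, because the character class in the question's pattern includes the space character.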

As you are using the PyPI regex module, you can get all captures per group, together with their start and end indices, using:

import regex
text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
regex_pattern = r"Categories\s*:(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+"
m = regex.search(regex_pattern, text)
result = list(zip(m.captures(1),m.starts(1),m.ends(1)))
print(result) 
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]

See the Python demo.

More details from the PyPI regex documentation:

A match object has additional methods which return information on all the successful matches of a repeated capture group. These methods are:

  • matchobject.captures([group1, ...])
    • Returns a list of the strings matched in a group or groups. Compare with matchobject.group([group1, ...]).
  • matchobject.starts([group])
    • Returns a list of the start positions. Compare with matchobject.start([group]).
  • matchobject.ends([group])
    • Returns a list of the end positions. Compare with matchobject.end([group]).
  • matchobject.spans([group])
    • Returns a list of the spans. Compare with matchobject.span([group]).

Note that I had to revamp your regex a bit:

  • Categories\s*: - matches Categories, zero or more whitespace chars, and a colon
  • (?:\s*([A-Za-z ]+)\b(?:\s*,)?)+ - one or more repetitions of
    • \s* - zero or more whitespace chars
    • ([A-Za-z ]+) - Group 1: one or more ASCII letters or spaces
    • \b - a word boundary (so the Group 1 value will end with a letter)
    • (?:\s*,)? - an optional sequence of zero or more whitespace chars and a comma.
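If installing the third-party regex package is not an option, a comparable result can be approximated with the standard re module by matching the Categories line first and then scanning only that span for items. This is a sketch rather than an equivalent of the pattern above: the names LINE_RE, ITEM_RE and find_categories are hypothetical, and it assumes items consist of ASCII letters separated by single spaces.

```python
import re

# Step 1: locate each "Categories : ..." line; group 1 is the rest of the line.
LINE_RE = re.compile(r"^Categories\s*:(.*)$", re.MULTILINE)
# Step 2: an item is one or more words of ASCII letters joined by single spaces.
ITEM_RE = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)*")


def find_categories(doc):
    """Return (name, start, end) tuples with offsets relative to the whole document."""
    results = []
    for line in LINE_RE.finditer(doc):
        start, end = line.span(1)
        # Passing pos/endpos restricts the scan but keeps document-level offsets.
        for item in ITEM_RE.finditer(doc, start, end):
            results.append((item.group(), item.start(), item.end()))
    return results


doc = "Title\nCategories : Turbo Prop , Very Light , Light , Mid Size\nOther: x, y\n"
for name, start, end in find_categories(doc):
    print(name, start, end)
```

Because the offsets refer to the full document, doc[start:end] returns each category name directly.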
