
Regular expression only captures the last occurrence of a repeated group

I am trying to capture multiple "<attribute> = <value>" pairs with a Python regular expression from a string like this:

  some(code) ' <tag attrib1="some_value" attrib2="value2"                   en=""/>

The regular expression '\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")* is intended to match those pairs multiple times, i.e. return something like

"attrib1", "some_value", "attrib2", "value2", "en", ""

but it only captures the last occurrence:

>>> import re
>>> re.search("'\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")*", '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>').groups()
('en', '')

Focusing on <attrib>="<value>" works:

>>> re.findall("(?:\s*(\w+)\s*=\"(.*?)\")", '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>')
[('attrib1', 'some_value'), ('attrib2', 'value2'), ('en', '')]

so a pragmatic solution might be to test "<tag" in string before running this regular expression, but...

Why does the original regex only capture the last occurrence, and what needs to be changed to make it work as intended?

This is just how regex works: you defined one capturing group, so there is only one capturing group. When it first captures something and then captures another thing, the first captured item is replaced. That's why you only get the last captured one.
I am not aware of a solution for that...
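
The overwriting described above can be reproduced with a much smaller case. This is only an illustrative sketch: the pattern and the input string are made up for the demonstration and are not taken from the question.

```python
import re

# A capturing group inside a repeated construct is overwritten on
# every iteration, so only the final iteration's text survives.
m = re.match(r"(?:(\w)-)+", "a-b-c-")

print(m.group(0))  # the whole match: a-b-c-
print(m.group(1))  # only the last iteration of the group: c
print(m.groups())  # ('c',)
```

The overall match still covers all three repetitions; only the per-group capture is limited to the last one.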

Unfortunately this is not possible with Python's re module. But the third-party regex module provides captures and capturesdict methods for that:

>>> import regex  # third-party module, installed separately from re
>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}
>>> m.captures("word")
['one', 'two', 'three']
>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']}

According to the documentation, search returns only one occurrence. The findall method returns all occurrences in a list. That is what you need to use, as in your second example.
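
One way to keep the "<tag" requirement while still using findall is to split the work into two steps with the standard re module. The sketch below is an assumption about how that could look, not something taken from the answers: the variable names are hypothetical, and the value pattern uses [^"]* instead of the question's (.*?) so that the first step cannot run past a closing quote.

```python
import re

text = '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>'

# Step 1: locate the tag and capture its whole attribute section as
# a single group (the repeated part itself is non-capturing).
tag = re.search(r"'\s*<tag((?:\s*\w+\s*=\"[^\"]*\")*)", text)

# Step 2: run findall on that section only, so attribute-like text
# elsewhere in the string is ignored.
pairs = re.findall(r'(\w+)\s*="([^"]*)"', tag.group(1)) if tag else []

print(pairs)
# [('attrib1', 'some_value'), ('attrib2', 'value2'), ('en', '')]
```

If the tag is absent, pairs is simply an empty list, which replaces the separate "<tag" in string test mentioned in the question.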
