Regular expression only captures the last occurrence of repeated group
I am trying to capture multiple "<attribute> = <value>" pairs with a Python regular expression from a string like this:
some(code) ' <tag attrib1="some_value" attrib2="value2" en=""/>
The regular expression '\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")*
is intended to match those pairs multiple times, i.e. return something like
"attrib1", "some_value", "attrib2", "value2", "en", ""
but it only captures the last occurrence:
>>> import re
>>> re.search(r"'\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")*", ' some(code) \' <tag attrib1="some_value" attrib2="value2" en=""/>').groups()
('en', '')
Focusing on <attrib>="<value>" works:
>>> re.findall(r"(?:\s*(\w+)\s*=\"(.*?)\")", ' some(code) \' <tag attrib1="some_value" attrib2="value2" en=""/>')
[('attrib1', 'some_value'), ('attrib2', 'value2'), ('en', '')]
so a pragmatic solution might be to test "<tag" in string before running this regular expression, but...
Why does the original regex only capture the last occurrence, and what needs to be changed to make it work as intended?
This is just how regex works: you defined one capturing group, so there is only one capturing group. When it first captures something and then captures another thing, the first captured item is replaced. That's why you only get the last captured one.
There is no solution for that that I am aware of...
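The overwriting described above can be seen on a much smaller pattern. This is only an illustrative sketch: the pattern and the input string are made up for the demonstration and are not taken from the question.

```python
import re

# One capturing group inside a repeated non-capturing group.
# The group is matched three times, but only one slot exists for it.
m = re.match(r"(?:(\w)-)+", "a-b-c-")

print(m.group(0))  # the whole match covers all three repetitions: a-b-c-
print(m.groups())  # only the last repetition is kept: ('c',)
print(m.span(1))   # position of that last repetition: (4, 5)
```

The overall match spans every repetition, yet group 1 only reports the final one, which is the behaviour seen with ('en', '') in the question.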
Unfortunately this is not possible with Python's re module. But regex provides captures and capturesdict functions for that:
>>> import regex  # third-party module, not part of the standard library
>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}
>>> m.captures("word")
['one', 'two', 'three']
>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']}
According to the documentation, search will return only one occurrence. The findall method returns all occurrences in a list. That is what you need to use, like in your second example.
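One way to combine both ideas with the standard re module is a two-step approach: first match the tag and capture the whole run of attributes as a single string, then run findall over that string only. The following is a minimal sketch, not a general HTML/XML parser; the names TAG_RE, ATTR_RE and tag_attributes are hypothetical and introduced here for illustration.

```python
import re

# Step 1: locate the tag and capture the entire attribute run in ONE group.
# The repeated part is non-capturing, so nothing gets overwritten.
TAG_RE = re.compile(r"'\s*<tag((?:\s*\w+\s*=\".*?\")*)")

# Step 2: the per-attribute pattern from the question.
ATTR_RE = re.compile(r"\s*(\w+)\s*=\"(.*?)\"")


def tag_attributes(text):
    """Return (name, value) pairs of the first matching tag, or [] if none."""
    m = TAG_RE.search(text)
    if m is None:
        return []
    return ATTR_RE.findall(m.group(1))


line = ' some(code) \' <tag attrib1="some_value" attrib2="value2" en=""/>'
print(tag_attributes(line))
# [('attrib1', 'some_value'), ('attrib2', 'value2'), ('en', '')]
```

Because findall only sees the text captured by the first pattern, attribute-like pairs elsewhere in the line are ignored, which replaces the separate "<tag" in string check.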