
Why does re.findall() find more matches than re.sub()?

Consider the following:

>>> import re
>>> a = "first:second"
>>> re.findall("[^:]*", a)
['first', '', 'second', '']
>>> re.sub("[^:]*", r"(\g<0>)", a)
'(first):(second)'

re.sub()'s behavior makes more sense initially, but I can also understand re.findall()'s behavior. After all, you can match an empty string between first and : that consists only of non-colon characters (exactly zero of them), so why isn't re.sub() behaving the same way?

Shouldn't the result of the last command be (first)():(second)()?

You use *, which allows empty matches:

'first'   -> matched
':'       -> not in the character class, but since * lets the pattern
             match nothing, an empty string is matched --> ''
'second'  -> matched
'$'       -> end of string: an empty string can still be matched
             just before it --> ''
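
To make the positions of these four matches visible, here is a minimal sketch that uses re.finditer() to list the start and end index of each match. The variable name spans is only illustrative, and the print call assumes Python 3.

```python
import re

a = "first:second"

# Collect (start, end, text) for every match of the pattern.
spans = [(m.start(), m.end(), m.group(0)) for m in re.finditer("[^:]*", a)]

for start, end, text in spans:
    # An empty match has start == end.
    print(start, end, repr(text))
```

The two empty matches sit at index 5 (just before the colon) and index 12 (the end of the string).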

Quoting the documentation for re.findall():

Empty matches are included in the result unless they touch the beginning of another match.

The reason you don't see empty matches in the sub() results is explained in the documentation for re.sub() (wording from before Python 3.7; since 3.7, empty matches adjacent to a previous non-empty match are replaced too):

Empty matches for the pattern are replaced only when not adjacent to a previous match.

Try this:

re.sub('(?:Choucroute garnie)*', '#', 'ornithorynque') 

And now this:

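print(re.sub('(?:nithorynque)*', '#', 'ornithorynque'))

There is no consecutive # in the result. (This holds for Python before 3.7; from 3.7 on, the empty match right after the non-empty match is replaced as well, so the result ends in ##.)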

The algorithms for handling empty matches are different, for some reason.

In the case of findall, it works like (an optimized version of) this: for every possible start index 0 <= i <= len(a), if the pattern matches at i, then append the match; and avoid overlapping results by using this rule: if there is a match of length m at i, don't look for the next match before i+m. The reason your example returns ['first', '', 'second', ''] is that the empty matches are found immediately after first and second, but not after the colon, because looking for a match starting from that position returns the full string second.
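
The scan described above can be sketched in plain Python. This is only an illustration of the idea, not the actual implementation inside the re module; the function name findall_sketch is hypothetical, and it ignores capture groups.

```python
import re

def findall_sketch(pattern, text):
    """Naive version of the scan: try to match at each index in turn."""
    regex = re.compile(pattern)
    results = []
    i = 0
    while i <= len(text):
        m = regex.match(text, i)  # try to match exactly at index i
        if m is None:
            i += 1
            continue
        results.append(m.group(0))
        # Skip past a non-empty match; step one character after an empty one.
        i = m.end() if m.end() > i else i + 1
    return results

print(findall_sketch("[^:]*", "first:second"))
```

For the pattern in the question this returns the same list as re.findall(); other patterns may differ, since the real engine is more involved.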

In the case of sub, the difference is, as you noticed, that it explicitly ignores matches of length 0 that occur immediately after another match. While I see why this might help avoid unexpected behavior of sub, I'm unsure why there is this difference (e.g. why wouldn't findall use the same rule).
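
The rule for sub can be sketched the same way. The function below is a hypothetical pure-Python emulation of "ignore an empty match that directly follows another match", written out by hand because the built-in re.sub() changed this behavior in Python 3.7. The name sub_skipping_adjacent_empty and the wrap callback are assumptions made for illustration.

```python
import re

def sub_skipping_adjacent_empty(pattern, wrap, text):
    """Replace matches, but skip an empty match that starts exactly
    where the previous match ended."""
    regex = re.compile(pattern)
    out = []
    i = 0
    prev_end = None  # end index of the previous accepted match, if any
    while i <= len(text):
        m = regex.match(text, i)
        is_adjacent_empty = (
            m is not None and m.start() == m.end() and m.start() == prev_end
        )
        if m is not None and not is_adjacent_empty:
            out.append(wrap(m.group(0)))
            prev_end = m.end()
            if m.end() > i:
                i = m.end()  # jump past a non-empty match
                continue
        # Nothing consumed the character at i: copy it unchanged.
        if i < len(text):
            out.append(text[i])
        i += 1
    return "".join(out)

print(sub_skipping_adjacent_empty("[^:]*", lambda s: "(" + s + ")", "first:second"))
```

With the pattern from the question, the empty matches at indexes 5 and 12 both start where a previous match ended, so they are skipped and the output is (first):(second).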

import re
a = "first:second:three"
print(re.findall("[^:]*", a))

re.findall() returns all substrings that match the pattern. Here, it gives

>>> 
['first', '', 'second', '', 'three', '']

sub() is for substitution, and will replace the left-most non-overlapping occurrences of the pattern with your replacement. For example:

import re
a = "first:second:three"
print(re.sub("[^:]*", r"smile", a))

gives (on Python before 3.7; from 3.7 on, the empty match after each word is replaced too)

>>> 
smile:smile:smile

You can limit the number of occurrences to be replaced with the 4th argument, count:
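
The example that followed here appears to be missing from this copy. As a stand-in, a minimal sketch of count, assuming the same string as above and passing count by keyword:

```python
import re

a = "first:second:three"

# count=1 stops after the first (left-most) replacement.
result = re.sub("[^:]*", "smile", a, count=1)
print(result)  # smile:second:three
```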
