
Why don't backreferences work in Python's re.sub when using a replacement function?

Using re.sub in Python 2.7, the following example uses a simple backreference:

import re
re.sub('-{1,2}', r'\g<0> ', 'pro----gram-files')

It outputs the following string, as expected:

'pro-- -- gram- files'

I would expect the following example to behave identically, but it does not:

def dashrepl(matchobj):
    return r'\g<0> '
re.sub('-{1,2}', dashrepl, 'pro----gram-files')

This gives the following unexpected output:

'pro\\g<0> \\g<0> gram\\g<0> files'

Why do the two examples give different output? Did I miss something in the documentation that explains this? Is there any particular reason that this behavior is preferable to what I expected? Is there a way to use backreferences in a replacement function?

There are simpler ways to achieve your goal, so you can use those instead.

As you have already seen, your replacement function receives a match object as its argument.

Among other things, this object has a group() method, which can be used instead:

def dashrepl(matchobj):
    return matchobj.group(0) + ' '

which will give exactly the result you expected.
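If you would rather keep the template syntax inside the function, match objects also provide an expand() method, which applies the same backslash substitution to a template string that re.sub applies to a string repl. A minimal sketch, reusing the question's pattern and input (the variable name result is only for illustration):

```python
import re

def dashrepl(matchobj):
    # expand() processes the template against this match,
    # so \g<0> is replaced by the whole matched text.
    return matchobj.expand(r'\g<0> ')

result = re.sub('-{1,2}', dashrepl, 'pro----gram-files')
# result == 'pro-- -- gram- files'
```

This is handy when the template is chosen at run time, since the function can pick a template per match and still get backreference handling.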


But you are completely right: the docs are a bit confusing on this point.

They describe the repl argument as follows:

repl can be a string or a function; if it is a string, any backslash escapes in it are processed.

and

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.

You could read this as if the processing of backslash escapes also applied to "the replacement string" returned by the function.

But since that processing is described only for the case where "it is a string", the intended meaning becomes clearer on a careful reading, although it is not obvious at first glance.
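One practical consequence of escapes being processed only for string arguments is that a function can return text containing backslashes without any extra escaping. The sketch below illustrates this with a made-up Windows-style path; the HOME placeholder, the path and the variable names are all hypothetical:

```python
import re

# String repl: backslash escapes are processed, so \n becomes a newline.
as_string = re.sub('HOME', r'C:\new', 'cd HOME')

# Function repl: the returned string is inserted verbatim.
as_function = re.sub('HOME', lambda m: r'C:\new', 'cd HOME')

# as_string   == 'cd C:' + '\n' + 'ew'
# as_function == 'cd C:\\new'
```

With a string repl you would have to escape the backslash yourself (for example with re.escape or by doubling it) to get the literal path.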

If you pass a function to re.sub, each match is replaced with the string returned from that function. Basically, re.sub uses different code paths depending on whether you pass a function or a string. And yes, this is in fact desirable. Consider the case where you want to replace matches of foo with bar and matches of baz with qux. You can then write it as:

repdict = {'foo':'bar','baz':'qux'}
re.sub('foo|baz',lambda match: repdict[match.group(0)],'foo')

You could argue that you could do this in two passes, but that does not work if repdict looks like {'foo':'baz','baz':'qux'}, because the second pass would also rewrite the baz produced by the first.

And I don't think you can do that with backreferences (at least not easily).
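To make the two-pass problem concrete, here is a small sketch comparing both approaches. The sample text 'foo baz' and the names single_pass and two_pass are chosen only for illustration:

```python
import re

repdict = {'foo': 'baz', 'baz': 'qux'}
text = 'foo baz'

# One pass: each match is looked up once, against the original text.
single_pass = re.sub('foo|baz', lambda m: repdict[m.group(0)], text)

# Two passes: the baz created by the first pass is rewritten by the second.
two_pass = text.replace('foo', 'baz').replace('baz', 'qux')

# single_pass == 'baz qux'
# two_pass    == 'qux qux'
```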
