
Why does re.sub in Python not work correctly on this test case?

Try this code.

import re

test = ' az z bz z z stuff z  z '
re.sub(r'(\W)(z)(\W)', r'\1_\2\3', test)

This should replace every stand-alone z with _z.

However, the result is:

' az _z bz _z z stuff _z  _z '

You can see there's a z there that is missed. My theory is that the pattern can't use the space between the two z's for both matches at once (as the trailing whitespace of one match and the leading whitespace of the next). Is there a way to fix this?

If your goal is to make sure you only match z when it's a standalone word, use \b to match word boundaries without actually consuming the whitespace:

>>> re.sub(r'\b(z)\b', r'_\1', test)
' az _z bz _z _z stuff _z  _z '

You want to avoid capturing the whitespace. Try using the zero-width word boundary \b, like this:

re.sub(r'\bz\b', '_z', test)
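
As a small sketch of one further difference: \b also holds at the very start and end of the string, whereas the original (\W)(z)(\W) pattern needs an actual character on each side of the z. The sample string below is made up for illustration and is not from the question.

```python
import re

sample = 'z a z'  # hypothetical input: a z at each end of the string

# The original pattern needs a real non-word character on both sides,
# so neither z matches and the string comes back unchanged.
with_groups = re.sub(r'(\W)(z)(\W)', r'\1_\2\3', sample)

# \b is zero-width and also holds at the edges of the string.
with_boundary = re.sub(r'\bz\b', '_z', sample)

print(repr(with_groups))    # 'z a z'
print(repr(with_boundary))  # '_z a _z'
```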

The reason is that the two matches would have to overlap, and re.sub only replaces non-overlapping matches, so you need to avoid consuming the extra character. There are two ways to do this: one is \b, the word boundary, as suggested by others; the other is a lookbehind assertion combined with a lookahead assertion. (If reasonable, as it probably is, use \b instead of this solution. This is mainly here for educational purposes.)

>>> re.sub(r'(?<!\w)(z)(?!\w)', r'_\1', test)
' az _z bz _z _z stuff _z  _z '

(?<!\w) makes sure there is no \w character immediately before.

(?!\w) makes sure there is no \w character immediately after.

The special (?...) syntax means they aren't capturing groups, so the (z) is still \1.
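
A quick way to check both claims, as a sketch using the test string from the question: the compiled pattern reports a single capturing group, and each match spans only the z itself, since the neighbouring characters are inspected but not consumed.

```python
import re

test = ' az z bz z z stuff z  z '
pattern = re.compile(r'(?<!\w)(z)(?!\w)')

# Lookarounds do not create capturing groups, so only (z) is counted.
print(pattern.groups)  # 1

# Every match is exactly one character wide: just the z.
spans = [m.span() for m in pattern.finditer(test)]
print(spans)  # [(4, 5), (9, 10), (11, 12), (19, 20), (22, 23)]
```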


As for a graphical explanation of why it fails:

The regex is going through the string doing replacement; it's at these three characters:

' az _z bz z z stuff z  z '
          ^^^

It does that replacement. The trailing space has now been consumed by that match, so the next step is approximately this:

' az _z bz _z z stuff z  z '
              ^^^ <- It starts matching here.
             ^ <- Not this character, it's been consumed by the last match
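
The same thing can be seen in code. This is a sketch that uses re.finditer, which yields the same non-overlapping matches that re.sub replaces, to print the span of each match of the original pattern.

```python
import re

test = ' az z bz z z stuff z  z '

# Each span covers three characters: the character before the z,
# the z itself, and the character after it.
consuming = [m.span() for m in re.finditer(r'(\W)(z)(\W)', test)]
print(consuming)  # [(3, 6), (8, 11), (18, 21), (21, 24)]

# The skipped z sits at index 11. The space it would need as its leading
# \W is at index 10, which already belongs to the match (8, 11).
skipped = test[11]
print(repr(skipped))  # 'z'
```

The scan resumes at index 11, right after the previous match, so nothing is left to serve as the leading \W for that z.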

Use this:

import re

test = ' az z bz z z stuff z  z '
re.sub(r'\b(z)\b', r'_\1', test)


 