How to fix re.sub capturing in Python regex?

I am cleaning some data for text analysis that I extracted from PDFs. I have noticed that one of the errors is strange spacing in words that end in "y". Specifically, the final y is broken off from the word by a space, as in "theor y". I'm trying to use re.sub to identify these instances and then collapse the space.

I've been able to write what I think is a good regex string (see https://regex101.com/r/M1jpe6/5 ), but I'm not getting the results that I expect. I suspect that I'm missing something about the re.sub method.

Here is my toy code.

import re
string = 'this is my theor y of dance'
regex_y = r'\b\w*\b(\sy)\b'

new_string = re.sub(regex_y, 'y', string)
print(new_string)

What I expect to print from the above is

this is my theory of dance

but what it actually prints is

this is my y of dance

Since the only capturing group in my regex is (\sy) , I expect to substitute " y" with "y". Instead, it's clear that I'm matching on the bigger string "theor y" and then replacing that whole thing with "y".

Why is this happening when I'm only capturing (\sy) ? How do I write my re.sub string so it works as I intend?

Your example is a bit contrived, but if you wanted to remove whitespace before dangling y characters, I would use this:

import re
string = 'this is my theor y of dance'
string = re.sub(r'\b\s+y\b', 'y', string)
print(string)

this is my theory of dance

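The problem with using a capture group here is that re.sub replaces the entire match, not just the captured group. Your pattern matches "theor y" as a whole, so all of it is replaced by "y". With a capture-group approach, you would also need to capture the part of the match you want to keep (here, the word before the space) and put it back in the replacement with a backreference such as \1.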
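To make the difference between the whole match and the captured group concrete, here is a minimal sketch. The variable names (text, whole_match, with_backref, and so on) are only illustrative, and the two alternative patterns are one possible way to write them, not the only one. It first inspects what the pattern from the question actually matches, then shows a backreference version and a lookbehind version that both keep the preceding word.

```python
import re

text = 'this is my theor y of dance'

# The pattern from the question: group(0) is the whole match,
# group(1) is only the captured part. re.sub replaces group(0).
m = re.search(r'\b\w*\b(\sy)\b', text)
whole_match = m.group(0)
captured = m.group(1)

# Alternative 1: capture the word to keep and restore it with \1.
with_backref = re.sub(r'(\w+)\s+y\b', r'\1y', text)

# Alternative 2: a lookbehind checks for a preceding word character
# without making it part of the match, so only the whitespace and y
# are replaced.
with_lookbehind = re.sub(r'(?<=\w)\s+y\b', 'y', text)

print(repr(whole_match))
print(repr(captured))
print(with_backref)
print(with_lookbehind)
```

Note that any of these patterns will also join a genuine standalone "y" to the word before it, so it is worth checking the results on real data.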
