I am cleaning some data for text analysis that I extracted from PDFs. I have noticed that one of the errors is strange spacing in words that end in "y". Specifically, the final y is broken off from the word by a space: theor y. I'm trying to use re.sub to identify these instances and then collapse the space.
I've been able to write what I think is a good regex string (see https://regex101.com/r/M1jpe6/5 ), but I'm not getting the results that I expect. I suspect that I'm missing something about the re.sub method.
Here is my toy code.
import re
string = 'this is my theor y of dance'
regex_y = r'\b\w*\b(\sy)\b'
new_string = re.sub(regex_y, 'y', string)
print(new_string)
What I expect to print from the above is
this is my theory of dance
but what it actually prints is
this is my y of dance
Since the only capturing group in my regex is (\sy), I expect to substitute " y" with "y". Instead, it's clear that I'm matching on the bigger string theor y and then replacing that whole thing with y.
Why is this happening when I'm only capturing (\sy)? How do I write my re.sub call so it works as I intend?
Your example is a bit contrived, but if you wanted to remove whitespace before dangling y characters, I would use this:
import re
string = 'this is my theor y of dance'
# Match only the whitespace and the dangling y, so only that part is replaced
string = re.sub(r'\b\s+y\b', 'y', string)
print(string)
this is my theory of dance
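The problem is not the capture group itself. re.sub always replaces the entire match, not just the text captured by a group, so your pattern matches theor y as a whole and replaces all of it with y. You can either match only the part you want to change, as in the pattern I suggested, or capture the part you want to keep and put it back in the replacement with a backreference such as \1.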
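To make the difference between the full match and the capture group concrete, here is a minimal sketch. It reuses the sentence and the pattern from the question. The variable names and the two alternative patterns (a backreference and a lookbehind) are illustrative choices, not the only way to write this.

```python
import re

text = 'this is my theor y of dance'

# The question's pattern: group(0) is the whole match, group(1) is only the
# captured part. re.sub replaces group(0), which is why "theor" disappeared.
m = re.search(r'\b\w*\b(\sy)\b', text)
whole_match = m.group(0)
captured = m.group(1)

# Alternative 1: capture the word and re-insert it with the backreference \1.
joined_backref = re.sub(r'\b(\w+)\s+y\b', r'\1y', text)

# Alternative 2: a lookbehind asserts that a word character precedes the
# whitespace without making it part of the match.
joined_lookbehind = re.sub(r'(?<=\w)\s+y\b', 'y', text)

print(repr(whole_match))
print(repr(captured))
print(joined_backref)
print(joined_lookbehind)
```

Note that any of these patterns will also join a standalone "y" that is legitimately its own token (for example a variable name or the Spanish word "y"), so it may be worth checking the matches against a word list before applying the substitution to real data.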