My goal is to find a word in a text. The caveat is that I need to treat apostrophes as text.
Let me explain with an example. Say I'm looking for the word "don" in the text "don't trust don". I need to match the standalone "don" but not the "don" inside "don't".
I started with this regex: r'(?:\b)%s(?:\b)' % re.escape("don"), but it matches both occurrences of "don". I then tried r'(?:\b|\w\')%s(?:\b|\'\w)' % re.escape("don"), to no avail.
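A minimal script that reproduces the problem with both attempts (the variable names are only for illustration):

```python
import re

text = "don't trust don"
word = "don"

# First attempt: plain word boundaries on both sides.
first = re.compile(r'(?:\b)%s(?:\b)' % re.escape(word))

# Second attempt: alternations that also allow an apostrophe next to the word.
second = re.compile(r'(?:\b|\w\')%s(?:\b|\'\w)' % re.escape(word))

# Start offsets of every match. Offset 0 is the "don" inside "don't",
# which is the match I want to get rid of.
first_starts = [m.start() for m in first.finditer(text)]
second_starts = [m.start() for m in second.finditer(text)]

print(first_starts)
print(second_starts)
```

Both patterns report the same two offsets, so the second attempt changes nothing for this input.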
How do I make my regex treat apostrophes as text?
Edit: Some of the edge cases I did not mention: "'don" and "don'" are correct matches, whereas "t'don", "don't" and "'don'" are not.
Use a negative look-ahead assertion:
r'(?:\b)%s(?!\'\w)(?:\b)'
I've put this on regex101 with a demo.
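The negative lookahead makes the expression match don only if it is not immediately followed by '\w (an apostrophe and then a word character). Your version matches anyway because the alternation (?:\b|\'\w) can always succeed through its \b branch: there is a word boundary between the n and the apostrophe in don't.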
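As a rough sketch of how this could be used from Python (the helper name find_word is made up for this example and is not part of the question):

```python
import re

def find_word(word, text):
    """Return the start offsets of `word` when it is not followed by '\\w."""
    pattern = re.compile(r"(?:\b)%s(?!\'\w)(?:\b)" % re.escape(word))
    return [m.start() for m in pattern.finditer(text)]

# Only the standalone "don" at the end is reported.
print(find_word("don", "don't trust don"))

# The lookahead only guards the right-hand side, so a leading
# apostrophe form such as "t'don" is still reported.
print(find_word("don", "t'don"))
```

Because only the right-hand side is checked, this pattern on its own does not cover the t'don case from the question's edit.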
You could use something like this to treat every ' as a word character around your match:
r"(?<!')\b%s\b(?!')"
It uses a negative lookahead and a negative lookbehind to make sure that there is no ' on either side of the word you want to match.
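A quick way to see what this pattern accepts, assuming Python's re module (the sample strings below are chosen only for illustration):

```python
import re

word = "don"
# Reject the word whenever an apostrophe touches it on either side.
pattern = re.compile(r"(?<!')\b%s\b(?!')" % re.escape(word))

samples = ["don", "don't", "t'don", "'don'", "don'", "'don"]
# Map each sample to True if the pattern finds the word in it.
strict_results = {s: bool(pattern.search(s)) for s in samples}

for sample, matched in strict_results.items():
    print(repr(sample), matched)
```

Only the bare "don" is accepted here. Every form with an adjacent apostrophe is rejected, including don' and 'don.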
EDIT: After your edge cases, I would suggest this regex instead:
r"(?<!\w')(?<!'(?=%s'))\b%s\b(?!'\w)" % ((re.escape("don"),) * 2)
When matched against:
don't
o'don
'don'
don'
'don
Only the last two match.
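The list above can be checked with a short script. This is a sketch assuming Python's re module. The helper name build_pattern is hypothetical, and the pattern contains two %s placeholders, so the escaped word is supplied twice:

```python
import re

def build_pattern(word):
    """Compile the edge-case-aware pattern for `word`."""
    escaped = re.escape(word)
    # (?<!\w')        : not preceded by a word character plus apostrophe (o'don)
    # (?<!'(?=WORD')) : not wrapped in apostrophes on both sides ('don')
    # (?!'\w)         : not followed by an apostrophe plus word character (don't)
    return re.compile(r"(?<!\w')(?<!'(?=%s'))\b%s\b(?!'\w)" % (escaped, escaped))

pattern = build_pattern("don")
samples = ["don't", "o'don", "'don'", "don'", "'don"]
# Map each sample to True if the pattern finds the word in it.
results = {s: bool(pattern.search(s)) for s in samples}

for sample, matched in results.items():
    print(repr(sample), matched)
```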
EDIT2: If you still want to match strings ending or beginning in ', then I would advise stepping back to the 'old way' of getting word boundaries too, i.e. trying to match spaces and the beginning/end of lines:
(?<!\w')(?<!'(?=%s'))(?<=\b|^|\s)%s(?=\b|^|\s)(?!'\w)