
Python regular expression: treat apostrophe as text

My goal is to find a word in a text. The caveat is that I need to treat apostrophes as text.

Let me explain with an example. Let's say I'm looking for the word don in the text: don't trust don. I need to match don but not don't.

I started with this regex: r'(?:\b)%s(?:\b)' % re.escape("don"), but it matches both occurrences of don. I then tried r'(?:\b|\w\')%s(?:\b|\'\w)' % re.escape("don"), to no avail.
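To make the problem concrete, here is a minimal sketch of the first attempt (the variable names are only illustrative, and the redundant non-capturing groups around \b are dropped). Because \b sees the apostrophe as a non-word character, there is a boundary between n and ', so the don inside don't is reported as well:

```python
import re

text = "don't trust don"
word = "don"

# \b fires between "n" and "'" because the apostrophe is not a word character,
# so both the "don" in "don't" and the standalone "don" are found.
pattern = re.compile(r"\b%s\b" % re.escape(word))
starts = [m.start() for m in pattern.finditer(text)]
print(starts)  # [0, 12]
```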

How do I make my regex treat apostrophes as text?

Edit: Some edge cases I did not mention: 'don and don' are correct matches, whereas t'don, don't and 'don' are not.

Use a negative look-ahead assertion:

r'(?:\b)%s(?!\'\w)(?:\b)'

I've put this on regex101 with a demo.

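The negative lookahead makes the expression match don only if it is not immediately followed by '\w (an apostrophe and then a word character). Your version matches anyway, because both of your options in (?:\b|\'\w) match after the don in don't: \b already succeeds between n and the apostrophe, so the alternation never rejects anything.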
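A quick way to check this in Python is sketched below. The helper name find_word is hypothetical, not part of the answer. Note that this pattern only guards the right-hand side, so t'don from the question's edit would still match:

```python
import re

def find_word(word, text):
    # Hypothetical helper: substitute the escaped word into the pattern above
    # and return the start offset of every match.
    pattern = r'(?:\b)%s(?!\'\w)(?:\b)' % re.escape(word)
    return [m.start() for m in re.finditer(pattern, text)]

print(find_word("don", "don't trust don"))  # [12]
print(find_word("don", "t'don"))            # [2]
```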

You could use something like this to treat all ' as word characters within your match:

r"(?<!')\b%s\b(?!')"

It uses a negative lookahead and a negative lookbehind to make sure that there is no ' immediately before or after the word you want to match.

regex101 demo
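As a rough illustration (the helper name find_strict is made up for this sketch), the pattern rejects any occurrence that touches an apostrophe on either side, which also means that 'don and don' are not matched:

```python
import re

def find_strict(word, text):
    # Hypothetical helper: no apostrophe may sit directly before or after the word.
    pattern = r"(?<!')\b%s\b(?!')" % re.escape(word)
    return [m.start() for m in re.finditer(pattern, text)]

print(find_strict("don", "don't trust don"))  # [12]
print(find_strict("don", "'don"))             # []
print(find_strict("don", "don'"))             # []
```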


EDIT: After your edge cases, I would suggest this regex instead:

r"(?<!\w')(?<!'(?=%s'))\b%s\b(?!'\w)" % (re.escape("don"), re.escape("don"))

regex101 demo

When matched against:

don't
o'don
'don'
don'
'don

Only the last two match.
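The same check can be reproduced in Python with a small sketch like the one below. The helper name edge_pattern is an assumption of this example. Since the pattern contains two %s placeholders, the escaped word is passed twice:

```python
import re

def edge_pattern(word):
    # Hypothetical helper: the pattern has two %s slots, so supply the word twice.
    w = re.escape(word)
    return re.compile(r"(?<!\w')(?<!'(?=%s'))\b%s\b(?!'\w)" % (w, w))

cases = ["don't", "o'don", "'don'", "don'", "'don"]
pattern = edge_pattern("don")
results = {case: bool(pattern.search(case)) for case in cases}
for case, matched in results.items():
    print(case, matched)
# don't False
# o'don False
# 'don' False
# don' True
# 'don True
```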


EDIT2: If you still want to match strings that begin or end with ', then I would advise falling back to the 'old way' of finding word boundaries as well, i.e. trying to match whitespace and the beginning/end of lines:

(?<!\w')(?<!'(?=%s'))(?<=\b|^|\s)%s(?=\b|^|\s)(?!'\w)

Previous sentence demo

New test case demo
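One caveat if you run the EDIT2 pattern with Python's built-in re module: its look-behind must be fixed-width, and (?<=\b|^|\s) mixes zero-width branches with a one-character branch, so it does not compile there. The sketch below is one possible workaround, not the answer's own code. It moves the alternation outside the look-behind, which should be equivalent because \b and ^ are already zero-width, and leaves the rest of the pattern untouched. The names compiles and loose_pattern are hypothetical.

```python
import re

def compiles(pattern):
    # Report whether Python's re module accepts the pattern.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Branches of width 0 (\b, ^) and width 1 (\s) inside one look-behind are rejected.
original_ok = compiles(r"(?<=\b|^|\s)don")
print(original_ok)  # False

def loose_pattern(word):
    # Assumed rewrite: (?<=\b|^|\s) becomes (?:\b|^|(?<=\s)); everything else is unchanged.
    w = re.escape(word)
    return re.compile(
        r"(?<!\w')(?<!'(?=%s'))(?:\b|^|(?<=\s))%s(?=\b|^|\s)(?!'\w)" % (w, w)
    )

# A search term that itself begins with an apostrophe can now be found.
starts = [m.start() for m in loose_pattern("'don").finditer("trust 'don")]
print(starts)  # [6]
```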
