Regex to match words followed by whitespace or punctuation

If I have the word india

MATCHES "india." "india!" "india." "india" "india." "india!" "india." "india"

NON MATCHES "indian" "indiana"

Basically, I want to match the string, but not when it's contained within another word.

After doing some research, I started with

exp = r"(?<!\S)india(?!\S)"
num_matches = len(re.findall(exp, text))

but that doesn't match the punctuation and I'm not sure where to add that in.

Assuming the objective is to match a given word (e.g., "india") in a string, provided the word is neither preceded nor followed by a character that is not in the string " .,?!;", you could use the following regex:

(?<![^ .,?!;])india(?![^ .,?!;\r\n])

Python's regex engine performs the following operations:

(?<!             # begin a negative lookbehind
  [^ .,?!;]      # match 1 char other than those in " .,?!;"
)                # end the negative lookbehind
india            # match string
(?!              # begin a negative lookahead   
  [^ .,?!;\r\n]  # match 1 char other than those in " .,?!;\r\n"
)                # end the negative lookahead

Notice that the character class in the negative lookahead contains \r and \n in case india is at the end of a line.
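
As a minimal sketch, the pattern can be checked against the examples from the question. The helper name `matches_word` and the sample list are assumptions made for illustration only.

```python
import re

# Pattern from the answer above: "india" must not touch any character
# outside the allowed set (space and . , ? ! ; plus \r and \n on the right).
PATTERN = re.compile(r"(?<![^ .,?!;])india(?![^ .,?!;\r\n])")

def matches_word(text):
    """Return True if the pattern is found anywhere in text (hypothetical helper)."""
    return PATTERN.search(text) is not None

# Samples taken from the question, plus one sentence for context.
for sample in ["india.", "india!", "india", "I like india, a lot", "indian", "indiana"]:
    print(repr(sample), matches_word(sample))
```

The first four samples should report True and the last two False, since "n" is not in the allowed set.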

Try with:

r'\bindia\W*\b'

To ignore case:

re.search(r'\bindia\W*\b', my_string, re.IGNORECASE).group(0)
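
Note that `re.search` returns `None` when nothing matches, so calling `.group(0)` directly would raise `AttributeError` in that case. A minimal sketch that guards against this follows; the function name `find_india` and the sample strings are hypothetical.

```python
import re

def find_india(my_string):
    """Return the matched text, or None when the word is not present."""
    match = re.search(r'\bindia\W*\b', my_string, re.IGNORECASE)
    # Only call .group(0) when a match object was actually returned.
    return match.group(0) if match else None

print(repr(find_india("India is large")))
print(repr(find_india("INDIA!")))
print(repr(find_india("Indiana")))
```

Because of the trailing `\b`, the `\W*` part only keeps non-word characters when another word follows them, so punctuation at the very end of the string is not included in the match.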

You may use:

import re

s = "india."
s1 = "indiana"
print(re.search(r'\bindia[.!?]*\b', s))
print(re.search(r'\bindia[.!?]*\b', s1))

output:

<re.Match object; span=(0, 5), match='india'>
None

\"india(\W*?)\"

The \W part will catch any characters after the word except letters, digits, and underscores.

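Try this:

^india[^a-zA-Z0-9]$

^ - anchors the match at the start of the string, so it must begin with india

[^a-zA-Z0-9] - exactly one character that is not a-z, A-Z, or 0-9

$ - anchors the match at the end of the string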
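
A quick check of this pattern, as a sketch using sample strings from the question. Since the character class is not quantified, the pattern expects exactly one trailing non-alphanumeric character, so a bare "india" is not matched.

```python
import re

pattern = re.compile(r'^india[^a-zA-Z0-9]$')

# Print whether each sample string matches the anchored pattern.
for sample in ["india.", "india!", "india", "indian"]:
    print(repr(sample), bool(pattern.search(sample)))
```

Appending `?` or `*` to the character class would be one way to also accept the bare word, if that is wanted.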

If you also want to match the punctuation, you could make use of a negated character class that matches any char except a word character or a newline.

(?<!\S)india[^\w\r\n]*(?!\S)
  • (?<!\S) Assert a whitespace boundary to the left
  • india Match literally
  • [^\w\r\n]* Match 0+ times any char except a word char or a newline
  • (?!\S) Assert a whitespace boundary to the right
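
Tying this back to the counting code in the question, here is a minimal sketch using `re.findall`. The sample `text` is a hypothetical input, not taken from the original post.

```python
import re

# Whitespace boundaries on both sides, with optional trailing punctuation.
exp = r"(?<!\S)india[^\w\r\n]*(?!\S)"

text = "india. indian india! indiana india"  # hypothetical sample text
matches = re.findall(exp, text)
num_matches = len(matches)
print(matches, num_matches)  # ['india.', 'india!', 'india'] 3
```

The punctuation is part of each match, while "indian" and "indiana" are skipped because a word character follows the word.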
