
Inverse regex match on group in Python

I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.

Given a list of words, I want to print all the words that do not have special characters.

I have a regex which identifies words with special characters: \\w*[\À-\ǚ']\\w*. I've seen a lot of answers with fairly straightforward scenarios like a simple word. However, I haven't been able to find anything that negates a group. I've seen several different sets of syntax that include the negative lookahead ?!, but I haven't been able to come up with a syntax that works with it.

In my case, given a string like "should print nŌt thìs", the output should contain should and print but not the other two words.

re.findall("(\\w*[\À-\ǚ']\\w*)", paragraph.text) gives the words with special characters; I just want to invert that.

For this particular case, you can simply specify the regular alphabet range in your search:

import re

a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']

Of course you can add digits or anything else you want to match as well.
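As a minimal sketch of that extension, the character class below also accepts digits and the apostrophe. The sample sentence and the name ascii_word are illustrative assumptions, not part of the question.

```python
import re

# ASCII letters, digits, and apostrophes only. Which extra characters to
# allow is an illustrative choice, not something the question specifies.
ascii_word = re.compile(r"\b[A-Za-z0-9']+\b")

sample = "route 66 isn't nŌt thìs"  # made-up input for demonstration
plain_words = ascii_word.findall(sample)
print(plain_words)
```

Words containing Ō or ì are still skipped, because \b cannot match between two word characters and so the pattern never finds a boundary inside them.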

As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can write:

r"\b(?!\w*[À-ǚ])\w+"

This:

  • Checks for a word boundary \b, that is, the position between a word character and a non-word character such as a space, or the start of the input string.
  • Does the negative lookahead and stops the match if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check for the special character being the first letter in the word.
  • Finally, if it makes it past the lookahead, it matches one or more word characters. (With \w* here instead, re.findall would also return empty strings at the end of each word.)
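The sketch below runs the lookahead pattern through re.findall on the question's sample string. It ends the pattern with \w+, since a trailing \w* would also let re.findall report zero-length matches at word ends. The variable names are illustrative.

```python
import re

text = "should print nŌt thìs"

# \b            start at a word boundary
# (?!\w*[À-ǚ])  reject the word if a character in À-ǚ occurs anywhere in it
# \w+           otherwise consume the word (at least one character)
no_special = re.compile(r"\b(?!\w*[À-ǚ])\w+")

matches = no_special.findall(text)
print(matches)  # ['should', 'print']
```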

Note that on regex101.com you must select the Python flavor for \b to work properly with special characters.

There is a third option as well:

r"\b[^À-ǚ\s]+\b"

The middle part [^À-ǚ\s]+ means match one or more characters other than special characters or whitespace. Using + rather than * keeps the pattern from matching an empty string at a word boundary.
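A quick check of the negated-class approach, as a sketch. It uses + as the quantifier so that no empty strings come back, and the second input is an invented example showing that the negated class also lets punctuation such as hyphens through.

```python
import re

# Anything except characters in À-ǚ and whitespace, bounded by \b on both sides.
negated = re.compile(r"\b[^À-ǚ\s]+\b")

first = negated.findall("should print nŌt thìs")
print(first)  # ['should', 'print']

# The class is defined by exclusion, so a hyphen is accepted as well.
second = negated.findall("well-known café")  # made-up input
print(second)  # ['well-known']
```

If only letters should be matched, the explicit [A-Za-z] range from the first approach is the stricter choice.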

I know this is not a regex, just a completely different idea that you may not have considered. I suppose it would also be much slower, but I think it works:

>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']
...  if any('WITH' in ud.name(letter, '') for letter in word)]
['Cá', 'Lá']

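To reverse it, negate the whole test with not any(...). Simply swapping in 'WITH' not in inside any() would keep every word that has at least one unaccented letter, which is not the inverse.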
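Applied to the question's sample string, the inverted check could look like the sketch below. The helper name has_modified_letter is hypothetical, and the test relies on the Unicode naming habit of describing accented letters as "... WITH ...", so it is a heuristic rather than a strict definition of "special character".

```python
import unicodedata as ud


def has_modified_letter(word):
    """Return True if any character's Unicode name contains 'WITH'.

    Hypothetical helper. ud.name(ch, "") returns "" for unnamed
    characters instead of raising ValueError.
    """
    return any("WITH" in ud.name(ch, "") for ch in word)


words = "should print nŌt thìs".split()

# Keep only the words in which no character has a modifier in its name.
plain = [w for w in words if not has_modified_letter(w)]
print(plain)  # ['should', 'print']
```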
