
How can I use re.findall to find the words that are NOT all uppercase letters?

For example I have s="I REALLY don't want to talk about it, not at all!"

I want re.findall(reg, s) to return "I" "don't" "want" "to" "talk" "about" "it" "," "not" "at" "all" "!"

So far I have reg=r'[^\w\s]+|\w+|\n', which cannot filter out the word "REALLY".

thanks

The \w+ pattern matches one or more word characters, so it also matches words in ALLCAPS.

Note that the pronoun "I" is also ALLCAPS. Thus, assuming you want to skip only ALLCAPS words of 2 or more letters, you may consider fixing your current pattern as

r'[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n'

The \b(?![A-Z]{2,}\b)\w+ pattern matches

  • \b - a word boundary
  • (?![A-Z]{2,}\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there are 2 or more ASCII uppercase letters followed by a word boundary
  • \w+ - 1 or more word chars (if you only want to match letters, replace with [^\W\d_]+ ).
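As a quick check, here is a minimal sketch that runs the pattern on the string from the question. One thing it shows is that the apostrophe in "don't" is matched by the [^\w\s]+ branch, so that word comes back as three tokens. REG_APOS is a hypothetical variant (an assumption, not part of the pattern above) that treats the apostrophe as a word character to keep such words whole.

```python
import re

# Pattern from the answer above: punctuation runs, words that are not
# ALLCAPS (2+ ASCII uppercase letters), or a newline.
REG = r"[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n"

# Hypothetical variant (an assumption, not part of the answer): treat the
# apostrophe as a word character so that "don't" stays in one piece.
REG_APOS = r"[^\w\s']+|\b(?![A-Z]{2,}\b)[\w']+|\n"

s = "I REALLY don't want to talk about it, not at all!"

tokens = re.findall(REG, s)
tokens_apos = re.findall(REG_APOS, s)

print(tokens)
# ['I', 'don', "'", 't', 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']
print(tokens_apos)
# ['I', "don't", 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']
```

In both cases "REALLY" is skipped while the single-letter "I" is kept, because the lookahead only rejects runs of 2 or more uppercase letters.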

To support all Unicode uppercase letters, you may use the PyPI regex module with the r'[^\w\s]+|\b(?!\p{Lu}{2,}\b)\w+|\n' pattern, or build the class using pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()])) (Python 3) or pLu = u'[{}]'.format(u"".join([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()])) (Python 2). See Python regex for unicode capitalized words . Note that I'd recommend sticking to the latest Python versions or the latest PyPI regex module.
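A minimal Python 3 sketch of the second option, using only the standard library. The names P_UPPER, REG_ASCII and REG_UNICODE and the sample sentence are illustrative assumptions. Note that str.isupper() is not necessarily identical to the \p{Lu} category, so treat the generated class as an approximation.

```python
import re
import sys

# Character class of every code point that Python reports as uppercase.
# Built once at import time; this scans the whole Unicode range.
P_UPPER = "[{}]".format(
    "".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper())
)

# ASCII-only lookahead (as in the answer) vs. the generated Unicode class.
REG_ASCII = r"[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n"
REG_UNICODE = r"[^\w\s]+|\b(?!" + P_UPPER + r"{2,}\b)\w+|\n"

text = "Я ОЧЕНЬ не хочу"  # sample sentence chosen for illustration

ascii_tokens = re.findall(REG_ASCII, text)
unicode_tokens = re.findall(REG_UNICODE, text)

print(ascii_tokens)    # ['Я', 'ОЧЕНЬ', 'не', 'хочу']
print(unicode_tokens)  # ['Я', 'не', 'хочу']
```

The ASCII lookahead does not recognize the Cyrillic capitals, so the all-uppercase word slips through; the generated class catches it.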

This quote by Brian Kernighan is especially true for regular expressions.

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?

So if something is difficult to do in a single regular expression, you might want to split it into two steps. Find all words first, and then filter out the all uppercase words. Easier to understand and easier to test.

>>> import re
>>> s="I REALLY don't want to talk about it, not at all!"
>>> words = re.findall(r"[\w']+", s)
>>> words = [w for w in words if w.upper() != w]
>>> print(words)
["don't", 'want', 'to', 'talk', 'about', 'it', 'not', 'at', 'all']
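Note that the session above also drops "I" and the punctuation, which the question wanted to keep. The following is a sketch of the same two-step idea adjusted for that, under two assumptions of mine: punctuation runs are tokenized separately, and only words of 2 or more characters count as ALLCAPS. The helper name is_allcaps is made up for this example.

```python
import re

s = "I REALLY don't want to talk about it, not at all!"

# Step 1: split into words (apostrophes allowed) and punctuation runs.
tokens = re.findall(r"[\w']+|[^\w\s]+", s)


def is_allcaps(token):
    """True for tokens of 2+ characters whose cased characters are all uppercase."""
    return len(token) > 1 and token.isupper()


# Step 2: drop only the ALLCAPS words. Punctuation has no cased
# characters, so isupper() is False for it and it is kept.
result = [t for t in tokens if not is_allcaps(t)]

print(result)
# ['I', "don't", 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']
```

Using str.isupper() instead of comparing with w.upper() is what keeps "," and "!" in the result, since ",".upper() == "," would otherwise classify them as uppercase.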

