How to use re.findall to find the words that are NOT all uppercase letters?

Question

For example I have s="I REALLY don't want to talk about it, not at all!"

I want re.findall(reg, s) to return "I" "don't" "want" "to" "talk" "about" "it" "," "not" "at" "all" "!"

So far I have reg=r'[^\w\s]+|\w+|\n', which cannot filter out the word "REALLY".

thanks

Answer 1

The \w+ pattern matches one or more word characters, so it also matches words in ALLCAPS.

Note that "I", a pronoun, is also ALLCAPS. Thus, assuming you want to skip all ALLCAPS words of 2 or more letters, you may consider fixing your current pattern as

r'[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n'

See the regex demo

The \b(?![A-Z]{2,}\b)\w+ pattern matches:

\b - a word boundary
(?![A-Z]{2,}\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there are 2 or more ASCII uppercase letters followed by a word boundary
\w+ - 1 or more word chars (if you only want to match letters, replace with [^\W\d_]+).
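
To see how the pattern above behaves on the string from the question, here is a minimal sketch. Note that \w does not match an apostrophe, so "don't" comes back in three pieces. The second pattern, reg_apos, is a hypothetical variant that is not part of the original answer: it assumes you want apostrophe-joined parts kept inside one word.

```python
import re

s = "I REALLY don't want to talk about it, not at all!"

# Pattern from the answer: skips ALLCAPS words of 2 or more ASCII letters.
reg = r'[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n'
tokens = re.findall(reg, s)
print(tokens)
# ['I', 'don', "'", 't', 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']

# Hypothetical variant (an assumption, not from the answer):
# (?:'\w+)* lets a word absorb apostrophe-joined parts such as 't.
reg_apos = r"[^\w\s]+|\b(?![A-Z]{2,}\b)\w+(?:'\w+)*|\n"
tokens_apos = re.findall(reg_apos, s)
print(tokens_apos)
# ['I', "don't", 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']
```

The variant still treats an ALLCAPS word with an apostrophe (such as DON'T) piecewise, so check it against your own data before relying on it.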

To support all Unicode uppercase letters, you may use the PyPI regex module with the r'[^\w\s]+|\b(?!\p{Lu}{2,}\b)\w+|\n' pattern, or build the class yourself using pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()])) (Python 3) or pLu = u'[{}]'.format(u"".join([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()])) (Python 2). See "Python regex for unicode capitalized words". Note that I'd recommend sticking to the latest Python versions or the latest PyPI regex module.
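
As a rough sketch of the standard-library route in Python 3, the class can be built once and dropped into the lookahead. The Russian sample sentence and the names reg_u and tokens_u are made up for illustration. The class contains whatever str.isupper() reports as uppercase, which may not be exactly the same set as \p{Lu}.

```python
import re
import sys

# Character class of every code point that str.isupper() reports as uppercase.
# range(... + 1) also covers the last code point; this takes a moment to build.
pLu = '[{}]'.format(''.join(
    chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()
))

# Same pattern as above, with [A-Z] swapped for the generated class.
reg_u = r'[^\w\s]+|\b(?!' + pLu + r'{2,}\b)\w+|\n'

# Made-up sample text (an assumption for illustration only).
sample = "Я ОЧЕНЬ не хочу об этом говорить!"
tokens_u = re.findall(reg_u, sample)
print(tokens_u)
# ['Я', 'не', 'хочу', 'об', 'этом', 'говорить', '!']
```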

Answer 2

This quote by Brian Kernighan is especially true for regular expressions.

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?

So if something is difficult to do in a single regular expression, you might want to split it into two steps. Find all the words first, and then filter out the all-uppercase ones. That is easier to understand and easier to test.

>>> import re
>>> s="I REALLY don't want to talk about it, not at all!"
>>> words = re.findall(r"[\w']+", s)
>>> words = [w for w in words if w.upper() != w]
>>> print(words)
["don't", 'want', 'to', 'talk', 'about', 'it', 'not', 'at', 'all']
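
The session above drops "I" and the punctuation, which the question wanted to keep. A possible adaptation of the same two-step idea is sketched below. The function name non_shouting_tokens and the rule "drop only all-uppercase tokens longer than one character" are assumptions made for illustration, not part of the original answer.

```python
import re

def non_shouting_tokens(text):
    """Tokenise first, then filter: a two-step sketch (hypothetical helper)."""
    # Step 1: words (apostrophes allowed) or runs of punctuation.
    tokens = re.findall(r"[\w']+|[^\w\s]+", text)
    # Step 2: drop tokens that are all uppercase and longer than one character,
    # so the pronoun "I" and punctuation survive.
    return [t for t in tokens if not (len(t) > 1 and t.isupper())]

s = "I REALLY don't want to talk about it, not at all!"
result = non_shouting_tokens(s)
print(result)
# ['I', "don't", 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']
```

Because str.isupper() is Unicode-aware, this filter also handles non-ASCII uppercase words without building a character class.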
