
How to use re.findall to find the words that are NOT all uppercase letters?

For example, I have s="I REALLY don't want to talk about it, not at all!"

I want re.findall(reg, s) to return "I" "don't" "want" "to" "talk" "about" "it" "," "not" "at" "all" "!"

So far I have reg=r'[^\w\s]+|\w+|\n', which cannot filter out the word "REALLY".

Thanks.

The \w+ pattern matches 1 or more word chars of any kind, so it also matches words in ALLCAPS.

Note that I, a pronoun, is also ALLCAPS. Thus, assuming you want to skip all ALLCAPS words of 2 or more letters, you may consider fixing your current pattern as

r'[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n'

See the regex demo.

The \b(?![A-Z]{2,}\b)\w+ pattern matches

  • \b - a word boundary
  • (?![A-Z]{2,}\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there are 2 or more ASCII uppercase letters followed by a word boundary
  • \w+ - 1 or more word chars (if you only want to match letters, replace it with [^\W\d_]+).
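
As a quick check, the pattern above can be run against the sample string from the question. Because an apostrophe is not a word char, \w+ splits "don't" into three tokens. The second pattern below is an assumed variant, not part of the answer, that appends (?:'\w+)* so contractions stay whole; the names answer_pattern and contraction_pattern are only illustrative.

```python
import re

s = "I REALLY don't want to talk about it, not at all!"

# Pattern from the answer: punctuation runs, words that are not
# 2+ ASCII capitals, or newlines.
answer_pattern = r"[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n"

# Assumed variant: let a word continue across apostrophes so that
# contractions such as "don't" come back as a single token.
contraction_pattern = r"[^\w\s]+|\b(?![A-Z]{2,}\b)\w+(?:'\w+)*|\n"

print(re.findall(answer_pattern, s))
# ['I', 'don', "'", 't', 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']

print(re.findall(contraction_pattern, s))
# ['I', "don't", 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']
```

The variant returns exactly the list the question asks for, while "REALLY" is skipped by both patterns.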

To support all Unicode uppercase letters, you may use the PyPI regex module with the r'[^\w\s]+|\b(?!\p{Lu}{2,}\b)\w+|\n' pattern, or build the class yourself using pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()])) (Python 3) or pLu = u'[{}]'.format(u"".join([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()])) (Python 2). See "Python regex for unicode capitalized words". Note that I'd recommend sticking to the latest Python versions or the latest PyPI regex module.
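
A minimal sketch of the "build the class" option with only the standard library, in Python 3. The French sample string and the names ascii_pattern and unicode_pattern are assumptions made for illustration. Note that str.isupper() is slightly broader than \p{Lu}, so the two approaches may differ on a few unusual code points.

```python
import re
import sys

# Character class holding every code point that str.isupper() accepts.
# Building it scans the whole Unicode range once, so do it at import time.
pLu = "[{}]".format(
    "".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper())
)

# Same structure as the ASCII pattern, with [A-Z] swapped for pLu.
ascii_pattern = r"[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n"
unicode_pattern = r"[^\w\s]+|\b(?!" + pLu + r"{2,}\b)\w+|\n"

sample = "ÉTÉ chaud, OK"  # hypothetical input with a non-ASCII ALLCAPS word

print(re.findall(ascii_pattern, sample))    # keeps 'ÉTÉ': É is outside [A-Z]
print(re.findall(unicode_pattern, sample))  # skips both 'ÉTÉ' and 'OK'
```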

This quote by Brian Kernighan is especially true for regular expressions.

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?

So if something is difficult to do in a single regular expression, you might want to split it into two steps. Find all words first, and then filter out the all-uppercase words. This is easier to understand and easier to test.

>>> import re
>>> s="I REALLY don't want to talk about it, not at all!"
>>> words = re.findall(r"[\w']+", s)
>>> words = [w for w in words if w.upper() != w]
>>> print(words)
["don't", 'want', 'to', 'talk', 'about', 'it', 'not', 'at', 'all']
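
Note that the session above also drops "I" and the punctuation, which the question wanted to keep. One possible adjustment of the same two-step idea, under the assumption that only ALLCAPS tokens of 2 or more characters should go, is sketched below; the function name non_shouted_tokens is hypothetical.

```python
import re

def non_shouted_tokens(text):
    """Return words and punctuation runs, minus ALLCAPS words of 2+ chars."""
    # Step 1: tokenize into words (apostrophes allowed) or punctuation runs.
    tokens = re.findall(r"[\w']+|[^\w\s]+", text)
    # Step 2: drop tokens that are entirely uppercase and at least 2 long.
    # str.isupper() is False for punctuation, so "," and "!" survive.
    return [t for t in tokens if not (len(t) >= 2 and t.isupper())]

s = "I REALLY don't want to talk about it, not at all!"
print(non_shouted_tokens(s))
# ['I', "don't", 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']
```

The length check is what keeps the one-letter pronoun "I", mirroring the {2,} quantifier in the single-regex answer.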
