
Using regex to find all phrases that are completely capitalized

I want to use a regex to match all substrings that are completely capitalized, including the spaces between words.

Right now I am using the regex \w*[A-Z]\s] on this input:

HERE IS Test WHAT ARE WE SAYING

Which returns:

HERE
IS
WHAT
ARE 
WE
SAYING

However, I would like it to match every run of consecutive all-caps words as a single substring, so that it returns:

HERE IS 
WHAT ARE WE SAYING 

You can use word boundaries (\b) and [^\s] to keep a match from starting or ending with a space. Put together, it might look like this:

import re
string = "HERE IS Test WHAT ARE WE SAYING is that OKAY"

# Compile once, then collect every all-caps phrase in the string
pattern = re.compile(r"\b[^\s][A-Z\s]+[^\s]\b")
pattern.findall(string)

>>> ['HERE IS', 'WHAT ARE WE SAYING', 'OKAY']
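
One caveat with this pattern: [^\s] accepts any non-space character, not only uppercase letters, and the three parts together need at least three characters. The sketch below illustrates both effects and tries a stricter variant; the names `loose` and `strict` and the inputs are made up for illustration, and the variant is only one possible tightening.

```python
import re

# Pattern from the answer above: the end characters may be ANY non-space.
loose = re.compile(r"\b[^\s][A-Z\s]+[^\s]\b")

# Hypothetical stricter variant: the end characters must be uppercase too.
strict = re.compile(r"\b[A-Z][A-Z\s]+[A-Z]\b")

text = "I AM x"
loose_matches = loose.findall(text)    # the lowercase "x" is pulled in
strict_matches = strict.findall(text)  # stops at the last uppercase letter

# Both patterns need at least three characters, so a lone "A" is skipped.
single_letter = loose.findall("A")

print(loose_matches)   # ['I AM x']
print(strict_matches)  # ['I AM']
print(single_letter)   # []
```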

One option is to use re.split with the pattern \s*(?:\w*[^A-Z\s]\w*\s*)+:

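import re

# Renamed from "input" so the built-in function is not shadowed
text = "HERE IS Test WHAT ARE WE SAYING"
# Raw string keeps \s and \w from being treated as string escapes
parts = re.split(r'\s*(?:\w*[^A-Z\s]\w*\s*)+', text)
print(parts)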

['HERE IS', 'WHAT ARE WE SAYING']

The idea here is to split on any run of consecutive words that each contain at least one character that is not an uppercase letter, together with the whitespace around that run.
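
When the text begins or ends with a non-capitalized word, re.split returns an empty string at that end. A minimal sketch of filtering those out follows; the helper name `caps_phrases_by_split` and the sample sentence are assumptions for illustration, not part of the original answer.

```python
import re

# Same split pattern as above, compiled once.
SPLITTER = re.compile(r"\s*(?:\w*[^A-Z\s]\w*\s*)+")

def caps_phrases_by_split(text):
    """Hypothetical helper: split, then drop the empty edge strings."""
    return [part for part in SPLITTER.split(text) if part]

sample = "we SAID Hello WORLD ok"
raw_parts = SPLITTER.split(sample)           # empty strings at both ends
clean_parts = caps_phrases_by_split(sample)  # only the all-caps phrases

print(raw_parts)    # ['', 'SAID', 'WORLD', '']
print(clean_parts)  # ['SAID', 'WORLD']
```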

You could use findall:

import re

text = 'HERE IS Test WHAT ARE WE SAYING'
print(re.findall(r'[\sA-Z]+(?![a-z])', text))

Output

['HERE IS ', ' WHAT ARE WE SAYING']

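The pattern [\sA-Z]+(?![a-z]) matches any run of whitespace and uppercase letters that is not immediately followed by a lowercase letter. The notation (?![a-z]) is known as a negative lookahead (see Regular Expression Syntax in the Python documentation). Note that the matches can keep a leading or trailing space.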
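
Because the character class includes whitespace, the matches can carry surrounding spaces, and a match can even consist of whitespace alone when a space is followed by something other than a letter. A small sketch of cleaning that up with str.strip follows; the second input string is an invented example.

```python
import re

pattern = re.compile(r"[\sA-Z]+(?![a-z])")

text = "HERE IS Test WHAT ARE WE SAYING"
raw = pattern.findall(text)
# Trim each match and discard any that were whitespace only.
phrases = [m.strip() for m in raw if m.strip()]

# Invented input: the space before "42" matches on its own.
raw_digits = pattern.findall("total 42 OK")
phrases_digits = [m.strip() for m in raw_digits if m.strip()]

print(raw)             # ['HERE IS ', ' WHAT ARE WE SAYING']
print(phrases)         # ['HERE IS', 'WHAT ARE WE SAYING']
print(raw_digits)      # [' ', ' OK']
print(phrases_digits)  # ['OK']
```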

You can use [A-Z ]+ to match capital letters and spaces, and use a negative lookahead (?! ) and a negative lookbehind (?<! ) to forbid the first and last character from being a space.

Finally, surrounding the pattern with \b to match word boundaries will make it only match full words.

import re
text = "A ab ABC ABC abc Abc aBc abC C"
pattern = r'\b(?! )[A-Z ]+(?<! )\b'

re.findall(pattern, text)
>>> ['A', 'ABC ABC', 'C']
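
To see what each piece contributes, the sketch below builds the pattern up in three steps on the sentence from the question. The variable names `bare`, `bounded`, and `full` are only illustrative.

```python
import re

text = "HERE IS Test WHAT ARE WE SAYING"

# Step 1: character class only. It swallows the "T" of "Test".
bare = re.findall(r"[A-Z ]+", text)

# Step 2: add word boundaries. Partial words go, edge spaces remain.
bounded = re.findall(r"\b[A-Z ]+\b", text)

# Step 3: add the lookarounds so a match cannot start or end on a space.
full = re.findall(r"\b(?! )[A-Z ]+(?<! )\b", text)

print(bare)     # ['HERE IS T', ' WHAT ARE WE SAYING']
print(bounded)  # ['HERE IS ', ' WHAT ARE WE SAYING']
print(full)     # ['HERE IS', 'WHAT ARE WE SAYING']
```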

You can also use the following method:

>>> import re
>>> s = 'HERE IS Test WHAT ARE WE SAYING'
>>> print(re.findall(r'((?!\s+)[A-Z\s]+(?![a-z]+))', s))

OUTPUT:

['HERE IS ', 'WHAT ARE WE SAYING']

Using findall() without matching leading and trailing spaces:

re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b", s)
Out: ['HERE IS', 'WHAT ARE WE SAYING']
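
For reuse, this last pattern could be wrapped in a small function. The following is a sketch under assumptions: the function name `find_caps_phrases` and the extra test strings are invented here. Since the pattern joins words with \s+, a phrase may also span tabs or newlines.

```python
import re

# One or more all-caps words separated by whitespace, on word boundaries.
CAPS_PHRASE = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*\b")

def find_caps_phrases(text):
    """Hypothetical helper: return each run of all-caps words as one string."""
    return CAPS_PHRASE.findall(text)

print(find_caps_phrases("HERE IS Test WHAT ARE WE SAYING"))
# ['HERE IS', 'WHAT ARE WE SAYING']
print(find_caps_phrases("A b C"))         # ['A', 'C']
print(find_caps_phrases("no caps here"))  # []
```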


 