I'm trying to match words that have more than one letter and that are either all uppercase, or start with a lowercase letter followed by uppercase letters, or contain a hyphen in the middle ONLY if all the letters are uppercase. This is my code:
import re
s = "ASCII, aSCII, AS-CII, AS-cii"
myset = set(re.findall(r"\b[a-z]?[A-Z]+\-?[A-Z]{1,}",s))
Out[555]: {'AS', 'AS-CII', 'ASCII', 'aSCII'}
As you can see, 'AS' (taken from "AS-cii") shouldn't be returned, because that word contains lowercase letters after the hyphen. How could I fix this?
I tried this, but the result is an error:
myset = set(re.findall(r"\b[a-z]?[A-Z]+\-?[A-Z]+{1,}",s))
File "<ipython-input-545-7bdc0c902553>"
myset = set(re.findall(r"\b[a-z]?[A-Z]+\-?[A-Z]+{1,}",s))
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/re.py", line 222, in findall
return _compile(pattern, flags).findall(string)
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_parse.py", line 855, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_parse.py", line 619, in _parse
source.tell() - here + len(this))
error: multiple repeat
Here you go:
res = [x[0] for x in re.findall(r"(([a-z]{1}[A-Z]+)|([A-Z]+\-[A-Z]+))",s)]
print(res)
print(set(res))
gives
['aSCII', 'AS-CII']
Let me know if this works for you. I split the pattern into two alternatives joined with | to add OR logic.
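Note that the output above does not include the all-uppercase word 'ASCII'. A minimal sketch of one way to cover it, assuming a third alternative for two or more uppercase letters plus a trailing `(?!-)` guard (both are additions made here for illustration, not part of the pattern above):

```python
import re

s = "ASCII, aSCII, AS-CII, AS-cii"

# Alternatives: lowercase + uppercase, UPPER-UPPER, or 2+ uppercase letters.
# The non-capturing group makes findall return whole matches, and (?!-)
# (an assumed guard) rejects a match that stops right before a hyphen.
pattern = r"\b(?:[a-z][A-Z]+|[A-Z]+-[A-Z]+|[A-Z]{2,})\b(?!-)"

res = re.findall(pattern, s)
print(res)
```

This variant does not guard against a hyphen in front of the word, so something like "x-AS" would still yield 'AS'.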
You could use a conditional expression:
(...)?(?(1)if true then this|else this)
For your case, this could be
\b([a-z])?(?(1)[A-Z]+|[-A-Z]+[A-Z])(?!-)\b
See a demo on regex101.com.
\b                  # a word boundary
([a-z])?            # match a lower case letter if it is there
(?(1)               # if the lower case letter is there, match this branch
    [A-Z]+
    |
    [-A-Z]+[A-Z]    # else this one
)
(?!-)\b             # do not break at a -, followed by another boundary
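As a rough sketch of how the conditional pattern could be used from Python on the string from the question (the variable names are only illustrative):

```python
import re

s = "ASCII, aSCII, AS-CII, AS-cii"

pattern = re.compile(r"\b([a-z])?(?(1)[A-Z]+|[-A-Z]+[A-Z])(?!-)\b")

# The pattern contains a capturing group, so re.findall would return only
# the contents of group 1. finditer with group(0) yields the whole match.
matches = {m.group(0) for m in pattern.finditer(s)}
print(matches)
```

Using `finditer` here is a design choice forced by the capturing group that the conditional `(?(1)...)` depends on; that group cannot simply be made non-capturing.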
The following regex matches all of the mentioned criteria:
\b[a-z]*[A-Z]+[\-A-Z]+[A-Z]+\b
Please check it here: https://regex101.com/r/JNC4kN/1/
However, this fails on an example like aTH-THTH (a lowercase letter followed by uppercase letters containing a hyphen), which it matches even though it should not. If you want hyphenated words to be UPPER-UPPER only, use this regex instead:
\b[a-z]{0,1}(?<!\-)[A-Z]+\b(?!\-)|\b[A-Z]+\-[A-Z]+\b
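A quick way to compare the two patterns from this answer is to run both over the question's string with aTH-THTH appended. This is only a sketch; the names `loose` and `strict` are made up for illustration:

```python
import re

s = "ASCII, aSCII, AS-CII, AS-cii, aTH-THTH"

# First pattern from this answer: allows a hyphen after a lowercase prefix.
loose = r"\b[a-z]*[A-Z]+[\-A-Z]+[A-Z]+\b"
# Second pattern: hyphenated words must be UPPER-UPPER only.
strict = r"\b[a-z]{0,1}(?<!\-)[A-Z]+\b(?!\-)|\b[A-Z]+\-[A-Z]+\b"

# Neither pattern has capturing groups, so findall returns whole matches.
loose_matches = re.findall(loose, s)
strict_matches = re.findall(strict, s)

print(loose_matches)
print(strict_matches)
```

The first pattern keeps aTH-THTH, while the second one drops it along with AS-cii.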
You could use the following regular expression, which covers edge cases where the word is preceded or followed by a hyphen (as shown at the link below):
(?<!\w|(?<=\w)-)(?:[a-zA-Z][A-Z]+|[A-Z]{2,}|[A-Z]+-[A-Z]+)(?!\w|-(?=\w))
Python's regex engine performs the following operations.
(?<! # begin a negative lookbehind
\w # match word char
| # or
(?<=\w) # match a word char in a positive lookbehind
- # match '-'
) # end negative lookbehind
(?: # begin non-cap grp
[a-zA-Z][A-Z]+ # match a letter (either case) then 1+ uc letters
| # or
[A-Z]{2,} # match 2+ uc letters
| # or
[A-Z]+-[A-Z]+ # match 1+ uc letters, '-', then 1+ uc letters
) # end non-cap grp
(?! # begin negative lookahead
\w # match word char
| # or
- # match '-'
(?=\w) # match a word char in a positive lookahead
) # end negative lookahead
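The breakdown above can be written almost verbatim as a pattern compiled with `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern. A minimal sketch, run against the string from the question:

```python
import re

s = "ASCII, aSCII, AS-CII, AS-cii"

pattern = re.compile(r"""
    (?<!                # begin negative lookbehind
      \w                # a word char
      |                 # or
      (?<=\w)-          # '-' that is itself preceded by a word char
    )                   # end negative lookbehind
    (?:                 # begin non-capturing group
      [a-zA-Z][A-Z]+    # a letter, then 1+ uppercase letters
      |
      [A-Z]{2,}         # 2+ uppercase letters
      |
      [A-Z]+-[A-Z]+     # 1+ uppercase, '-', then 1+ uppercase
    )                   # end non-capturing group
    (?!                 # begin negative lookahead
      \w                # a word char
      |                 # or
      -(?=\w)           # '-' followed by a word char
    )                   # end negative lookahead
""", re.VERBOSE)

# No capturing groups, so findall returns the full matches.
matches = pattern.findall(s)
print(matches)
```

Both alternatives inside the lookbehind consume exactly one character, which is what lets Python's `re` module accept it, since it requires fixed-width lookbehinds.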