
Multiple regex in Python findall

Say I have a string: "She has an excellent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"

Looking closely, this string doesn't have any punctuation. I am primarily focused on inserting the periods, so that it becomes: "She has an excellent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O". I can use a regex and findall to get a list of the relevant words. I tried something like the following, but it's not giving the desired result. I would like computationally efficient code.

import re

text = "She has an excellent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"

r = re.findall('([A-Z][a-z]+)|([a-zA-Z0-9]+)|([A-Z][a-z]+)', text)

I tried something like this with the PCRE engine: (\p{Ll}+)(\p{Lu}\p{Ll}*)

You can test it here: https://regex101.com/r/tqIcdS/1

The idea is to use \p{L} to match any letter (similar to \w), while also handling Unicode characters that might have accents (e.g. "Le pain, je l'ai mangéEnsuite j'ai bu un verre de vin").

  • \p{Ll} matches a lowercase Unicode letter.

  • \p{Lu} matches an uppercase Unicode letter.
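To make the two categories concrete, here is a small sketch using the standard-library unicodedata module, which reports the same Unicode general categories that \p{Ll} and \p{Lu} refer to. The sample characters are chosen only for illustration.

```python
import unicodedata

# General category of a few sample characters:
# "Ll" = lowercase letter, "Lu" = uppercase letter, "Nd" = decimal digit.
samples = "éEß2"
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(ch, cat)
```

An accented letter such as é falls in Ll just like a plain ASCII lowercase letter, which is why the property classes handle the French example above.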

I also captured the characters before and after to match the whole word.

Unfortunately, Python's default re library doesn't support \p{...} property classes.

But thanks to Wiktor's comment below, you could use the PyPI regex library: https://pypi.org/project/regex/
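A minimal sketch of how the pattern above might be used with that library. It assumes the third-party regex package is installed; the function name add_periods_regex and the replacement string r'\1. \2' are my own choices for illustration, not something given in this answer.

```python
# Sketch only: `regex` is a third-party package (pip install regex), not part of
# the standard library. The import is guarded so the snippet still loads without it.
try:
    import regex
except ImportError:
    regex = None


def add_periods_regex(text):
    """Insert '. ' between a lowercase run and the capitalised word that follows it."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    # \p{Ll} and \p{Lu} are Unicode property classes that `regex` understands.
    return regex.sub(r'(\p{Ll}+)(\p{Lu}\p{Ll}*)', r'\1. \2', text)


if regex is not None:
    print(add_periods_regex("Le pain, je l'ai mangéEnsuite j'ai bu un verre de vin"))
```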

You can use Python's built-in re for both ASCII-only and fully Unicode-aware solutions:

import re, sys

# Character classes listing every uppercase / lowercase code point.
# range() excludes its end value, so use sys.maxunicode + 1 to cover the full range.
pLu = '[{}]'.format("".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()))
pLl = '[{}]'.format("".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).islower()))

text = "She has an excellent command on the topicsOnly problem is clarity in EnglishHer confidence is very good in RUSSian and H2O"
print( re.sub(fr'({pLl})({pLu})', r'\1. \2', text) ) # Unicode-aware
# => She has an excellent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O
print( re.sub(r'([a-z])([A-Z])', r'\1. \2', text) ) # ASCII only
# => She has an excellent command on the topics. Only problem is clarity in English. Her confidence is very good in RUSSian and H2O


The main idea is to match and capture a lowercase letter and then an uppercase letter ( ([a-z])([A-Z]) ) and replace the match with the Group 1 value, then a period and a space, then the Group 2 value, where \1 and \2 are backreferences to these group values.
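As a hedged alternative, the same lowercase-then-uppercase boundary can be found without a regex at all, by walking the string once and using str.islower() and str.isupper(). This is a different technique from the re.sub approach, and the function name add_periods is hypothetical. It avoids building the large character classes, but whether it is actually faster depends on the input and would need measuring.

```python
def add_periods(text):
    """Insert '. ' wherever a lowercase letter is directly followed by an uppercase one."""
    out = []
    prev = ""  # "".islower() is False, so the first character never triggers an insert
    for ch in text:
        if prev.islower() and ch.isupper():
            out.append(". ")
        out.append(ch)
        prev = ch
    return "".join(out)


print(add_periods("command on the topicsOnly problem is clarity in EnglishHer confidence"))
```

Because islower() and isupper() are Unicode-aware, this also splits "mangéEnsuite", while "RUSSian" and "H2O" are left alone since neither contains a lowercase letter followed by an uppercase one.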
