
Splitting on a group of capital letters in Python

I'm trying to tokenize a number of strings using a capital letter as a delimiter. I have landed on the following code:

import re

token = [a for a in re.split(r'([A-Z][a-z]*)', "ABCowDog") if a]

print(token)

And I get this, as expected, in return:

['A', 'B', 'Cow', 'Dog']

Now, this is just an example string to make life easier. In my case I want to go through this list, find the individual characters (easy enough by checking len()), and join consecutive single letters together, provided they meet a prior definition. In the example above, 'AB', 'Cow', and 'Dog' are the strings I actually want to form (consecutive capitals are part of an acronym). For whatever reason, once I have my token list, I am unable to figure out how to walk it. Sorry if this has a simple answer, but I'm fairly new to Python and am sick of banging my head against the wall.
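For the list-walking step the question describes, here is a minimal sketch. It assumes the only merge condition is "a token that is a single uppercase letter"; the question's "prior definition" is not specified, so any extra check would have to be added by hand. The name merge_acronyms is hypothetical.

```python
import re

def merge_acronyms(tokens):
    """Join runs of consecutive single uppercase letters into one token."""
    merged = []
    run = []  # single capitals seen since the last full word
    for tok in tokens:
        if len(tok) == 1 and tok.isupper():
            run.append(tok)  # still inside a possible acronym
        else:
            if run:
                merged.append("".join(run))  # flush the pending acronym
                run = []
            merged.append(tok)
    if run:
        merged.append("".join(run))  # acronym at the very end of the list
    return merged

tokens = [a for a in re.split(r'([A-Z][a-z]*)', "ABCowDog") if a]
print(merge_acronyms(tokens))  # ['AB', 'Cow', 'Dog']
```

The pending run is flushed both when a full word arrives and after the loop, so an acronym at the end of the string is not lost.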

re.split isn't always easy to use and can be limiting in many situations. You can try a different approach with re.findall:

>>> import re
>>> s = 'ABCowDog'
>>> re.findall(r'[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)', s)
['AB', 'Cow', 'Dog']
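To make the findall pattern easier to read, the sketch below spells out the same expression with re.VERBOSE and tries it on two extra inputs. The name ACRONYM_OR_WORD and the extra test strings are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as above, written out with comments.
ACRONYM_OR_WORD = re.compile(r"""
    [A-Z]                  # every token starts with a capital letter
    (?:
        [A-Z]* (?![a-z])   # more capitals, backing off before one that starts a word
      |
        [a-z]*             # otherwise the lowercase tail of a capitalized word
    )
""", re.VERBOSE)

for s in ("ABCowDog", "XMLHttpRequest", "CowDogAB"):
    print(s, ACRONYM_OR_WORD.findall(s))
# ABCowDog ['AB', 'Cow', 'Dog']
# XMLHttpRequest ['XML', 'Http', 'Request']
# CowDogAB ['Cow', 'Dog', 'AB']
```

The negative lookahead is what hands the last capital of a run (the C in ABCow) over to the following word instead of the acronym.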

You can use the following pattern to split with the regex module:

(?=[A-Z][a-z])


Code:

import regex
regex.split(r'(?=[A-Z][a-z])', "ABCowDog", flags=regex.VERSION1)

([A-Z][a-z]+)

You should split on this.
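
Both split-based suggestions can also be tried with the standard re module. This is a sketch under two assumptions: Python 3.7 or later (where re.split accepts patterns that can match an empty string), and that empty strings in the result should simply be filtered out. The variable names are illustrative.

```python
import re

s = "ABCowDog"

# Zero-width lookahead: cut just before every capital that starts a word.
# Needs Python 3.7+; older versions of re.split do not split on empty matches.
by_lookahead = [t for t in re.split(r'(?=[A-Z][a-z])', s) if t]

# Capture group: capitalized words act as delimiters, and the group keeps them
# in the result; the text between them (the acronym) is returned as well.
by_capture = [t for t in re.split(r'([A-Z][a-z]+)', s) if t]

print(by_lookahead)  # ['AB', 'Cow', 'Dog']
print(by_capture)    # ['AB', 'Cow', 'Dog']
```

The filter matters in both cases: the capture-group split yields empty strings between adjacent words, and the lookahead split yields a leading empty string when the input starts with a capitalized word.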
