
Splitting on a group of capital letters in Python

I'm trying to tokenize a number of strings using a capital letter as a delimiter. I have landed on the following code:

import re

token = [a for a in re.split(r'([A-Z][a-z]*)', "ABCowDog") if a]

print(token)

And I get this, as expected, in return:

['A', 'B', 'Cow', 'Dog']

Now, this is just an example string to make life easier, but in my case I want to go through this list, find the individual characters (easy enough by checking len()), and join the individual letters together, provided they meet a prior definition. In the example above, 'AB', 'Cow', and 'Dog' are the strings I actually want to form (consecutive capitals are part of an acronym). For whatever reason, once I have my token list, I am unable to figure out how to walk it. Sorry if this has a simple answer, but I'm fairly new to Python and am sick of banging my head against the wall.
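The list walk described above could look something like the following. This is only a minimal sketch: it assumes that any run of adjacent single-uppercase-letter tokens should be joined into one acronym, the helper name merge_acronyms is hypothetical, and the "prior definition" check mentioned in the question is left out.

```python
import re


def merge_acronyms(tokens):
    """Join runs of adjacent single uppercase letters into one token.

    Hypothetical helper; assumes every run of single capitals is an acronym.
    """
    merged = []
    run = []  # single capitals collected since the last longer token
    for tok in tokens:
        if len(tok) == 1 and tok.isupper():
            run.append(tok)
            continue
        if run:  # a longer token ends the current acronym
            merged.append("".join(run))
            run = []
        merged.append(tok)
    if run:  # flush an acronym at the end of the list
        merged.append("".join(run))
    return merged


tokens = [a for a in re.split(r'([A-Z][a-z]*)', "ABCowDog") if a]
print(merge_acronyms(tokens))  # ['AB', 'Cow', 'Dog']
```

The same loop also handles a trailing acronym, so the tokens from "CowDNA" come back as ['Cow', 'DNA'].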

re.split isn't always easy to use and can be limiting in some situations. You can try a different approach with re.findall:

>>> import re
>>> s = 'ABCowDog'
>>> re.findall(r'[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)', s)
['AB', 'Cow', 'Dog']
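To make the two branches of that pattern easier to read, here is the same expression written with re.VERBOSE and wrapped in a small function. The names TOKEN and tokenize are hypothetical, and the extra sample strings are only illustrations, not part of the original question.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part can be commented.
TOKEN = re.compile(r"""
    [A-Z]                  # every token starts with a capital letter
    (?:
        [A-Z]* (?![a-z])   # acronym: more capitals, backing off before one that starts a word
      |
        [a-z]*             # ordinary word: the lowercase letters that follow
    )
""", re.VERBOSE)


def tokenize(s):
    """Return acronyms and capitalized words found in s (hypothetical helper)."""
    return TOKEN.findall(s)


for sample in ("ABCowDog", "CowDNA", "HTTPServer"):
    print(sample, tokenize(sample))
```

Note that findall only returns what the pattern matches, so leading lowercase letters or digits are silently skipped. Whether that is acceptable depends on the input strings.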

You can use the following pattern to split with the regex module:

(?=[A-Z][a-z])


Code:

import regex  # third-party module (pip install regex)
regex.split(r'(?=[A-Z][a-z])', "ABCowDog", flags=regex.VERSION1)

([A-Z][a-z]+)

You should split like this.
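For comparison, both split patterns above can also be tried with the standard-library re module. This sketch assumes Python 3.7 or later, where re.split accepts a pattern that can match an empty string. The function names split_lookahead and split_capture are hypothetical.

```python
import re


def split_lookahead(s):
    # Zero-width split before each capital that is followed by a lowercase letter.
    # Assumes Python 3.7+, where re.split can split on empty matches.
    return [part for part in re.split(r'(?=[A-Z][a-z])', s) if part]


def split_capture(s):
    # The capturing group keeps each matched word in the result list;
    # the empty strings produced between matches are filtered out.
    return [part for part in re.split(r'([A-Z][a-z]+)', s) if part]


print(split_lookahead("ABCowDog"))  # ['AB', 'Cow', 'Dog']
print(split_capture("ABCowDog"))    # ['AB', 'Cow', 'Dog']
```

The two patterns agree on the example string but differ when an acronym follows a word: for "CowDNA" the lookahead version returns ['CowDNA'], while the capturing version returns ['Cow', 'DNA'].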
