Split string by position not character

We know that anchors, word boundaries, and lookarounds match at a position rather than matching a character.
Is it possible to split a string at one of these kinds of positions with a regex (specifically in Python)?

For example, consider the following string:

"ThisisAtestForchEck,Match IngwithPosition." 

So I want the following result (the substrings that start with an uppercase letter that is not preceded by a space):

['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

If I split with a capturing group, I get:

>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atch ', 'I', 'ngwith', 'P', 'osition.']

And this is the result with lookaround (on a Python version before 3.7, where re.split ignores empty matches):

>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,Match IngwithPosition.']

Note that if I want to split at every uppercase letter, including those preceded by a space, e.g.:

['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']

I can use re.findall, viz.:

>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']

But what about the first example: is it possible to solve it with re.findall?

 (?<!\s)(?=[A-Z])

You can use this to split with the third-party regex module, as re (before Python 3.7) does not split at zero-width assertions.

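import regex  # third-party module (pip install regex)
x = "ThisisAtestForchEck,Match IngwithPosition."
# The pattern also matches at position 0, so the first item is an empty string
print(regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1))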

or, to drop empty strings from the result:

print([i for i in regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1) if i])

See demo: https://regex101.com/r/sJ9gM7/65
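On Python 3.7 and later, the standard-library re module also splits on empty matches, so the same lookaround pattern may work without the regex module. A minimal sketch, assuming Python 3.7+ (the variable names `parts` and `result` are illustrative):

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Split at every position that is followed by an uppercase letter
# and not preceded by whitespace. Splitting on empty matches
# requires Python 3.7 or later.
parts = re.split(r"(?<!\s)(?=[A-Z])", s)

# The pattern also matches at position 0, which produces a leading
# empty string, so empty pieces are filtered out.
result = [p for p in parts if p]
print(result)
```

On older Python versions, the same call returns the whole string as a single item, which is the behavior shown in the question.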

A way with re.findall:

re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)

When you decide to change your approach from split to findall, the first job is to reformulate your requirements: "I want to split the string on each uppercase letter not preceded by a space" => "I want to find one or more substrings, separated by spaces, that each begin with an uppercase letter, except at the start of the string (if the string doesn't start with an uppercase letter)".
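To illustrate the reformulated requirement, here is a small sketch that runs the pattern on the example string and on a second input that does not start with an uppercase letter. The second input, `other`, is made up for illustration and is not from the question:

```python
import re

# [A-Z] or a leading non-uppercase, non-space character starts a piece;
# the optional tail glues on space-separated words that begin with uppercase.
pattern = re.compile(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*')

s = "ThisisAtestForchEck,Match IngwithPosition."
found = pattern.findall(s)
print(found)

# Hypothetical input starting with a lowercase letter
other = "thisisAtest Bar"
found_other = pattern.findall(other)
print(found_other)
```

Because the pattern contains only non-capturing groups, findall returns plain strings rather than tuples.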

I know this might be less convenient because of the tuple nature of the result, but I think that this findall finds what you need:

re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]

This can be used in the following list comprehension to give the desired output:

[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

And here is a hack that uses split (with two capturing groups, every third item starting at index 1 is the full match):

re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
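The tuples appear because the pattern has capturing groups. One possible variation, sketched here as an assumption rather than as part of the original answer, is to make the inner group non-capturing and drop the outer one, so that findall returns the matched strings directly:

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Same logic as above, but with (?:...) instead of (...):
# start at an uppercase letter not preceded by whitespace, then consume
# non-uppercase characters or uppercase letters that follow whitespace.
matches = re.findall(r'(?<!\s)[A-Z](?:[^A-Z]|(?<=\s)[A-Z])*', s)
print(matches)
```

With no capturing groups in the pattern, neither the list comprehension nor the slicing trick is needed.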

Try capturing with this pattern:

([A-Z][a-z]*(?: [A-Z][a-z]*)*)

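A quick sketch of this pattern with re.findall on the example string. Note that the pattern only matches letters and single spaces, so punctuation such as the comma and the final period does not appear in the result:

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# An uppercase letter followed by lowercase letters, optionally continued
# by more capitalized words that are each preceded by one space.
captured = re.findall(r'([A-Z][a-z]*(?: [A-Z][a-z]*)*)', s)
print(captured)
```

If the punctuation must be kept, as in the expected output of the question, one of the patterns based on `[^A-Z]` is probably a better fit.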
