
Split string by position not character

We know that anchors, word boundaries, and lookarounds match at a position rather than matching a character.
Is it possible to split a string at such a position with a regex (specifically in Python)?

For example consider the following string:

"ThisisAtestForchEck,Match IngwithPosition." 

I want the following result (the substrings that start with an uppercase letter that is not preceded by a space):

['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

If I split with a capturing group I get:

>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atch ', 'I', 'ngwith', 'P', 'osition.']

And this is what I tried with lookaround:

>>> re.split(r'(?<=[A-Z])',s)
>>> re.split(r'((?<=[A-Z]))',s)
>>> re.split(r'((?<=[A-Z])?)',s)

Note that if I also want to split at the uppercase letters that are preceded by a space, e.g.:

['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']

I can use re.findall, viz.:

>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']

But what about the first example: is it possible to solve it with re.findall?


You can split with the third-party regex module, because re (before Python 3.7) does not split on zero-width assertions.

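import regex  # third-party module (pip install regex)

x = "ThisisAtestForchEck,Match IngwithPosition."
# Split at each position that is followed by an uppercase letter
# and is not preceded by whitespace.
print(regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1))

# Same split, dropping any empty strings (e.g. from a zero-width match at the very start).
print([i for i in regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1) if i])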

See demo.
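
For readers on a recent interpreter: since Python 3.7 the standard-library re.split also splits on empty matches, so the same lookaround pattern can be tried without the regex module. The following is a minimal sketch under that assumption; the names pattern and parts are illustrative only.

```python
import re
import sys

s = "ThisisAtestForchEck,Match IngwithPosition."

# Zero-width split point: just before an uppercase letter,
# but only when the previous character is not whitespace.
pattern = r"(?<!\s)(?=[A-Z])"

# Assumption: Python 3.7+, where re.split honours empty matches.
assert sys.version_info >= (3, 7), "zero-width splits need Python 3.7+"

# The empty match at position 0 produces a leading '', so filter it out.
parts = [p for p in re.split(pattern, s) if p]
print(parts)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
```

On older interpreters the call does not split at these positions, which is why the regex module is suggested above.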


A way with re.findall:


When you decide to change your approach from split to findall, the first job is to reformulate your requirements: "I want to split the string on each uppercase letter not preceded by a space" => "I want to find runs of one or more space-separated substrings, each beginning with an uppercase letter, except at the start of the string (if the string doesn't start with an uppercase letter)".
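
One way the reformulated requirement might be written as a findall pattern is sketched below. The pattern is a hypothetical illustration and not necessarily the one the answer had in mind: it accepts either a leading run without uppercase letters or a chunk that begins with an uppercase letter, and then keeps extending across uppercase letters that directly follow whitespace.

```python
import re

# Hypothetical pattern, written for illustration only.
pattern = re.compile(
    r"(?:^[^A-Z]+|[A-Z][^A-Z]*)"   # leading non-uppercase run, or an uppercase-led chunk
    r"(?:(?<=\s)[A-Z][^A-Z]*)*"    # continue across uppercase letters that follow whitespace
)

s = "ThisisAtestForchEck,Match IngwithPosition."
chunks = pattern.findall(s)  # no capturing groups, so whole matches are returned
print(chunks)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

# A string that does not start with an uppercase letter keeps its prefix.
print(pattern.findall("fooBar Baz"))
# ['foo', 'Bar Baz']
```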

I know this might be less convenient because of the tuple nature of the result. But I think that this findall finds what you need:

re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]

This can be used in the following list comprehension to give the desired output:

[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

And here is a hack that uses split:

re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
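
The tuples appear only because the pattern contains capturing groups. As a sketch of an alternative (not part of the answer above), making the inner group non-capturing and dropping the outer one lets re.findall return plain strings, so neither the list comprehension nor the [1::3] slice is needed. The name flat is illustrative.

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Same logic as the pattern above, but with (?:...) instead of (...):
# with no capturing groups, findall returns the whole match as a string.
flat = re.findall(r"(?<!\s)[A-Z](?:[^A-Z]|(?<=\s)[A-Z])*", s)
print(flat)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
```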

Try capturing with this pattern:

([A-Z][a-z]*(?: [A-Z][a-z]*)*)
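
A quick check of this pattern against the question's string, as a minimal sketch: because it only allows [a-z] after each uppercase letter, punctuation such as the comma and the final period is left out of the matches, which may or may not be acceptable for your data.

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# The single capturing group wraps the whole pattern, so findall returns strings.
captured = re.findall(r"([A-Z][a-z]*(?: [A-Z][a-z]*)*)", s)
print(captured)
# ['Thisis', 'Atest', 'Forch', 'Eck', 'Match Ingwith', 'Position']
```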


