简体   繁体   中英

Python split string by pattern

I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc" . The number of the chars can differ and sometimes there can be dash inside the string, like "aaaaa-bbbbbbbbbbbbbbccccccccccc" .

Is there any smart way to either split it "aaaaa" , "bbbbbbbbbbbbbb" , "ccccccccccc" and get the indices of were it is split or just get the indices, without looping through every string? If the dash is between to patterns it can end up either in the left or right one as long it is always handled the same.

Any idea?

Regular expression MatchObject results include indices of the match. What remains is to match repeating characters:

import re

repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')

would match only if a given letter character ( a - z ) is repeated at least once:

>>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
...     print match.group(), match.start(), match.end()
... 
aaaaa 0 5
bbbbbbbbbbbbbb 5 19
ccccccccccc 19 30

The .start() and .end() methods on the match result give you the exact positions in the input string.

Dashes are included in the matches, but not non-repeating characters:

>>> for match in repeat.finditer("a-bb-cccccccc"):
...     print match.group(), match.start(), match.end()
... 
bb- 2 5
cccccccc 5 13

If you want the a- part to be a match, simply replace the + with a * multiplier:

repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')

What about using itertools.groupby ?

>>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
>>> from itertools import groupby
>>> [''.join(v) for k,v in groupby(s)]
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']

This will put the - as their own substrings which could easily be filtered out.

>>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
>>> [''.join(v) for k,v in groupby(s) if k != '-']
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
str="aaaaabbbbbbbbbbbbbbccccccccccc"
p = [0] 
for i, c in enumerate(zip(str, str[1:])):
    if c[0] != c[1]:
        p.append(i + 1)
print p

# [0, 5, 19]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM