Python String split by specific pattern with Indices

I'm trying to split a string of dialogue into one segment per speaker, where each sentence starts with its own speaker tag, and store each segment together with its start and end indices. The tags can be names of different lengths, such as Mike or Steve, and the content can be in multiple languages, such as Chinese or Japanese.

content = "A:Hello.B:How are you?A:I'm fine."

I want the result to look like this:

[0]A:Hello.       , 0:7
[1]B:How are you? , 8:21
[2]A:I'm fine.    , 22:32

You can use re.split as follows:

import re
s = "A:Hello.B:How are you?A:I'm fine."
t = re.split(r'[.?]', s)
print(t)

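This gives the following list. Note that the punctuation is dropped, the list ends with an empty string, and no indices are returned: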

['A:Hello', 'B:How are you', "A:I'm fine", '']
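
If the punctuation and the indices are both needed, re.split can still be used by wrapping the delimiter in a capturing group, which makes it keep the delimiters in the result. The following is only a minimal sketch under that approach; the names parts, segments and pos are illustrative and not part of the original answer, and it assumes every sentence ends with '.' or '?'.

```python
import re

s = "A:Hello.B:How are you?A:I'm fine."

# A capturing group makes re.split keep the delimiters in the result:
# ['A:Hello', '.', 'B:How are you', '?', "A:I'm fine", '.', '']
parts = re.split(r'([.?])', s)

segments = []
pos = 0  # running offset of the current segment within s
# Pair each text piece (even positions) with the delimiter after it (odd positions).
for text, delim in zip(parts[0::2], parts[1::2]):
    sentence = text + delim
    # Store an inclusive end index, matching the format asked for in the question.
    segments.append((sentence, pos, pos + len(sentence) - 1))
    pos += len(sentence)

for idx, (sentence, start, end) in enumerate(segments):
    print('[{}]{:<20}, {}:{}'.format(idx, sentence, start, end))
```

For the sample string, segments holds ('A:Hello.', 0, 7), ('B:How are you?', 8, 21) and ("A:I'm fine.", 22, 32). Because this splits on every '.' or '?', a speaker who says two sentences in a row would be split into two segments.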

You can use re.finditer for the task:

import re

content = "A:Hello.B:How are you?A:I'm fine."

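# Lazily match up to a '.' or '?' that is followed by an uppercase tag or the end of the string.
for idx, m in enumerate(re.finditer(r'(.*?[.?])(?=[A-Z]|\Z)', content)):
    # m.end() is exclusive, so subtract 1 to print an inclusive end index.
    print('[{}]{:<20}, {}:{}'.format(idx, m.group(1), m.start(), m.end() - 1))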

Prints:

[0]A:Hello.            , 0:7
[1]B:How are you?      , 8:21
[2]A:I'm fine.         , 22:32
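
The question also mentions longer names such as Mike or Steve and non-English content. The pattern above relies on a single uppercase ASCII letter, so here is a hedged sketch of one way it might be generalized. The pattern TURN and the helper split_turns are hypothetical names, and the sketch assumes that a speaker tag is a run of word characters followed by an ASCII or full-width colon, and that each turn ends with one of the listed punctuation marks directly before the next tag.

```python
import re

# Assumed format: <name><colon><text ending in sentence punctuation>.
# In Python 3, \w also matches CJK characters, so names like 小明 work.
TURN = re.compile(r'(\w+)[:：](.*?[.?!。？！])(?=\w+[:：]|\Z)', re.S)

def split_turns(content):
    """Return (speaker, segment, start, inclusive_end) for each turn."""
    return [(m.group(1), m.group(0), m.start(), m.end() - 1)
            for m in TURN.finditer(content)]

sample = "Mike:Hello.Steve:How are you?Mike:I'm fine."
for idx, (speaker, segment, start, end) in enumerate(split_turns(sample)):
    print('[{}]{:<20}, {}:{}'.format(idx, segment, start, end))
```

With these assumptions, a turn only ends where punctuation is immediately followed by the next tag, so several sentences by the same speaker stay in one segment. The sketch would split incorrectly if the text itself contains a word followed by a colon right after a sentence end.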


 