I'm trying to split a string into sentences spoken by different characters, where each sentence starts with its speaker's tag, and store each sentence together with its start and end indices. The tags can be names of different lengths, such as Mike or Steve, and the content can be in multiple languages, such as Chinese or Japanese.
content = "A:Hello.B:How are you?A:I'm fine."
I want the result to look like this:
[0]A:Hello. , 0:7
[1]B:How are you? , 8:21
[2]A:I'm fine. , 22:32
You can use re.split as follows:
import re
s = "A:Hello.B:How are you?A:I'm fine."
t = re.split(r'[.?]', s)
print(t)
That gives the following (note that the punctuation is dropped and no indices are returned):
['A:Hello', 'B:How are you', "A:I'm fine", '']
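If the punctuation and the indices are both needed, one possible variation is to split on a zero-width lookbehind so the delimiter stays attached to each piece, then rebuild the offsets from the piece lengths. This is only a sketch: it assumes Python 3.7+ (where re.split accepts empty matches) and that every sentence ends in '.' or '?', so a speaker turn containing several sentences would be split into several pieces.

```python
import re

s = "A:Hello.B:How are you?A:I'm fine."

# Split right after each '.' or '?' without consuming it.
# The empty match at the end of the string yields a trailing '', which is filtered out.
parts = [p for p in re.split(r'(?<=[.?])', s) if p]

# Rebuild inclusive start:end indices from a running offset.
spans = []
offset = 0
for p in parts:
    spans.append((p, offset, offset + len(p) - 1))
    offset += len(p)

for idx, (text, start, end) in enumerate(spans):
    print('[{}]{} , {}:{}'.format(idx, text, start, end))
```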
You can use re.finditer for this task:
import re
content = "A:Hello.B:How are you?A:I'm fine."
for idx, i in enumerate(re.finditer(r'(.*?[.?])(?=[A-Z]|\Z)', content)):
    print('[{}]{:<20}, {}:{}'.format(idx, i.group(1), i.start(), i.end()-1))
Prints:
[0]A:Hello. , 0:7
[1]B:How are you? , 8:21
[2]A:I'm fine. , 22:32
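The pattern above relies on ASCII '.' or '?' followed by an uppercase letter, so it may not fit names like Mike or Steve combined with Chinese or Japanese text. A rough alternative, assuming the set of speaker names is known in advance, is to locate each "name:" tag and slice the text between consecutive tags. The helper name split_turns and the sample string below are made up for illustration; text before the first tag is ignored, and a speaker name that appears followed by a colon inside a sentence would be treated as a new tag.

```python
import re

def split_turns(content, speakers):
    """Return (text, start, end) tuples, one per speaker turn (end is inclusive)."""
    # Match any known speaker name followed by an ASCII or full-width colon.
    tag = re.compile('(?:' + '|'.join(map(re.escape, speakers)) + ')[:：]')
    starts = [m.start() for m in tag.finditer(content)]
    # Each turn runs from its tag to just before the next tag (or the end of the text).
    bounds = starts + [len(content)]
    return [(content[a:b], a, b - 1) for a, b in zip(bounds, bounds[1:])]

# Hypothetical sample mixing Chinese, English, and Japanese.
content = "Mike:你好。Steve:How are you?Mike:元気です。"
for idx, (text, start, end) in enumerate(split_turns(content, ["Mike", "Steve"])):
    print('[{}]{} , {}:{}'.format(idx, text, start, end))
```

Because the indices count Python string characters (code points), they can be used directly to slice the original string, regardless of language.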