
Regex match, get all groups separated by a separator token?

I have a special format for encoding, and I would like a regex that extracts the encoded information. I have ':' as a special character that separates different 'blocks' of information. For example:

s = 'P:1:a:3:test_data'

Should get split to:

['P','1','a','3','test_data']

I can use:

s.split(':')

However, I can have a single ':' being encoded as well (there will never be more than 1 ':' grouped together, so there is no ambiguity). So for example:

s = 'P:1:::3:test_data'

Should give:

['P','1',':','3','test_data']

Using split(':') fails here:

['P', '1', '', '', '3', 'test_data']

What is the best way to capture that ':'? I am not very strong with regexes. I know a regex can match one or more characters with '+' (or zero or more with '*'), but I am confused about how to piece it all together. Better yet, is there a way to do it without regex? I guess I could always iterate over the list, check for consecutive empty strings, and combine them into ':'. Is there a more elegant way of doing it?

Thanks

For your specific case, you can use negative lookarounds to restrict which colons you split on. The pattern (?<!:):|:(?!:) matches a colon that is not both preceded and followed by another colon, so the middle colon of ':::' is kept as data:

import re
s = 'P:1:a:3:test_data'
s1 = 'P:1:::3:test_data'

re.split("(?<!:):|:(?!:)", s)
# ['P', '1', 'a', '3', 'test_data']

re.split("(?<!:):|:(?!:)", s1)
# ['P', '1', ':', '3', 'test_data']
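
As a rough illustration of how that pattern is put together, the sketch below writes the same regex in re.VERBOSE mode with one comment per alternative. The name SPLIT_ON is made up for this example. It also shows what happens when two encoded colons sit next to each other, which the question says will not occur: the run of colons between the two separators comes back as a single field.

```python
import re

# Same pattern as above, spelled out with comments (SPLIT_ON is a hypothetical name).
SPLIT_ON = re.compile(
    r"""
    (?<!:):    # a colon that is not preceded by another colon
    |          # or
    :(?!:)     # a colon that is not followed by another colon
    """,
    re.VERBOSE,
)

single = SPLIT_ON.split('P:1:::3:test_data')
# ['P', '1', ':', '3', 'test_data']

# Two adjacent encoded colons: only the outer colons of the run are separators,
# so the three inner colons are returned together as one field.
double = SPLIT_ON.split('P:1:::::3:test_data')
# ['P', '1', ':::', '3', 'test_data']
```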

Another option, which is more general and can handle several encoded colons in a row, is re.findall with (.+?)(?::|$), i.e. lazily match at least one character until a colon or the end of the string is reached:

re.findall('(.+?)(?::|$)', 'P:1:::3:test_data')
# ['P', '1', ':', '3', 'test_data']

re.findall('(.+?)(?::|$)', 'P:1:::::3:test_data')
# ['P', '1', ':', ':', '3', 'test_data']
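
The question also asks for a way without regex. One possible sketch, following the idea from the question of merging consecutive empty strings after a plain split, is shown below. The function name split_blocks is hypothetical, and the code assumes that no block is ever empty, since an empty block could not be told apart from an encoded ':'.

```python
def split_blocks(s, sep=':'):
    """Split s on sep, treating each pair of adjacent empty pieces as an encoded sep.

    Assumes no block is empty (hypothetical helper, not part of the original answer).
    """
    parts = s.split(sep)
    blocks = []
    i = 0
    while i < len(parts):
        # An encoded separator shows up as two empty strings in a row after split().
        if parts[i] == '' and i + 1 < len(parts) and parts[i + 1] == '':
            blocks.append(sep)
            i += 2
        else:
            blocks.append(parts[i])
            i += 1
    return blocks


print(split_blocks('P:1:a:3:test_data'))    # ['P', '1', 'a', '3', 'test_data']
print(split_blocks('P:1:::3:test_data'))    # ['P', '1', ':', '3', 'test_data']
print(split_blocks('P:1:::::3:test_data'))  # ['P', '1', ':', ':', '3', 'test_data']
```

Like the re.findall version, this handles several encoded colons in a row, at the cost of a few more lines.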
