How to split a string into a list of predefined substrings of different lengths?

Question

Given a collection of predefined strings of unequal length and an input string, split the input into occurrences of elements of the collection. The output should be unique for every input, and it should prefer the longest possible chunks.

For example, it should split s, c, and h into separate chunks unless they are adjacent.

If "sc" appears, it should be grouped as 'sc' rather than 's', 'c'. Similarly, "sh" must be grouped as 'sh', "ch" as 'ch', and finally "sch" as 'sch'.

I only know that string.split(delim) splits on a specified delimiter and that re.split('\w{n}', string) splits a string into chunks of equal length. Neither method gives the intended result. How can this be done?

Pseudo code:

def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string)
    return output

And example outputs:

phonemic_splitter('case') -> ['c', 'a', 's', 'e']
phonemic_splitter('ash') -> ['a', 'sh']
phonemic_splitter('change') -> ['ch', 'a', 'n', 'g', 'e']
phonemic_splitter('schane') -> ['sch', 'a', 'n', 'e']

Answer 1

Here is a possible solution:

def phonemic_splitter(s, phonemes):
    phonemes = sorted(phonemes, key=len, reverse=True)
    result = []
    while s:
        result.append(next(filter(s.startswith, phonemes)))
        s = s[len(result[-1]):]
    return result

This solution assumes that phonemes contains every phoneme that can occur in the string s; otherwise, next raises StopIteration at the first position where nothing matches.
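If the input may contain characters that no phoneme covers, a small variation can report the problem explicitly. The following is only a sketch: the name phonemic_splitter_checked and the choice of ValueError are assumptions for illustration, not part of the original answer.

```python
def phonemic_splitter_checked(s, phonemes):
    # Longest phonemes first, so 'sch' is tried before 'sh' and 's'.
    # Empty strings are dropped because they would match forever.
    ordered = sorted((p for p in phonemes if p), key=len, reverse=True)
    result = []
    while s:
        # next(..., None) returns None instead of raising StopIteration.
        match = next(filter(s.startswith, ordered), None)
        if match is None:
            raise ValueError(f"no phoneme matches at {s!r}")
        result.append(match)
        s = s[len(match):]
    return result
```

With the phoneme list from the question, an input such as 'cat' would then fail with a ValueError that names the unmatched remainder.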

One could also speed up this solution by using a binary search in place of next.
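The answer does not spell out the binary search, so the sketch below swaps in a different technique: a set lookup per candidate length, tried longest first. It avoids scanning the whole phoneme list at every position. The function name phonemic_splitter_by_length is hypothetical.

```python
def phonemic_splitter_by_length(s, phonemes):
    phoneme_set = set(phonemes)
    # Distinct phoneme lengths, longest first (3, 2, 1 for the question's list).
    lengths = sorted({len(p) for p in phoneme_set if p}, reverse=True)
    result = []
    i = 0
    while i < len(s):
        for n in lengths:
            chunk = s[i:i + n]
            if chunk in phoneme_set:
                result.append(chunk)
                # Advance by the chunk's real length; near the end of the
                # string the slice can be shorter than n.
                i += len(chunk)
                break
        else:
            # No candidate length produced a known phoneme.
            raise ValueError(f"no phoneme matches at index {i}")
    return result
```

Each position costs at most one set lookup per distinct length, regardless of how many phonemes there are.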

Answer 2

You could use a regex:

import re

cases = ['case', 'ash', 'change', 'schane']

for e in cases:
    print(repr(e), '->', re.findall(r'sch|sh|ch|[a-z]', e))

Prints:

'case' -> ['c', 'a', 's', 'e']
'ash' -> ['a', 'sh']
'change' -> ['ch', 'a', 'n', 'g', 'e']
'schane' -> ['sch', 'a', 'n', 'e']
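The pattern above lists the longer alternatives first for a reason: Python's re module tries alternatives from left to right and takes the first one that matches, not the longest. A short illustration (the variable names are only for this example):

```python
import re

word = 'ash'

# Longer alternative first: 'sh' is found as one chunk.
longest_first = re.findall(r'sh|s|[a-z]', word)

# Shorter alternative first: 's' matches before 'sh' is ever tried.
shortest_first = re.findall(r's|sh|[a-z]', word)

print(longest_first)   # ['a', 'sh']
print(shortest_first)  # ['a', 's', 'h']
```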

You could incorporate this into your function like so:

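import re

def do_something(s, splits):
    # Multi-character phonemes, longest first, so 'sch' wins over 'sh' and 'ch'.
    multi = sorted((x for x in splits if len(x) > 1), key=len, reverse=True)
    # re.escape guards against regex metacharacters; [a-z] catches single letters.
    pat = '|'.join([re.escape(x) for x in multi] + ['[a-z]'])
    return re.findall(pat, s)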

def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string, phonemes)
    return output
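Note that the [a-z] fallback accepts any lowercase letter, including ones that are not in phonemes, and that re.findall silently skips characters the pattern cannot match. If that is not wanted, one possible variation builds the pattern from the phonemes alone and checks that the whole input was consumed. This is a sketch under assumptions: the name phonemic_splitter_strict and the ValueError are illustrative, and phonemes is assumed to be a non-empty list of non-empty strings.

```python
import re

def phonemic_splitter_strict(string, phonemes):
    # Longest first, because regex alternation takes the first match.
    ordered = sorted(phonemes, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(p) for p in ordered))
    chunks = pattern.findall(string)
    # findall skips unmatched characters, so compare the total length.
    if sum(map(len, chunks)) != len(string):
        raise ValueError(f"{string!r} contains characters outside the phonemes")
    return chunks
```

Usage, with the list from the question:

```python
phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
print(phonemic_splitter_strict('change', phonemes))  # ['ch', 'a', 'n', 'g', 'e']
```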

2 answers

solution1
2 ACCEPTED 2021-08-25 14:13:39

solution2
0 2021-08-25 14:45:37