
How to split a string into a list of predefined substrings of different lengths?

Given a collection of predefined strings of unequal length and an input string, I want to split the input into occurrences of elements from the collection. The output should be unique for every input, and the split should prefer the longest possible chunks.

For example, it should split s, c and h into separate chunks unless they are adjacent.

If "sc" appears, it should be grouped as 'sc' and not as 's', 'c'. Similarly, "sh" must be grouped as 'sh', "ch" as 'ch', and finally "sch" as 'sch'.

I only know that string.split(delim) splits on a specified delimiter and that a fixed-width pattern such as '\w{n}' (for example with re.findall) breaks a string into chunks of equal length. Neither method gives the intended result. How can this be done?

Pseudo code:

def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string)
    return output

And example outputs:

phonemic_splitter('case') -> ['c', 'a', 's', 'e']
phonemic_splitter('ash') -> ['a', 'sh']
phonemic_splitter('change') -> ['ch', 'a', 'n', 'g', 'e']
phonemic_splitter('schane') -> ['sch', 'a', 'n', 'e']

Here is a possible solution:

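def phonemic_splitter(s, phonemes):
    # Try longer phonemes first so that 'sch' wins over 'sh' and 's'.
    phonemes = sorted(phonemes, key=len, reverse=True)
    result = []
    while s:
        # Take the first (longest) phoneme that prefixes the remaining string.
        result.append(next(filter(s.startswith, phonemes)))
        # Drop the matched prefix and continue with the rest.
        s = s[len(result[-1]):]
    return result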

This solution assumes that phonemes contains every phoneme that can occur in the string s. Otherwise no candidate matches and next raises StopIteration.
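When no phoneme matches, the bare next call fails with a StopIteration that says nothing about the offending input. The following is a minimal sketch of one way to report that case more clearly. The name phonemic_splitter_strict and the choice of ValueError are assumptions made for illustration, not part of the original answer.

```python
def phonemic_splitter_strict(s, phonemes):
    """Greedy longest-match split that reports unmatched input clearly."""
    # Longest phonemes first; empty strings are dropped to avoid an endless loop.
    ordered = sorted((p for p in phonemes if p), key=len, reverse=True)
    result = []
    pos = 0
    while pos < len(s):
        # str.startswith accepts a start offset, so no slicing is needed.
        match = next((p for p in ordered if s.startswith(p, pos)), None)
        if match is None:
            raise ValueError(f"no phoneme matches {s[pos:]!r} at index {pos}")
        result.append(match)
        pos += len(match)
    return result
```

Passing a default of None to next turns the missing-match case into an explicit check instead of an exception raised from inside the iterator.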

For a large phoneme list, the linear scan performed by next could in principle be replaced with a faster lookup, for example a binary search over a sorted list.
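One way to avoid the linear scan is sketched below. Note that it swaps in a set lookup by candidate length rather than the binary search suggested above, because set membership is simple to get right for prefix matching. The function name phonemic_splitter_fast is hypothetical.

```python
def phonemic_splitter_fast(s, phonemes):
    """Greedy longest-match split using set membership instead of a scan."""
    phoneme_set = {p for p in phonemes if p}
    max_len = max(map(len, phoneme_set), default=0)
    result = []
    pos = 0
    while pos < len(s):
        # Try the longest possible chunk first, then shrink one character at a time.
        for size in range(min(max_len, len(s) - pos), 0, -1):
            chunk = s[pos:pos + size]
            if chunk in phoneme_set:
                result.append(chunk)
                pos += size
                break
        else:
            # The loop ended without a break: nothing matched at this position.
            raise ValueError(f"no phoneme matches {s[pos:]!r} at index {pos}")
    return result
```

Each position costs at most max_len set lookups, independent of how many phonemes there are, so this mainly pays off when the phoneme list is long and the phonemes themselves are short.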

You could use a regex:

import re

cases = ['case', 'ash', 'change', 'schane']

for e in cases:
    # Longer alternatives come first; [a-z] catches any remaining single letter.
    print(repr(e), '->', re.findall(r'sch|sh|ch|[a-z]', e))

Prints:

'case' -> ['c', 'a', 's', 'e']
'ash' -> ['a', 'sh']
'change' -> ['ch', 'a', 'n', 'g', 'e']
'schane' -> ['sch', 'a', 'n', 'e']
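The order of the alternatives matters here: Python's re module tries alternatives from left to right and takes the first one that matches, not the longest. A small sketch comparing two orderings of the same alternatives (the variable names are illustrative only):

```python
import re

# Longest alternatives first: 'sch' is tried before 'sh' and the single letters.
longest_first = re.findall(r'sch|sh|ch|[a-z]', 'schane')

# Shortest first: 's' matches immediately, so 'sch' is never reached.
shortest_first = re.findall(r's|sh|sch|[a-z]', 'schane')

print(longest_first)   # ['sch', 'a', 'n', 'e']
print(shortest_first)  # ['s', 'c', 'h', 'a', 'n', 'e']
```

This is why any pattern built from a phoneme list should be sorted by length in descending order before joining with '|'.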

You could incorporate this into your function this way:

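import re

def do_something(s, splits):
    # Multi-letter phonemes, longest first, so 'sch' is tried before 'sh' and 'ch'.
    multi = sorted((x for x in splits if len(x) > 1), key=len, reverse=True)
    # Escape each phoneme in case it contains regex metacharacters,
    # then fall back to any single lowercase letter.
    pat = '|'.join([re.escape(x) for x in multi] + ['[a-z]'])
    return re.findall(pat, s)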

def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string, phonemes)
    return output
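One caveat with re.findall is that it silently skips characters that no alternative matches, such as uppercase letters or digits. The sketch below is one possible way to detect that, assuming the pattern should be built from the phoneme list alone with no [a-z] fallback. The helper name build_splitter and the ValueError are hypothetical choices.

```python
import re

def build_splitter(phonemes):
    """Return a split function whose pattern is compiled once from phonemes."""
    # Longest first, escaped so each phoneme is matched literally.
    ordered = sorted((p for p in phonemes if p), key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, ordered)))

    def split(s):
        tokens = pattern.findall(s)
        # findall skips unmatched characters, so verify nothing was dropped.
        if ''.join(tokens) != s:
            raise ValueError(f"input {s!r} contains text outside the phoneme list")
        return tokens

    return split

split = build_splitter(['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch'])
print(split('change'))  # ['ch', 'a', 'n', 'g', 'e']
```

Compiling the pattern once is worthwhile when the same phoneme list is applied to many strings.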


 