简体   繁体   中英

Python - Split string into characters while excluding a certain substring

I'm trying to split up a string of characters into a list while excluding certain substrings.

For example:

>>> sentences = '<s>I like dogs.</s><s>It\'s Monday today.</s>'
>>> substring1 = '<s>'
>>> substring2 = '</s>'
>>> print(split_string(sentences))
['<s>', 'I', ' ', 'l', 'i', 'k', 'e', ' ', 'd', 'o', 'g', 's', 
'.', '</s>', '<s>', 'I', 't', "'", 's', ' ', 'M', 'o', 'n', 'd',
'a', 'y', ' ', 't', 'o', 'd', 'a', 'y', '.', '</s>']

As you can see, the string is split up into characters, except for the listed substrings. How can I do this in Python?

You can use re.split :

import re
s = '<s>I like dogs.</s><s>It\'s Monday today.</s>'
result = [i for b in re.split('\<s\>|\</s\>', s) for i in ['<s>', *b, '</s>'] if b]

Output:

['<s>', 'I', ' ', 'l', 'i', 'k', 'e', ' ', 'd', 'o', 'g', 's', '.', '</s>', '<s>', 'I', 't', "'", 's', ' ', 'M', 'o', 'n', 'd', 'a', 'y', ' ', 't', 'o', 'd', 'a', 'y', '.', '</s>']

You could use re.findall for this. :)

import re
sentences = '<s>I like dogs.</s><s>It\'s Monday today.</s>'
print(re.findall(r'<\/?s>|.',sentences))

OUTPUT

['<s>', 'I', ' ', 'l', 'i', 'k', 'e', ' ', 'd', 'o', 'g', 's', '.', '</s>', '<s>', 'I', 't', "'", 's', ' ', 'M', 'o', 'n', 'd', 'a', 'y', ' ', 't', 'o', 'd', 'a', 'y', '.', '</s>']

Are you trying to exclude from the above output the <s> and </s> substrings?

If so:

>>> sentences = '<s>I like dogs.</s><s>It\'s Monday today</s>'
>>> substrings = ['<s>','<\s>']
>>> [character for character in split(sentences) if character not in substrings]

will give the expected output.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM