Split a string on multiple delimiters, keeping some of the delimiters but not all

Question

I have a string that can look something like this:

1. "foo bar"
2. "foo bar foo:bar"
3. "foo bar "
4. "foo bar      "
5. "foo bar foo:bar:baz"

I want to split this string so that it ends up with the following results:

1. ['foo', 'bar']
2. ['foo', 'bar', 'foo', ':', 'bar']
3. / 4. ['foo', 'bar', '']
5. ['foo', 'bar', 'foo', ':', 'bar', ':', 'baz']

In other words, following these rules:

1. Split the string on every occurrence of a space.
   a. If there are one or more spaces at the end of the string, add one empty string to the split list.
   b. Any spaces before the last non-space character in the string should be consumed and not added to the split list.
2. Split the string on every occurrence of a colon, and do not consume the colon.

The XY problem is this, in case it's relevant:

I want to mimic Bash tab-completion behaviour. When you type a command into a Bash interpreter, it splits the command into an array, COMP_WORDS, following the rules above: words are split on spaces and colons, each colon is placed in its own array element, and spaces are ignored unless they are at the end of the string. I want to recreate this behaviour in Python, given a string that looks like a command a user would type.

I've seen this question about splitting a string and keeping the separators using re.split, and this question about splitting on multiple delimiters. But my use case is more complicated, and neither question seems to cover it. I tried the following to at least split on spaces and colons:

print(re.split('(:)|(?: )', splitstr))

But even that doesn't work. When splitstr is "foo bar foo:bar", it returns this:

['foo', None, 'bar', None, 'foo', ':', 'bar']

Any idea how this could be done in Python?

EDIT: My requirements weren't clear. I want "foo bar " (with any number of spaces at the end) to return the list ["foo", "bar", ""], with just one empty string at the end of the list.

Answer 1

There is no need to use regular expressions for this task. String methods work just as well, and might be more readable.

def split_comp(s: str) -> 'list[str]':
    trailing = s.endswith(' ')
    s = s.replace(':', ' : ')  # insert split marks before/after every colon
    parts = s.split()
    return parts if not trailing else [*parts, '']  # one empty string for trailing spaces

This technique can be used for any set of delimiters: pick one delimiter to split on, then replace the delimiters you want removed with it, and pad the delimiters you want kept with it.
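As a rough illustration of that generalisation, here is a minimal sketch. The function name `split_keeping` and its `keep`/`drop` parameters are made up for this example, and it assumes single-character delimiters that are not themselves whitespace. Note that `str.split()` with no arguments consumes any whitespace, so tabs and newlines in the input are dropped as well.

```python
def split_keeping(s, keep=(':',), drop=(' ',)):
    """Split s; delimiters in `keep` become their own items, those in `drop` vanish.

    Hypothetical helper: assumes single-character, non-whitespace `keep` delimiters.
    """
    # Remember whether the input ended in a dropped delimiter.
    trailing = s.endswith(tuple(drop))
    # Turn every dropped delimiter into the one delimiter we actually split on.
    for d in drop:
        s = s.replace(d, ' ')
    # Pad every kept delimiter so it becomes a separate item.
    for k in keep:
        s = s.replace(k, f' {k} ')
    parts = s.split()
    return [*parts, ''] if trailing else parts


print(split_keeping("foo bar foo:bar:baz"))
print(split_keeping("a,b;c=d", keep=('=',), drop=(',', ';')))
```

With the default arguments this behaves like split_comp above; the second call drops commas and semicolons while keeping the equals sign.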

Answer 2

You can use a re.findall approach here with:

[^:\s]+|:|(?<=\S)(?=\s+$)

See the regex demo. Details:

[^:\s]+ - one or more characters other than whitespace and :
| - or
: - a colon
| - or
(?<=\S)(?=\s+$) - an empty string located between a non-whitespace character and one or more whitespace characters at the end of the string.

See the Python demo.

import re
l = ['foo bar', 'foo bar foo:bar', 'foo bar ', 'foo     bar     ']
rx = re.compile(r'[^:\s]+|:|(?<=\S)(?=\s+$)')
for s in l:
    if s.rstrip() != s:
        s = s.rstrip() + " "
    print(f"'{s}'", '=>', rx.findall(s))

Output:

'foo bar' => ['foo', 'bar']
'foo bar foo:bar' => ['foo', 'bar', 'foo', ':', 'bar']
'foo bar ' => ['foo', 'bar', '']
'foo     bar ' => ['foo', 'bar', '']
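If you prefer a reusable function, the pattern can be wrapped as in the sketch below. The name `comp_split` is hypothetical. As far as I can tell the rstrip normalisation in the demo is optional, because `\s+$` already matches a whole run of trailing whitespace, so the sketch leaves it out. One difference from the string-method answer: a string made only of spaces yields an empty list here, since the lookbehind needs a non-whitespace character.

```python
import re

# Tokens: a run of non-colon, non-whitespace characters; a single colon;
# or the empty string just before trailing whitespace.
_TOKEN_RX = re.compile(r'[^:\s]+|:|(?<=\S)(?=\s+$)')


def comp_split(s):
    """Return the tokens of s (hypothetical wrapper around the pattern above)."""
    return _TOKEN_RX.findall(s)


for sample in ['foo bar', 'foo bar foo:bar:baz', 'foo     bar     ']:
    print(repr(sample), '=>', comp_split(sample))
```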

Answer 3

Maybe there are shorter ways, but here is my suggestion:

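def func(s):
    # A trailing space contributes exactly one empty string to the result.
    # endswith() also copes with an empty input, where s[-1] would raise IndexError.
    if s.endswith(' '):
        l = s.split() + ['']
    else:
        l = s.split()

    def f(l):
        # Split every item that contains a colon at its first colon.
        res = []
        for i in l:
            if i != ':' and ':' in i:
                pos = i.find(':')
                res.extend([i[:pos], ':', i[pos + 1:]])
            else:
                res.append(i)
        return res

    # Repeat until every colon is an item of its own.
    while any(i != ':' and ':' in i for i in l):
        l = f(l)
    return l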

Examples:

>>> func("foo bar")
['foo', 'bar']

>>> func("foo bar foo:bar")
['foo', 'bar', 'foo', ':', 'bar']

>>> func("foo bar ")
['foo', 'bar', '']

>>> func("foo bar      ")
['foo', 'bar', '']
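One of the shorter ways might look like the sketch below, which peels off one colon at a time with str.partition instead of re-scanning the whole list. The name `func_short` is made up for this example. It gives the same results for the inputs shown above, but it is not an exact drop-in: for a word with a leading, trailing, or doubled colon (such as "a::b") it skips the empty strings that func would produce between the colons.

```python
def func_short(s):
    out = []
    for word in s.split():
        # Peel off everything up to each colon, keeping the colon as its own item.
        while ':' in word:
            head, _, word = word.partition(':')
            if head:
                out.append(head)
            out.append(':')
        if word:
            out.append(word)
    # One empty string if the input ends with a space.
    if s.endswith(' '):
        out.append('')
    return out


print(func_short("foo bar foo:bar:baz"))
print(func_short("foo bar      "))
```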
