
Split a string with multiple delimiters, and keep *some* of the delimiters, but not all

I have a string that can look something like this:

1. "foo bar"
2. "foo bar foo:bar"
3. "foo bar "
4. "foo bar      "
5. "foo bar foo:bar:baz"

I want to split this string so that it would end up with the following results:

1. ['foo', 'bar']
2. ['foo', 'bar', 'foo', ':', 'bar']
3. / 4. ['foo', 'bar', '']
5. ['foo', 'bar', 'foo', ':', 'bar', ':', 'baz']

In other words, following these rules:

  1. Split the string on every occurrence of a space.

    a. If there are one or more spaces at the end of the string, add one empty string to the split list.

    b. Any spaces before the last non-space character in the string should be consumed, and not added to the split list.

  2. Split the string on every occurrence of a colon, and do not consume the colon.

The XY problem is this, in case it's relevant:

I want to mimic Bash tab-completion behaviour. When you type a command into a Bash interpreter, it will split the command into an array COMP_WORDS, and it will follow the above rules: splitting the words based on spaces and colons, with colons placed into their own array element, and spaces ignored unless they're at the end of the string. I want to recreate this behaviour in Python, given a string that looks like a command that a user would type.

I've seen this question about splitting a string and keeping the separators using re.split. And this question about splitting using multiple delimiters. But my use case is more complicated, and neither question seems to cover it. I tried the following to at least split on spaces and colons:

print(re.split('(:)|(?: )', splitstr))

But even that doesn't work. When splitstr is "foo bar foo:bar", it returns this:

['foo', None, 'bar', None, 'foo', ':', 'bar']

Any idea how this could be done in Python?

EDIT: My requirements weren't clear. I would want "foo bar " (with any number of spaces at the end) to return the list ["foo", "bar", ""] (with just one empty string at the end of the list).

There is no need to use regular expressions for this task. String methods work just as well, and might be more readable.

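def split_comp(s: str) -> 'list[str]':
    trailing = s.endswith(' ')
    s = s.replace(':', ' : ')  # insert split marks before/after every colon
    parts = s.split()  # splits on runs of whitespace, so repeated spaces collapse
    # trailing spaces contribute exactly one empty string
    return parts if not trailing else [*parts, '']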

This technique can be used for any delimiters: pick one delimiter to split on, then replace the delimiters you want to remove with it, and pad the delimiters you want to keep with it.
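As a rough illustration of that generalization, here is a minimal sketch. The function name `split_keep_some` and its `keep`/`drop` parameters are made up for this example, and it assumes every delimiter is a single non-whitespace character or a plain space. Whitespace is always treated as a dropped delimiter because `str.split()` is the final splitting step, and a trailing dropped delimiter is assumed to produce one empty string, mirroring the rule for trailing spaces.

```python
def split_keep_some(s, keep=(":",), drop=(" ",)):
    """Split s on every delimiter; emit those in `keep`, discard those in `drop`.

    Hypothetical helper: assumes single-character delimiters.
    """
    # Assumption: a trailing dropped delimiter yields one empty final element.
    trailing = s.endswith(tuple(drop))
    for d in drop:
        s = s.replace(d, " ")       # dropped delimiters become plain split points
    for k in keep:
        s = s.replace(k, f" {k} ")  # kept delimiters are padded so they survive as items
    parts = s.split()
    return [*parts, ""] if trailing else parts


print(split_keep_some("foo bar foo:bar:baz"))
# ['foo', 'bar', 'foo', ':', 'bar', ':', 'baz']
print(split_keep_some("a,b;c", keep=(";",), drop=(",",)))
# ['a', 'b', ';', 'c']
```

With the default arguments this should behave like `split_comp` for the inputs in the question.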

You can use a re.findall approach here with:

[^:\s]+|:|(?<=\S)(?=\s+$)

See the regex demo. Details:

  • [^:\s]+ - one or more chars other than whitespace and :
  • | - or
  • : - a colon
  • | - or
  • (?<=\S)(?=\s+$) - an empty string located between a non-whitespace char and one or more whitespaces at the end of the string.

See the Python demo.

import re
l = ['foo bar', 'foo bar foo:bar', 'foo bar ', 'foo     bar     ']
rx = re.compile(r'[^:\s]+|:|(?<=\S)(?=\s+$)')
for s in l:
    if s.rstrip() != s:
        s = s.rstrip() + " "
    print(f"'{s}'", '=>', rx.findall(s))

Output:

'foo bar' => ['foo', 'bar']
'foo bar foo:bar' => ['foo', 'bar', 'foo', ':', 'bar']
'foo bar ' => ['foo', 'bar', '']
'foo     bar ' => ['foo', 'bar', '']
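The demo above does not include the fifth input from the question, so here is a small sketch that wraps the same pattern in a function and runs it on all five. The name `split_comp_words` is only illustrative, and the `rstrip()` normalization is left out because the lookbehind already restricts the empty match to the single position right after the last non-whitespace character.

```python
import re

# Same pattern as above: a word, a colon, or the empty position just
# before a run of trailing whitespace.
TOKEN_RX = re.compile(r'[^:\s]+|:|(?<=\S)(?=\s+$)')


def split_comp_words(s):
    """Illustrative wrapper around the findall approach."""
    return TOKEN_RX.findall(s)


for s in ["foo bar", "foo bar foo:bar", "foo bar ", "foo bar      ",
          "foo bar foo:bar:baz"]:
    print(repr(s), "=>", split_comp_words(s))
# The last line printed is:
# 'foo bar foo:bar:baz' => ['foo', 'bar', 'foo', ':', 'bar', ':', 'baz']
```

One caveat: for a string consisting only of whitespace this returns an empty list, because the lookbehind requires a preceding non-whitespace character.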

Maybe there are shorter ways, but here is my suggestion:

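def func(s):
    # A trailing space yields exactly one empty string at the end.
    # endswith() also handles the empty string, where s[-1] would raise IndexError.
    if s.endswith(' '):
        l = s.split() + ['']
    else:
        l = s.split()

    def f(l):
        # Split each item at its first colon, keeping the colon as its own item.
        res = []
        for i in l:
            if i != ':' and ':' in i:
                pos = i.find(':')
                res.extend([i[:pos], ':', i[pos + 1:]])
            else:
                res.append(i)
        return res

    # Repeat until no item other than ':' itself contains a colon.
    while any(i != ':' and ':' in i for i in l):
        l = f(l)
    return l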

Examples:

>>> func("foo bar")
['foo', 'bar']

>>> func("foo bar foo:bar")
['foo', 'bar', 'foo', ':', 'bar']

>>> func("foo bar ")
['foo', 'bar', '']

>>> func("foo bar      ")
['foo', 'bar', '']
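For comparison, one possible shorter variant, offered only as a sketch: split on whitespace first, then split each word on colons with a capturing group so that `re.split` returns the colons as separate items. The name `func_short` is made up here. Unlike `func`, it discards the empty fragments that appear around a leading, trailing, or doubled colon, which may or may not be what you want.

```python
import re


def func_short(s):
    """Hypothetical compact variant of func()."""
    words = []
    for word in s.split():
        # The capturing group makes re.split keep each ':' as its own item;
        # empty fragments (e.g. from '::' or a leading ':') are filtered out.
        words.extend(p for p in re.split(r'(:)', word) if p)
    if s.endswith(' '):
        words.append('')  # one empty string for any number of trailing spaces
    return words


print(func_short("foo bar foo:bar:baz"))
# ['foo', 'bar', 'foo', ':', 'bar', ':', 'baz']
print(func_short("foo bar      "))
# ['foo', 'bar', '']
```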
