
Split a Python string on multiple string delimiters efficiently

Suppose I have a string such as "Let's split this string into many small ones" and I want to split it on the words this, into and ones

such that the output looks something like this:

["Let's split", "this string", "into many small", "ones"]

What is the most efficient way to do it?

With a lookahead.

>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']

By using re.split():

>>> re.split(r'(this|into|ones)', "Let's split this string into many small ones")
["Let's split ", 'this', ' string ', 'into', ' many small ', 'ones', '']

By putting the words to split on in a capturing group, the output includes the words we split on.
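To make the effect of the capturing group concrete, here is a small sketch that runs the same split with a capturing group and with a non-capturing group. The variable names are only illustrative.

```python
import re

text = "Let's split this string into many small ones"

# Capturing group: the delimiters are kept as separate items in the result.
with_delims = re.split(r'(this|into|ones)', text)

# Non-capturing group: the delimiters are consumed and dropped.
without_delims = re.split(r'(?:this|into|ones)', text)

print(with_delims)
print(without_delims)
```

In both cases the final empty string appears because the text ends with one of the delimiters.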

If you need the spaces removed, use map(str.strip, result) on the re.split() output:

>>> list(map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones")))
["Let's split", 'this', 'string', 'into', 'many small', 'ones', '']

and you could use filter(None, result) to remove any empty strings if need be:

>>> list(filter(None, map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones"))))
["Let's split", 'this', 'string', 'into', 'many small', 'ones']
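On Python 3, map() and filter() return lazy iterators rather than lists, so a list comprehension may be a more direct way to strip and filter in one pass. A minimal sketch, with illustrative variable names:

```python
import re

text = "Let's split this string into many small ones"
pieces = re.split(r'(this|into|ones)', text)

# Strip surrounding whitespace and drop pieces that end up empty.
cleaned = [p.strip() for p in pieces if p.strip()]
print(cleaned)
```

This calls strip() twice per piece, which is unlikely to matter for short strings.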

To split on words but keep them attached to the following group, you need to use a lookahead assertion instead:

>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']

Now we are really splitting on whitespace, but only on whitespace that is followed by a whole word, one in the set of this, into and ones.
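If the delimiter words are not known in advance, the same lookahead pattern could be built from a list. The sketch below assumes a hypothetical helper name, split_before_words, and assumes every delimiter ends in a word character so that the trailing \b behaves as intended.

```python
import re

def split_before_words(text, words):
    # Escape each word so any regex metacharacters are matched literally.
    alternation = "|".join(re.escape(w) for w in words)
    # Split on a single whitespace character followed by a whole keyword.
    pattern = re.compile(r"\s(?=(?:%s)\b)" % alternation)
    return pattern.split(text)

parts = split_before_words(
    "Let's split this string into many small ones",
    ["this", "into", "ones"],
)
print(parts)
```

Because of the \b, a word that merely starts with a keyword (for example "thisbe") does not trigger a split.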

Here's a fairly lazy way to do it:

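import re

def resplit(regex, s):
    # Yield the pieces of s, cutting at the start of every regex match.
    current = None  # None as a slice start means "from the beginning"
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    # Use current, not start: start is undefined when there are no matches.
    yield s[current:]

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
print(list(resplit(regex, s)))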

I don't know for sure if this is the most efficient, but it's pretty clean.

Basically, we just iterate through the matches, taking one piece at a time. The pieces are determined by the index in the string (s) where the regex starts to match. We chop the string up to that point and save that index as the start point of the next slice.
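The slicing can also be seen by collecting the match start indices first and pairing consecutive boundaries. This is only an illustrative restatement of the idea, not a replacement for the generator:

```python
import re

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')

# Index in s where each match begins.
starts = [m.start() for m in regex.finditer(s)]

# Consecutive boundaries, from the beginning of s to its end.
bounds = [0] + starts + [len(s)]
pieces = [s[a:b] for a, b in zip(bounds, bounds[1:])]
print(starts)
print(pieces)
```

Note that each piece keeps its trailing space, since the cut happens at the start of the keyword rather than at the whitespace before it.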


As for performance, ignacio clearly wins this round:

9.1412050724  -- Me
3.09771895409  -- ignacio

Code:

import re
import timeit

def resplit(regex, s):
    current = None
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    # Slice from current so the function also works when nothing matches.
    yield s[current:]


def me(regex, s):
    return list(resplit(regex, s))

def ignacio(regex, s):
    # Split the string that was passed in rather than a hard-coded copy.
    return regex.split(s)

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
regex2 = re.compile(r'\s(?=(?:this|into|ones)\b)')

print(timeit.timeit("me(regex,s)", "from __main__ import me,regex,s"))
print(timeit.timeit("ignacio(regex2,s)", "from __main__ import ignacio,regex2,s"))

