
Is there a generator version of `string.split()` in Python?

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

It is highly probable that re.finditer uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1 GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in noticeable memory growth (that is, if memory did grow, it was far, far less than the 1 GB string).
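That constant-memory claim can be spot-checked with tracemalloc; here is a rough sketch of such a test (a 200,000-word string stands in for the 1 GB one, and the exact peak figure will vary by interpreter):

```python
import re
import tracemalloc

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

big = "word " * 200_000  # stand-in for the 1 GB string

tracemalloc.start()
count = 0
for token in split_iter(big):  # a for loop, NOT list(), so nothing accumulates
    count += 1
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(count)            # 200000 tokens
print(peak < len(big))  # peak allocation during the loop stays well below the string size
```

Since tracemalloc is started after `big` is built, `peak` measures only the transient match objects and small substrings created per iteration.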

More general version:

In reply to a comment, "I fail to see the connection with str.split", here is a more general version:

def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
    # alternatively, more verbosely (note: these yields cannot live in the same
    # function as the returns above, or the whole function becomes a generator):
    # regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
    # for match in re.finditer(regex, string):
    #     fragment = match.group(1)
    #     yield fragment

The idea is that ((?!pat).)* 'negates' a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state machine). In pseudocode: repeatedly consume (begin-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit end of string).
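The 'negated group' idiom can be seen in isolation with a minimal check:

```python
import re

# ((?!ab).)* consumes one character at a time, but only while the lookahead
# (?!ab) confirms that `ab` does not start at the current position
print(re.match(r"((?!ab).)*", "xxxabyy").group(0))  # 'xxx' -- stops where 'ab' would begin
```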

Demo:

>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>

>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']

>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']

>>> list(splitStr('.......A...b...c....', r'\.\.\.'))
['', '', '.A', 'b', 'c', '.']

>>> list(splitStr('   A  b  c. '))
['', 'A', 'b', 'c.', '']

(One should note that str.split has an ugly behavior: it special-cases sep=None as first doing str.strip to remove leading and trailing whitespace. The above purposefully does not do that; see the last example, where sep="\\s+".)

(I ran into various bugs (including an internal re.error) when trying to implement this... Negative lookbehind will restrict you to fixed-length delimiters, so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge cases (e.g. r'(.*?)($|,)' on ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''] with an extraneous empty string at the end); one can look at the edit history for another seemingly-correct regex that actually has subtle bugs.)

(If you want to implement this yourself for higher performance (although regexes are heavyweight, they most importantly run in C), you'd write some code (with ctypes? not sure how to get generators working with it?), with the following pseudocode for fixed-length delimiters: hash your delimiter of length L. Keep a running hash of length L as you scan the string, using a rolling-hash algorithm with O(1) update time. Whenever the hash might equal your delimiter, manually check whether the past few characters were the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook algorithm for O(N) text search. Multiprocessing versions are also possible. They might seem overkill, but the question implies one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, etc.)
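For what it's worth, the fixed-length-delimiter pseudocode above can be sketched in pure Python (the helper name `rabin_karp_split` is ours; this only illustrates the rolling-hash idea and is not a speedup, since a character-level Python loop is far slower than str.find, which already runs in C):

```python
def rabin_karp_split(text, delim, base=257, mod=2**61 - 1):
    """Generator split via a rolling (Rabin-Karp) hash, for fixed-length delimiters."""
    L = len(delim)
    if L == 0 or len(text) < L:
        yield text
        return
    # hash of the delimiter and of the first L-character window of the text
    target = window = 0
    for i in range(L):
        target = (target * base + ord(delim[i])) % mod
        window = (window * base + ord(text[i])) % mod
    high = pow(base, L - 1, mod)  # weight of the character leaving the window

    start = 0  # start of the current fragment
    i = 0      # left edge of the current window
    while True:
        # candidate match: verify by direct comparison to guard against hash collisions
        if window == target and text[i:i+L] == delim:
            yield text[start:i]
            start = i + L
            i += L
            if i + L > len(text):
                break
            window = 0  # rebuild the window just past the delimiter
            for j in range(i, i + L):
                window = (window * base + ord(text[j])) % mod
            continue
        if i + L >= len(text):
            break
        # roll the window one character to the right in O(1)
        window = (window - ord(text[i]) * high) % mod
        window = (window * base + ord(text[i + L])) % mod
        i += 1
    yield text[start:]
```

Usage matches str.split for fixed delimiters, e.g. `list(rabin_karp_split("a,b,,c", ","))` gives `['a', 'b', '', 'c']`.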

The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids lots of memory use, and avoids relying on the overhead of a regexp when it's not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used like you want...

>>> list(isplit("abcb", "b"))
['a', 'c', '']

While there is a little bit of cost seeking within the string each time find() or slicing is performed, this should be minimal, since strings are represented as contiguous arrays in memory.

Did some performance testing on the various methods proposed (I won't repeat them here). Some results:

  • str.split (default) = 0.3461570239996945
  • manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
  • re.finditer (ninjagecko's answer) = 0.698872097000276
  • str.find (one of Eli Collins's answers) = 0.7230395330007013
  • itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
  • str.split(..., maxsplit=1) recursion = N/A†

†The recursion answers (string.split with maxsplit=1) fail to complete in a reasonable time. Given string.split's speed, they may work better on shorter strings, but then I can't see the use case for short strings, where memory isn't an issue anyway.

Tested using timeit on:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

This raises another question as to why string.split is so much faster despite its memory usage.

This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace if no separator chars are given.

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function:


str_split(s, *delims, empty=None)

Split the string s by the rest of the arguments, possibly omitting empty parts (the empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by the longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    # validate before joining, so multi-character delimiters are caught
    # even when fewer than 4 delimiters are supplied
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both versions 2 and 3. The first lines of the function should be changed to:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

I wrote a version of @ninjagecko's answer that behaves more like string.split (ie whitespace delimited by default and you can specify a delimiter).我写了一个@ninjagecko 答案的版本,它的行为更像 string.split(即默认情况下用空格分隔,您可以指定一个分隔符)。

import re

def isplit(string, delimiter=None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both python 3 and python 2):以下是我使用的测试(在 python 3 和 python 2 中):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

Python's regex module says that it does "the right thing" for Unicode whitespace, but I haven't actually tested it.

Also available as a gist.

If you would also like to be able to read an iterator (as well as return one), try this:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

Usage

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
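One point worth noting about this groupby version: because it consumes its input item by item, it accepts any character iterator, not just a string; on the other hand, unlike str.split(','), it silently drops empty fields between consecutive separators. A sketch (definition repeated so the snippet runs standalone):

```python
import itertools as it

def iter_split(iterable, sep=None):
    # definition repeated from the answer above
    sep = sep or ' '
    groups = it.groupby(iterable, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

# the input can be any character iterator, e.g. a lazily-read stream
stream = iter("a,b,,c")
print(list(iter_split(stream, sep=',')))  # ['a', 'b', 'c'] -- the empty field is dropped
```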

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
      r.append(c)
    else:
      if r:
        yield ''.join(r)
        continue
      else:
        return  # raising StopIteration inside a generator breaks under PEP 479 (Python 3.7+)

I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over, so you're not going to save any memory by having a generator.

If you wanted to write one, it would be fairly easy though:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)
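A quick check of gsplit (definition repeated so the snippet runs standalone); note that, like str.split with no arguments, it drops empty fields:

```python
import string

def gsplit(s, sep=string.whitespace):
    # definition repeated from the answer above
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    if word:
        yield "".join(word)

print(list(gsplit("  a b\tc ")))        # ['a', 'b', 'c']
print(list(gsplit("a,b,,c", sep=","))) # ['a', 'b', 'c'] -- empty fields are dropped
```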

more_itertools.split_at offers an analog to str.split for iterators.

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools is a third-party package.

I wanted to show how to use the finditer solution to return a generator for given delimiters, and then use the pairwise recipe from itertools to build a previous/next iteration, which will get the actual words as in the original split method.


from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

Note:

  1. I use prev & curr instead of prev & next, because overriding next in Python is a very bad idea
  2. This is quite efficient

Dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]  # skip the whole separator, not just one character

Very old question, but here is my humble contribution, with an efficient algorithm:

from typing import Iterable

def str_split(text: str, separator: str) -> Iterable[str]:
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        i = j + len(separator)  # advance past the whole separator, not just one character

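A quick sanity check of this approach (definition repeated so the snippet is self-contained; advancing i by len(separator) rather than 1 lets multi-character separators work as well):

```python
from typing import Iterable

def str_split(text: str, separator: str) -> Iterable[str]:
    # repeated from the answer above
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        i = j + len(separator)

print(list(str_split("a,b,,c,", ",")))  # ['a', 'b', '', 'c', '']
print(list(str_split("][a][b", "][")))  # ['', 'a', 'b']
```
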
def split_generator(f, s):
    """
    f is a string, s is the single character we split on.
    (The comparison f[i] != s only works for one-character separators.)
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i + 1
        else:
            yield f[j:i]  # was `yield [f[j:i]]`, which wrongly wrapped each part in a list
            j = i + 1
            i = i + 1

Here is a simple response:

def gen_str(some_string, sep):
    j = 0
    guard = len(some_string) - 1
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
        elif i != guard:
            continue
        else:
            yield some_string[j:]

def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')

    if maxsplit == 0 or not text:
        yield text
        return

    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))
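A standalone check of this last variant (definition repeated; one caveat worth stating: re.split builds the whole result list internally before yield from iterates it, so this version is generator-shaped but not constant-memory):

```python
import re

def isplit(text, sep=None, maxsplit=-1):
    # repeated from the answer above for a self-contained check
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')
    if maxsplit == 0 or not text:
        yield text
        return
    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    # note: re.split materialises the full list before we re-yield it lazily
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))

print(list(isplit("a,b,,c", ",")))  # ['a', 'b', '', 'c']
print(list(isplit(" a  b ")))       # ['', 'a', 'b', ''] -- unlike str.split(), edge fields survive
```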
