
split a generator/iterable every n items in python (splitEvery)

I'm trying to write the Haskell function 'splitEvery' in Python. Here is its definition:

splitEvery :: Int -> [e] -> [[e]]
    @'splitEvery' n@ splits a list into length-n pieces.  The last
    piece will be shorter if @n@ does not evenly divide the length of
    the list.

The basic version of this works fine, but I want a version that works with generator expressions, lists, and iterators. And, if a generator is passed in as input, it should return a generator as output!

Tests

# should not enter an infinite loop with generators or lists
splitEvery(10, itertools.count())
splitEvery(10, range(1000))

# last piece must be shorter if n does not evenly divide
assert splitEvery(5, range(9)) == [[0, 1, 2, 3, 4], [5, 6, 7, 8]]

# should give same correct results with generators
tmp = itertools.islice(itertools.count(), 10)
assert list(splitEvery(5, tmp)) == [[0, 1, 2, 3, 4], [5, 6, 7, 8]]

Current Implementation

Here is the code I currently have, but it doesn't work with a simple list:

import itertools

def splitEvery_1(n, iterable):
    res = list(itertools.islice(iterable, n))
    while len(res) != 0:
        yield res
        # for a plain list, islice restarts from the front on every call,
        # so res never becomes empty and this loops forever
        res = list(itertools.islice(iterable, n))
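To see why the list case misbehaves, here is a minimal illustration (my example, not from the question): islice builds a fresh iterator over a list on every call, whereas over an iterator it keeps consuming from where the previous slice stopped.

import itertools

nums = [0, 1, 2, 3, 4]
# islice over the list restarts from the beginning each time:
print(list(itertools.islice(nums, 3)))  # [0, 1, 2]
print(list(itertools.islice(nums, 3)))  # [0, 1, 2] again -> the loop above never ends

# islice over an iterator continues where the last slice stopped:
it = iter(nums)
print(list(itertools.islice(it, 3)))    # [0, 1, 2]
print(list(itertools.islice(it, 3)))    # [3, 4]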

This one doesn't work with a generator expression (thanks to jellybean for fixing it):

def splitEvery_2(n, iterable): 
    return [iterable[i:i+n] for i in range(0, len(iterable), n)]

There has to be a simple piece of code that does the splitting. I know I could just have different functions, but it seems like it should be an easy thing to do. I'm probably getting stuck on an unimportant problem, but it's really bugging me.


It is similar to grouper from http://docs.python.org/library/itertools.html#itertools.groupby but I don't want it to pad with extra fill values.

from itertools import izip_longest  # zip_longest in Python 3

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
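A quick illustration of the padding problem (my example, using the Python 3 spelling zip_longest):

from itertools import zip_longest

args = [iter(range(9))] * 5
print(list(zip_longest(fillvalue=None, *args)))
# [(0, 1, 2, 3, 4), (5, 6, 7, 8, None)]  <- the None padding is what I don't want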

It does mention a method that truncates the last value. This isn't what I want either.

The left-to-right evaluation order of the iterables is guaranteed. This makes possible an idiom for clustering a data series into n-length groups using izip(*[iter(s)]*n).

list(izip(*[iter(range(9))]*5)) == [(0, 1, 2, 3, 4)]
# but it should be [[0, 1, 2, 3, 4], [5, 6, 7, 8]] -- the last, shorter piece is dropped
One clean solution: keep pulling lists of up to n items off a single iterator, and stop when a pull comes back empty.

from itertools import islice

def split_every(n, iterable):
    i = iter(iterable)
    piece = list(islice(i, n))
    while piece:
        yield piece
        piece = list(islice(i, n))

Some tests:

>>> list(split_every(5, range(9)))
[[0, 1, 2, 3, 4], [5, 6, 7, 8]]

>>> list(split_every(3, (x**2 for x in range(20))))
[[0, 1, 4], [9, 16, 25], [36, 49, 64], [81, 100, 121], [144, 169, 196], [225, 256, 289], [324, 361]]

>>> [''.join(s) for s in split_every(6, 'Hello world')]
['Hello ', 'world']

>>> list(split_every(100, []))
[]

Here's a quick one-liner version. Like Haskell's, it is lazy.

from itertools import islice, takewhile, repeat
split_every = (lambda n, it:
    takewhile(bool, (list(islice(it, n)) for _ in repeat(None))))

This requires that you call iter on the input before passing it to split_every.

Example:

list(split_every(5, iter(xrange(9))))
[[0, 1, 2, 3, 4], [5, 6, 7, 8]]
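(xrange is Python 2 only; for reference, the equivalent call in Python 3 would be:)

list(split_every(5, iter(range(9))))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8]]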

Although not a one-liner, the version below doesn't require that you call iter first, which can be a common pitfall:

from itertools import islice, takewhile, repeat

def split_every(n, iterable):
    """
    Slice an iterable into chunks of n elements
    :type n: int
    :type iterable: Iterable
    :rtype: Iterator
    """
    iterator = iter(iterable)
    return takewhile(bool, (list(islice(iterator, n)) for _ in repeat(None)))

(Thanks to @eli-korvigo for improvements.)
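A quick sanity check of this version (my example):

>>> list(split_every(3, 'abcdefgh'))
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h']]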

Building off of the accepted answer and employing a lesser-known use of iter (when passed a second argument, it keeps calling the first argument until it returns the sentinel value), you can do this really easily:

Python 3:

from itertools import islice

def split_every(n, iterable):
    iterable = iter(iterable)
    yield from iter(lambda: list(islice(iterable, n)), [])
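A quick check of the Python 3 version (my example): the two-argument iter stops as soon as islice hands back an empty list, the sentinel.

>>> list(split_every(4, range(10)))
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]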

Python 2:

from itertools import islice

def split_every(n, iterable):
    iterable = iter(iterable)
    for chunk in iter(lambda: list(islice(iterable, n)), []):
        yield chunk

more_itertools has a chunked function:

import more_itertools as mit


list(mit.chunked(range(9), 5))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8]]
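chunked is itself lazy, so (my example) it also handles infinite iterators:

import itertools
import more_itertools as mit

gen = mit.chunked(itertools.count(), 5)
print(next(gen))  # [0, 1, 2, 3, 4]
print(next(gen))  # [5, 6, 7, 8, 9]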

I came across this while trying to chop up batches too, but doing it on a generator fed from a stream, so most of the solutions here aren't applicable or don't work in Python 3.

For people still stumbling upon this, here's a general solution using itertools:

from itertools import islice, chain

def iter_in_slices(iterator, size=None):
    while True:
        slice_iter = islice(iterator, size)
        # Peek at the first object; if there is none, the iterator is exhausted.
        # (Catching StopIteration explicitly matters in Python 3.7+, where a bare
        # StopIteration inside a generator becomes a RuntimeError, per PEP 479.)
        try:
            peek = next(slice_iter)
        except StopIteration:
            return
        # Put the first object back and yield the slice
        yield chain([peek], slice_iter)
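Each yielded slice shares the underlying iterator, so consume one slice fully before requesting the next. For example (mine), on an infinite stream:

from itertools import count

for s in iter_in_slices(count(), 4):
    print(list(s))  # [0, 1, 2, 3]
    break           # infinite input, so stop after the first slice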

A one-liner, inlineable solution (supports v2/v3 and iterators, uses only the standard library and a single generator comprehension):

import itertools

def split_groups(iter_in, group_size):
    return ((x for _, x in item)
            for _, item in itertools.groupby(enumerate(iter_in),
                                             key=lambda x: x[0] // group_size))
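Consumed in order (each inner generator exhausted before the next group is requested), it gives the expected result (my example):

>>> [list(g) for g in split_groups(range(9), 5)]
[[0, 1, 2, 3, 4], [5, 6, 7, 8]]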

I think these questions are almost the same.

Changing it a little bit to crop the last piece, I think a good solution for the generator case would be:

import itertools

def iter_grouper(n, iterable):
    it = iter(iterable)
    # note: islice objects are always truthy, so each piece must be
    # materialized (e.g. with list) for the emptiness test to terminate
    item = list(itertools.islice(it, n))
    while item:
        yield item
        item = list(itertools.islice(it, n))

For objects that support slicing (lists, strings, tuples), we can do:

def slice_grouper(n, sequence):
    return [sequence[i:i+n] for i in range(0, len(sequence), n)]

Now it's just a matter of dispatching to the correct method:

def grouper(n, iter_or_seq):
    # __getslice__ exists only in Python 2; in Python 3 you would test for
    # something like isinstance(iter_or_seq, collections.abc.Sequence)
    if hasattr(iter_or_seq, "__getslice__"):
        return slice_grouper(n, iter_or_seq)
    elif hasattr(iter_or_seq, "__iter__"):
        return iter_grouper(n, iter_or_seq)

I think you could polish it a little bit more :-)

Why not do it like this? It looks almost like your splitEvery_2 function.

def splitEveryN(n, it):
    return [it[i:i+n] for i in range(0, len(it), n)]

Actually it only takes away the unnecessary step interval from the slice in your solution. :)
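Note (my observation, not from the answer): like splitEvery_2, this relies on len() and slicing, so it only works for sequences, not generators:

splitEveryN(4, list(range(10)))         # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
splitEveryN(4, (x for x in range(10)))  # TypeError: object of type 'generator' has no len()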

This is an answer that works for both lists and generators:

from itertools import count, groupby
def split_every(size, iterable):
    c = count()
    for k, g in groupby(iterable, lambda x: next(c)//size):
        yield list(g) # or yield g if you want to output a generator
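Checking it against the question's test cases (my example):

>>> list(split_every(5, range(9)))
[[0, 1, 2, 3, 4], [5, 6, 7, 8]]
>>> list(split_every(3, (x**2 for x in range(7))))
[[0, 1, 4], [9, 16, 25], [36]]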

Here is how you deal with list vs. iterator:

def is_list(L):  # implement it somehow, e.g.:
    return isinstance(L, list)

# materialize only when the input was a list; otherwise keep the generator
return (lambda x: x, list)[int(is_list(L))](result)

A fully lazy solution for input/output of generators, including some checking.

def chunks(items, binsize):
    # single-element lists act as mutable cells shared with the inner generator
    consumed = [0]
    sent = [0]
    it = iter(items)

    def g():
        c = 0
        while c < binsize:
            try:
                val = next(it)
            except StopIteration:
                sent[0] = None  # signal exhaustion to the outer loop
                return
            consumed[0] += 1
            yield val
            c += 1

    while consumed[0] <= sent[0]:
        if consumed[0] < sent[0]:
            raise Exception("Cannot traverse a chunk before the previous is consumed.", consumed[0], sent[0])
        yield g()
        if sent[0] is None:
            return
        sent[0] += binsize

from time import sleep

def g():
    for item in [1, 2, 3, 4, 5, 6, 7]:
        sleep(1)
        print(f"accessed:{item}→\t", end="")
        yield item


for chunk in chunks(g(), 3):
    for x in chunk:
        print(f"x:{x}\t\t\t", end="")
    print()

"""
Output:

accessed:1→ x:1         accessed:2→ x:2         accessed:3→ x:3         
accessed:4→ x:4         accessed:5→ x:5         accessed:6→ x:6         
accessed:7→ x:7 
"""
Another take: a plain loop that collects each chunk with try/except around next():

def chunks(iterable, n):
    """assumes n is an integer > 0"""
    iterable = iter(iterable)
    while True:
        result = []
        for i in range(n):
            try:
                a = next(iterable)
            except StopIteration:
                break
            else:
                result.append(a)
        if result:
            yield result
        else:
            break

g1 = (i*i for i in range(10))
g2 = chunks(g1, 3)
print(g2)
# <generator object chunks at 0x0337B9B8>
print(list(g2))
# [[0, 1, 4], [9, 16, 25], [36, 49, 64], [81]]

This will do the trick:

from itertools import izip_longest  # zip_longest in Python 3
izip_longest(it[::2], it[1::2])

where *it* is some sequence (the slicing requires it; a bare iterator won't work, and as written this only splits into pairs).


Example:

izip_longest('abcdef'[::2], 'abcdef'[1::2]) -> ('a', 'b'), ('c', 'd'), ('e', 'f')

Let's break this down:

'abcdef'[::2] -> 'ace'
'abcdef'[1::2] -> 'bdf'

As you can see, the last number in the slice specifies the interval that will be used to pick up items. You can read more about using extended slices here.

The zip function takes the first item from the first iterable and combines it with the first item of the second iterable. It then does the same thing for the second and third items, until one of the iterables runs out of values.

The result is an iterator. If you want a list, use the list() function on the result.
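For what it's worth, the same slicing idea extends to n-item groups. This is a hypothetical generalization (the helper name split_every_seq is mine, not from the answer), and note that zip_longest still pads the final group, which is exactly what the question wanted to avoid:

from itertools import zip_longest  # izip_longest in Python 2

def split_every_seq(n, seq):  # hypothetical helper; sequences only
    return list(zip_longest(*[seq[i::n] for i in range(n)]))

print(split_every_seq(2, 'abcdef'))   # [('a', 'b'), ('c', 'd'), ('e', 'f')]
print(split_every_seq(2, 'abcdefg'))  # [..., ('g', None)] -- padded, not shortened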

If you want a solution that

  • uses generators only (no intermediate lists or tuples),
  • works for very long (or infinite) iterators,
  • works for very large batch sizes,

this does the trick:

def one_batch(first_value, iterator, batch_size):
    yield first_value
    for i in xrange(1, batch_size):
        yield iterator.next()

def batch_iterator(iterator, batch_size):
    iterator = iter(iterator)
    while True:
        first_value = iterator.next()  # Peek.
        yield one_batch(first_value, iterator, batch_size)

It works by peeking at the next value in the iterator and passing that as the first value to a generator (one_batch()) that will yield it, along with the rest of the batch.

The peek step will raise StopIteration exactly when the input iterator is exhausted and there are no more batches. Since this is the correct time to raise StopIteration in the batch_iterator() method, there is no need to catch the exception.

This will process lines from stdin in batches:

for input_batch in batch_iterator(sys.stdin, 10000):
    for line in input_batch:
        process(line)
    finalise()

I've found this useful for processing lots of data and uploading the results in batches to an external store.
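A sketch of a Python 3 port of the same idea (mine, not from the answer: xrange and .next() are Python 2 only, and under PEP 479 the bare StopIteration inside a generator must be converted to a return):

def one_batch(first_value, iterator, batch_size):
    yield first_value
    for _ in range(1, batch_size):
        try:
            yield next(iterator)
        except StopIteration:
            return  # partial final batch

def batch_iterator(iterator, batch_size):
    iterator = iter(iterator)
    while True:
        try:
            first_value = next(iterator)  # peek
        except StopIteration:
            return  # no more batches
        yield one_batch(first_value, iterator, batch_size)

# each batch must be consumed before requesting the next:
for batch in batch_iterator(range(9), 4):
    print(list(batch))  # [0, 1, 2, 3] / [4, 5, 6, 7] / [8]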
