
Parsing an iterable without listifying each chunk

Suppose I want to split a Python iterable without listifying each chunk, similar to itertools.groupby, whose chunks are lazy. But I want to split on a more sophisticated condition than equality of a key. So, more like a parser.

For example, suppose I want to use odd numbers as delimiters in an iterable of integers.例如,假设我想在一个可迭代的整数中使用奇数作为分隔符。 Like more_itertools.split_at(lambda x: x % 2 == 1, xs) .比如more_itertools.split_at(lambda x: x % 2 == 1, xs) (But more_itertools.split_at listifies each chunk.) (但是more_itertools.split_at列出了每个块。)
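For concreteness, this eager sketch (split_at_eager is just an illustrative name, not a library function) has the semantics I want, except that it builds every chunk as a list, which is exactly what I'm trying to avoid:

```python
def split_at_eager(xs, pred):
    # Reference semantics only: each chunk is accumulated into a list
    # before being yielded. The goal is to yield chunks lazily instead.
    chunk = []
    for x in xs:
        if pred(x):         # delimiter: emit the current chunk, drop x
            yield chunk
            chunk = []
        else:
            chunk.append(x)
    yield chunk             # final chunk (possibly empty)

# Split on odd numbers:
print(list(split_at_eager([2, 4, 1, 6, 1, 8], lambda x: x % 2 == 1)))
# -> [[2, 4], [6], [8]]
```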

In parser-combinator terms this might be called sepBy1(odd, many(even)). In Haskell, the Parsec, pipes-parse, and pipes-group libraries address this kind of problem. For instance, it would be sufficient and interesting to write an itertools.groupby-like version of groupsBy' from Pipes.Group (see here).

There could probably be some clever jiu-jitsu with itertools.groupby, perhaps applying itertools.pairwise, then itertools.groupby, and then going back to single elements.
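One such trick, sketched here purely as an illustration (split_at_lazy and the delimiter-as-its-own-group idea are mine, not from any library): give groupby a stateful key function that bumps a group counter at every delimiter, then skip the delimiter groups. The chunks stay as lazy as groupby's own. One caveat: unlike more_itertools.split_at, this never yields an empty chunk between consecutive delimiters.

```python
from itertools import groupby

def split_at_lazy(iterable, pred):
    # Each delimiter gets a fresh (group, True) key, so it never merges
    # with the surrounding chunks; non-delimiters share (group, False).
    group = 0
    def key(x):
        nonlocal group
        if pred(x):
            group += 1              # start a new group after each delimiter
            return (group, True)
        return (group, False)
    for (_, is_delim), chunk in groupby(iterable, key):
        if not is_delim:
            yield chunk             # chunk is groupby's lazy sub-iterator

# Odd numbers as delimiters:
print([list(c) for c in split_at_lazy([2, 4, 1, 6, 8, 1, 10],
                                      lambda x: x % 2 == 1)])
# -> [[2, 4], [6, 8], [10]]
```

Since this is built directly on groupby, it inherits groupby's sharing behaviour: advancing the outer iterator invalidates the current chunk.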

I could write it myself as a generator, I suppose, but the pure-Python version of itertools.groupby (below) is already pretty involved, and not readily generalizable.

It seems like there should be something more general for this, like a relatively painless way of writing parsers and combinators for streams of any type.

# From https://docs.python.org/3/library/itertools.html#itertools.groupby
# groupby() is roughly equivalent to:
class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def __next__(self):
        self.id = object()
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey, self.id))
    def _grouper(self, tgtkey, id):
        while self.id is id and self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)

Here are a couple of simple iterator splitters, which I wrote in a fit of boredom. I don't think they're particularly profound, but perhaps they'll help in some way.

I didn't spend a lot of time thinking about useful interfaces, optimisations, or implementing multiple interacting sub-features. All of that could be added, if desired.

These are basically modelled on itertools.groupby, whose interface could be considered a bit weird. That's a consequence of Python not really being a functional programming language: Python's generators (and other objects which implement the iterator protocol) are stateful, and there is no facility for saving and restoring generator state. So the functions return an iterator which successively generates iterators, which in turn produce values from the original iterator. But the returned iterators share the underlying iterable (the one passed in to the original call), which means that when you advance the outer iterator, any unconsumed values in the current inner iterator are discarded without notice.

There are (fairly expensive) ways to avoid discarding the values, but since the most obvious one, listifying, was ruled out from the start, I just went with the groupby interface despite the awkwardness of accurately documenting the behaviour. It would be possible to wrap the inner iterators with itertools.tee in order to make them independent, but that comes at a price similar to (or possibly slightly greater than) listifying: it still requires each sub-iterator to be fully generated before the next one is started, although it doesn't require a sub-iterator to be fully generated before you start using its values.

For simplicity (according to me :-) ), I implemented these functions as generators rather than objects, as itertools and more_itertools do. The outer generator yields each successive subiterator, then collects and discards any remaining values from it before yielding the next subiterator [Note 1]. I imagine that most of the time the subiterator will be fully exhausted before the outer loop tries to flush it, so the additional call will be a bit wasteful, but it's simpler than the code you cite for itertools.groupby.

It's still necessary for the subiterator to communicate back the fact that the original iterator was exhausted, since that's not something you can ask an iterator about. I use a nonlocal declaration to share that state between the outer and inner generators. In some ways, maintaining state in an object, as itertools.groupby does, might be more flexible, and maybe even considered more Pythonic, but nonlocal worked for me.

I implemented more_itertools.split_at (without the maxsplit and keep_separator options) and what I think is the equivalent of Pipes.Groups.groupBy', renamed split_between to indicate that it splits between two consecutive elements when they satisfy some condition.

Note that split_between always forces the first value from the supplied iterator, before that value has been requested by running the first subiterator. The rest of the values are generated lazily. I tried a few ways to defer that first fetch, but in the end went with this design because it's a lot simpler. The consequence is that split_at, which doesn't do the initial force, always returns at least one subiterator, even if the supplied argument is empty, whereas split_between does not. I'd have to try both of these on some real problem in order to decide which interface I prefer; if you have a preference, by all means express it (but no guarantees about changes).

from collections import deque

def split_at(iterable, pred=lambda x:x is None):
    '''Produces an iterator which returns successive sub-iterations of 
       `iterable`, delimited by values for which `pred` returns
       truthiness. The default predicate returns True only for the
       value None.

       The sub-iterations share the underlying iterable, so they are not 
       independent of each other. Advancing the outer iterator will discard
       the rest of the current sub-iteration.

       The delimiting values are discarded.
    '''

    done = False
    iterable = iter(iterable)

    def subiter():
        nonlocal done
        for value in iterable:
            if pred(value): return
            yield value
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)

def split_between(iterable, pred=lambda before,after:before + 1 != after):
    '''Produces an iterator which returns successive sub-iterations of 
       `iterable`, delimited at points where calling `pred` on two
       consecutive values produces truthiness. The default predicate
       returns True when the two values are not consecutive, making it
       possible to split a sequence of integers into contiguous ranges.

       The sub-iterations share the underlying iterable, so they are not 
       independent of each other. Advancing the outer iterator will discard
       the rest of the current sub-iteration.
    '''
    iterable = iter(iterable)

    try:
        before = next(iterable)
    except StopIteration:
        return

    done = False

    def subiter():
        nonlocal done, before
        for after in iterable:
            yield before
            prev, before = before, after
            if pred(prev, before):
                return

        yield before
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)
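As a quick sanity check, here's how the two splitters behave on small inputs (the definitions are repeated, minus their docstrings, so the snippet runs on its own):

```python
from collections import deque

def split_at(iterable, pred=lambda x: x is None):
    # As above, minus the docstring.
    done = False
    iterable = iter(iterable)

    def subiter():
        nonlocal done
        for value in iterable:
            if pred(value): return
            yield value
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)

def split_between(iterable, pred=lambda before, after: before + 1 != after):
    # As above, minus the docstring.
    iterable = iter(iterable)
    try:
        before = next(iterable)
    except StopIteration:
        return
    done = False

    def subiter():
        nonlocal done, before
        for after in iterable:
            yield before
            prev, before = before, after
            if pred(prev, before):
                return
        yield before
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)

# Odd numbers as delimiters (the delimiters are discarded):
print([list(g) for g in split_at([2, 4, 1, 6, 1, 8], lambda x: x % 2 == 1)])
# -> [[2, 4], [6], [8]]

# Default predicate: split a run of integers into contiguous ranges:
print([list(g) for g in split_between([1, 2, 3, 7, 8, 10])])
# -> [[1, 2, 3], [7, 8], [10]]
```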

Notes

  1. collections.deque(g, maxlen=0) is, I believe, currently the most efficient way of discarding the remaining values of an iterator, although it looks a bit mysterious. Credit to more_itertools for pointing me at that solution, and at the related expression for counting the number of objects produced by a generator:
     cache[0][0] if (cache:= deque(enumerate(it, 1), maxlen=1)) else 0
     Although I don't mean to blame more_itertools for the above monstrosity. (They do it with an if statement, not a walrus.)
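Unpacked into a function, that counting idiom looks like this (more_itertools calls it ilen): enumerate numbers the items from 1, and the maxlen=1 deque retains only the last (count, item) pair, if any.

```python
from collections import deque

def ilen(iterable):
    # Consume the iterable; the deque keeps only the final (index, item)
    # pair produced by enumerate, whose index is the total count.
    cache = deque(enumerate(iterable, 1), maxlen=1)
    return cache[0][0] if cache else 0

print(ilen(c for c in 'abcde' if c != 'c'))
# -> 4
```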
