
Multiple clients for a Python generator?

As a follow-up to this question, I am trying to circumvent the list building exemplified by range(int(1e8)) by using the generator xrange(int(1e8)). Here xrange is just an example of a process that produces a long sequence of values (please assume it cannot be easily reproduced). Some more background: I have a long list of timestamp/value pairs that I want to do some processing on (a sort of time series). I try to avoid pulling these into memory as a whole, because that is a lot of data.
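For concreteness, such a source might look like the following minimal sketch; the file name and line format here are assumptions for illustration only:

def parse_series(path):
    # lazily yield (timestamp, value) pairs one line at a time,
    # so the whole series never resides in memory
    with open(path) as f:
        for line in f:
            ts, val = line.split()
            yield float(ts), float(val)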

I thought it would be cool if I could apply multiple processing units simultaneously to this stream of data produced by my generator. The first idea was to use itertools.tee(), e.g.:

from itertools import tee
g1,g2 = tee(xrange(int(1e8)),2)
sum(g1), sum(g2)

But then I found that only the first sum() would use the generator, while tee() internally builds a list again (which I wanted to avoid).
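The buffering happens because tee() has to retain every item until the slowest of its iterators has consumed it, and sum(g1) exhausts g1 before g2 is touched. As a sketch of the contrast, consuming both iterators in lockstep keeps tee's internal buffer at a single item (Python 2, using izip to avoid building a list):

from itertools import tee, izip

g1, g2 = tee(xrange(int(1e8)), 2)
s1 = s2 = 0
for a, b in izip(g1, g2):  # both iterators advance together, so
    s1 += a                # tee's internal buffer stays tiny
    s2 += b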

So I thought I was in need of an asynchronous solution, i.e. one that would allow each sum() to do an update on every generator step. The things that came to mind were coroutines and the twisted library.

But I have really used neither before, and I partly cannot even tell whether the approaches would work, or be effective/efficient/performant.

At this point, I would appreciate any suggestions from the audience!


Update

I wanted to avoid the callback-based solution, as it apparently decreases performance significantly (this is how it is currently implemented). I have added some profiling below (please add comments if the test is not objective):

class SinkA:
  # pull model: consumes the source directly by iterating over it
  def __init__(self, src):
    for i in src: pass

class SinkB:
  # push model: fed one item at a time via the f() callback
  def f(self, i):
    pass

class Source:
  def __iter__(self):
    for i in xrange(int(1e4)):
      yield i

def t1():
  # pull: the sink iterates the source itself
  src = Source()
  snk = SinkA(src)

def t2():
  # push: an external loop feeds the sink one item per call
  src = Source()
  snk = SinkB()
  for i in src: snk.f(i)

if __name__ == "__main__":
    from timeit import Timer
    n = 1000
    t = Timer("t1()", "from __main__ import t1, t2, SinkA, SinkB, Source")
    print "%.2f usec/pass" % (1000000 * t.timeit(number=n)/n) # 612.11 usec/pass
    t = Timer("t2()", "from __main__ import t1, t2, SinkA, SinkB, Source")
    print "%.2f usec/pass" % (1000000 * t.timeit(number=n)/n) # 1933.39 usec/pass

Update 2

What more can I say? I have this callback-based solution, which appears to be inefficient. The generator-based approach appears promising, but I have too little experience with that kind of programming, especially when it comes to more sophisticated things such as coroutines or the twisted library. To sum up: I have multiple consumers for a process that generates lots of data, and I have spotted some potential approaches. Now I'm looking for qualified statements by experienced users who have probably accomplished similar tasks before; statements that address which approach could be appropriate and how the approaches relate to each other, or what other approaches I might have missed after all.

As a generic approach, I would replace the generator's pull model with callbacks and, probably, wrap the generator, like this:

def walk(gen, callbacks):
    for item in gen:
        for f in callbacks:
            f(item)

If your processors are in separate threads and you want them to block while waiting, you can register Queue.put (or anything equivalent) as a callback for each processor and poll those queues independently. This allows you to use both broadcasting and worker-pool models if you need to.
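A minimal sketch of that idea, using the walk() wrapper from above (Python 2; the None sentinel marking end-of-stream is an assumption for illustration):

import threading
from Queue import Queue

def consume(q, results):
    # block on the queue until the producer signals completion
    total = 0
    while True:
        item = q.get()
        if item is None:  # sentinel: producer is done
            break
        total += item
    results.append(total)

q1, q2 = Queue(), Queue()
results = []
threads = [threading.Thread(target=consume, args=(q, results)) for q in (q1, q2)]
for t in threads:
    t.start()

walk(xrange(int(1e4)), [q1.put, q2.put])  # broadcast each item to both queues
for q in (q1, q2):
    q.put(None)
for t in threads:
    t.join()
print results  # two partial results, one per consumer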

Edit

Another solution would be to use coroutines:

def source(*dests):
    for i in xrange(int(1e4)):
        for dest in dests:
            dest.send(i)

def sink():
    while True:
        i = yield

def t3():
    snk = sink()
    snk.next() # activate the coroutine
    source(snk)

if __name__ == '__main__':

    from timeit import Timer
    n = 1000
    t = Timer("t3()", "from __main__ import source, sink, t3")
    print "%.2f usec/pass" % (1000000 * t.timeit(number=n)/n) # 872.99 usec/pass

Looks fast enough. Basically, coroutines are inverted generators: you pull from a generator, and push to a coroutine.
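To actually get a result out of such a sink, one option (a sketch, not part of the original answer) is to accumulate inside the coroutine and publish the total when the coroutine is closed:

def summing_sink(results):
    total = 0
    try:
        while True:
            total += yield  # receive the next pushed value
    except GeneratorExit:
        results.append(total)  # publish the total on close()

results = []
snk = summing_sink(results)
snk.next()     # activate the coroutine
source(snk)
snk.close()    # raises GeneratorExit inside the sink
print results[0]  # 49995000 for xrange(int(1e4))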

You don't really address this, but do you want each consumer to see exactly the same data (in which case tee is probably the best solution), or not?

If not, then you can simply have each consumer read from the one generator object.

If you do want them to get exactly the same data, try tee (uses more memory) versus two generators (more IO), and see which is faster.

As to your timings, what your data show is simply that there is an overhead to multiple function calls, and that one of your methods avoids intermediate function calls.

If you want to improve performance, try running this on PyPy, which has a hotspot-optimising JIT.

Since generators are cheap in memory, why don't you simply use two independent generators?

g1 = xrange(int(1e8))
g2 = xrange(int(1e8))
sum(g1), sum(g2)

A solution for sharing Python generators, with tests:

https://gist.github.com/earonesty/cafa4626a2def6766acf5098331157b3

Example of use:

def mygen():
    yield from [1, 2, 3]

m1 = Muxer(mygen)
m2 = Muxer(mygen)

consume1(m1)
consume2(m2)
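Note that a Muxer queues items for listeners that lag behind, so draining m1 completely before touching m2 would buffer the entire stream, much like tee. A sketch of memory-friendly concurrent consumption (the thread wrapper and consume function are assumptions, not part of the gist):

import threading

def consume(m, label, results):
    # each Muxer view yields the full stream independently
    results[label] = sum(m)

results = {}
threads = [threading.Thread(target=consume, args=(m, name, results))
           for m, name in ((m1, "m1"), (m2, "m2"))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # both consumers saw the same items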

Code for muxer.py:

import queue
from threading import Lock
from collections import namedtuple

class Muxer():
    # one shared generator per source function, tracked in class-level state
    Entry = namedtuple('Entry', 'genref listeners lock')

    already = {}
    top_lock = Lock()

    def __init__(self, func, restart=False):
        self.restart = restart
        self.func = func
        self.queue = queue.Queue()

        with self.top_lock:
            if func not in self.already:
                # first listener for this function: create the shared generator
                self.already[func] = self.Entry([func()], [], Lock())
            ent = self.already[func]

        self.genref = ent.genref
        self.lock = ent.lock
        self.listeners = ent.listeners

        self.listeners.append(self)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            # fast path: another listener already queued this item for us
            e = self.queue.get_nowait()
        except queue.Empty:
            with self.lock:
                try:
                    # re-check under the lock: the queue may have been
                    # filled while we were waiting
                    e = self.queue.get_nowait()
                except queue.Empty:
                    try:
                        # advance the shared generator and fan the item
                        # out to every other listener's queue
                        e = next(self.genref[0])
                        for other in self.listeners:
                            if other is not self:
                                other.queue.put(e)
                    except StopIteration:
                        if self.restart:
                            # recreate the generator for future Muxers
                            self.genref[0] = self.func()
                        raise
        return e

    def __del__(self):
        # unregister this listener; drop the shared entry once no listeners remain
        with self.top_lock:
            try:
                self.listeners.remove(self)
            except ValueError:
                pass
            if not self.listeners and self.func in self.already:
                del self.already[self.func]

I suggest you look at how to accomplish this with coroutines, more specifically this broadcast example.
