
Merge join two generators in Python

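I want to merge two Kyoto Cabinet B-tree databases by key (using the Kyoto Cabinet Python API). The resulting list should contain every unique key (and its value) from either of the two input databases.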

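The following code works, but I think it is ugly.
left_generator and right_generator are two cursor objects. It is especially odd that get() returns None once a generator is exhausted.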

def merge_join_kv(left_generator, right_generator):
    stop = False
    while left_generator.get() or right_generator.get():
        try:
            comparison = cmp(right_generator.get_key(), left_generator.get_key())
            if comparison == 0:
                yield left_generator.get_key(), left_generator.get_value()
                left_generator.next()
                right_generator.next()
            elif (comparison < 0) or (not left_generator.get() or not right_generator.get()):
                yield right_generator.get_key(), right_generator.get_value()
                right_generator.next()
            else:
                yield left_generator.get_key(), left_generator.get_value()
                left_generator.next()
        except StopIteration:
            if stop:
                raise
            stop = True

More generally: is there a function or library that merge-joins generators using cmp()?

I think this is what you need; orderedMerge is based on gnibbler's code, but adds a custom key function and a unique argument.

import kyotocabinet
import collections
import heapq

class IterableCursor(kyotocabinet.Cursor, collections.Iterator):
    def __init__(self, *args, **kwargs):
        kyotocabinet.Cursor.__init__(self, *args, **kwargs)
        collections.Iterator.__init__(self)

    def next(self):
        "Return (key,value) pair"
        res = self.get(True)
        if res is None:
            raise StopIteration
        else:
            return res

def orderedMerge(*iterables, **kwargs):
    """Take a list of ordered iterables; return as a single ordered generator.

    @param key:     function, for each item return key value
                    (Hint: to sort descending, return negated key value)

    @param unique:  boolean, return only first occurrence for each key value?
    """
    key     = kwargs.get('key', (lambda x: x))
    unique  = kwargs.get('unique', False)

    _heapify       = heapq.heapify
    _heapreplace   = heapq.heapreplace
    _heappop       = heapq.heappop
    _StopIteration = StopIteration

    # preprocess iterators as heapqueue
    h = []
    for itnum, it in enumerate(map(iter, iterables)):
        try:
            next  = it.next
            data   = next()
            keyval = key(data)
            h.append([keyval, itnum, data, next])
        except _StopIteration:
            pass
    _heapify(h)

    # process iterators in ascending key order
    oldkeyval = None
    while True:
        try:
            while True:
                keyval, itnum, data, next = s = h[0]  # get smallest-key value
                                                      # raises IndexError when h is empty
                # if unique, skip duplicate keys
                if unique and keyval==oldkeyval:
                    pass
                else:
                    yield data
                    oldkeyval = keyval

                # load replacement value from same iterator
                s[2] = data = next()        # raises StopIteration when exhausted
                s[0] = key(data)
                _heapreplace(h, s)          # restore heap condition
        except _StopIteration:
            _heappop(h)                     # remove empty iterator
        except IndexError:
            return    

Your function can then be written as

from operator import itemgetter

def merge_join_kv(leftGen, rightGen):
    # assuming that kyotocabinet.Cursor has a copy initializer
    leftIter = IterableCursor(leftGen)
    rightIter = IterableCursor(rightGen)

    return orderedMerge(leftIter, rightIter, key=itemgetter(0), unique=True)
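As a lighter alternative to subclassing the cursor, a small generator function could adapt it instead. The sketch below relies only on the behaviour described in this thread: get(True) returns a (key, value) pair and steps forward, or returns None once the cursor is exhausted. It is written so that it runs on Python 3. iter_cursor is a name made up here, and FakeCursor is a hypothetical stand-in so the example runs without Kyoto Cabinet installed.

```python
def iter_cursor(cursor):
    """Yield (key, value) pairs until cursor.get(True) returns None."""
    while True:
        record = cursor.get(True)   # assumed: returns the pair and steps forward
        if record is None:          # cursor exhausted
            return
        yield record


class FakeCursor(object):
    """Hypothetical stand-in for a positioned cursor over sorted records."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = 0

    def get(self, step=False):
        if self._pos >= len(self._records):
            return None
        record = self._records[self._pos]
        if step:
            self._pos += 1
        return record


pairs = list(iter_cursor(FakeCursor([("a", 1), ("c", 3)])))
```

A real cursor would presumably have to be positioned on the first record before being passed in. The resulting generators can be handed to any merge routine that expects ordinary iterators.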

Python 2.6 has a merge function in heapq, but it does not support a user-defined cmp or key function:

from heapq import heapify, heappop, heapreplace

def merge(*iterables):
    '''Merge multiple sorted inputs into a single sorted output.

    Similar to sorted(itertools.chain(*iterables)) but returns a generator,
    does not pull the data into memory all at once, and assumes that each of
    the input streams is already sorted (smallest to largest).

    >>> list(merge([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
    [0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25]

    '''
    _heappop, _heapreplace, _StopIteration = heappop, heapreplace, StopIteration

    h = []
    h_append = h.append
    for itnum, it in enumerate(map(iter, iterables)):
        try:
            next = it.next
            h_append([next(), itnum, next])
        except _StopIteration:
            pass
    heapify(h)

    while 1:
        try:
            while 1:
                v, itnum, next = s = h[0]   # raises IndexError when h is empty
                yield v
                s[0] = next()               # raises StopIteration when exhausted
                _heapreplace(h, s)          # restore heap condition
        except _StopIteration:
            _heappop(h)                     # remove empty iterator
        except IndexError:
            return
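On Python 3.5 and later, heapq.merge accepts a key argument, so the standard library can perform the ordered merge itself. The sketch below pairs it with itertools.groupby to keep only the first record for each key. merge_unique is a hypothetical name, and the inputs are assumed to be iterables of (key, value) pairs that are already sorted by key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_unique(*iterables):
    """Merge sorted (key, value) iterables, keeping the first record per key."""
    first = itemgetter(0)
    merged = heapq.merge(*iterables, key=first)   # key= needs Python 3.5+
    for _, group in groupby(merged, key=first):
        yield next(group)                          # first record for this key


left = [("a", "L1"), ("b", "L2"), ("d", "L4")]
right = [("b", "R2"), ("c", "R3")]
result = list(merge_unique(left, right))
```

In CPython, records with equal keys come out in the order of the input iterables, so here the left-hand value is the one kept for the shared key "b".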
