
Creating a Python generator that yields ordered products of integers from two large lists

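So, I have two very large lists of numbers, l1 and l2. I'd like to multiply each element of l1 with each element of l2 without explicitly creating a new list of products. Thus, I'd like a generator. This part is easy. I can just do something like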

def products(l1, l2):
    # yield must live inside a function; the name is only illustrative
    for a in l1:
        for b in l2:
            yield a * b

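However, I also need these products to be ordered by their magnitude. I am wondering if there is some clever trick to order the yield statements so that this can also be done with a generator. In Python 3, if possible. Thanks.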

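I'll call the lists xs and ys, and assume they're sorted. As you noted in a comment, the smallest product is necessarily xs[0] * ys[0], but only if you're also assuming that all the numbers are non-negative, so I'll assume that too.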
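
To see why the non-negativity assumption matters, here is a small check. The sample lists are made up for illustration: once a negative number is present, the smallest product no longer has to sit at xs[0] * ys[0].

```python
# Hypothetical sample data: both lists are sorted, but xs contains a negative.
xs = [-3, 1, 2]
ys = [1, 4, 5]

corner = xs[0] * ys[0]                          # product of the two first elements
smallest = min(x * y for x in xs for y in ys)   # true minimum over all pairs

print(corner, smallest)
```

Here the true minimum comes from pairing the negative value with the largest element of ys, which is why the answer restricts itself to non-negative input.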

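After the first product, it gets messier, else you would have solved it already ;-) The next two to consider are xs[0] * ys[1] and xs[1] * ys[0]. Easy enough, but what to consider after that depends on which of those won. If xs[0] * ys[1] won, you only need to replace it with xs[0] * ys[2], but if xs[1] * ys[0] won, then both xs[1] * ys[1] and xs[2] * ys[0] come into play. And so on.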
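
A small table makes the "which one won" step concrete. This is only a sketch on made-up sample lists: it builds the full product table (fine for tiny inputs, not for large ones) and looks at the two candidates that follow the corner.

```python
# Hypothetical sample data: sorted and non-negative.
xs = [1, 2, 3]
ys = [1, 4, 5]

# table[i][j] == xs[i] * ys[j]
table = [[x * y for y in ys] for x in xs]
for row in table:
    print(row)

# Every row and every column is non-decreasing, so after (0, 0) the only
# candidates for the next product are its right and lower neighbours.
rows_sorted = all(row == sorted(row) for row in table)
cols_sorted = all(list(col) == sorted(col) for col in zip(*table))

candidates = sorted([(table[0][1], (0, 1)), (table[1][0], (1, 0))])
print(candidates[0])   # the winning (product, index pair)
```

With this data the winner is xs[1] * ys[0], so both xs[1] * ys[1] and xs[2] * ys[0] become candidates next, exactly the messier case described above.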

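The following keeps track of the growing set of possibilities with a heap. The heap never holds more than len(xs) items, so the code first arranges to make xs the shorter list: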

def upprod(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 4-tuples:
    #     (product, xs index, ys index, xs[xs index])
    h = [(xs[0] * ys[0], 0, 0, xs[0])]
    while h:
        prod, xi, yi, x = heappop(h)
        yield prod
        # same x with next y
        yi += 1
        if yi < lenys:
            heappush(h, (x * ys[yi], xi, yi, x))
        # if this is the first time we used x, start
        # the next x going
        if yi == 1:
            xi += 1
            if xi < lenxs:
                x = xs[xi]
                heappush(h, (x * ys[0], xi, 0, x))
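
The bound on the heap size can be checked empirically. The sketch below is an instrumented copy of the same idea, not the code above: upprod_peak is a hypothetical helper that collects all products into a list and also reports the largest heap size it saw, and the random test lists are assumptions.

```python
from heapq import heappush, heappop
from random import randrange

def upprod_peak(xs, ys):
    """Return (products, peak heap size); xs, ys sorted and non-negative."""
    if len(ys) < len(xs):
        xs, ys = ys, xs           # make xs the shorter list
    if not xs:
        return [], 0
    out, peak = [], 1
    h = [(xs[0] * ys[0], 0, 0)]   # (product, xs index, ys index)
    while h:
        prod, xi, yi = heappop(h)
        out.append(prod)
        yi += 1
        if yi < len(ys):
            # same x with next y
            heappush(h, (xs[xi] * ys[yi], xi, yi))
        if yi == 1 and xi + 1 < len(xs):
            # first time this x was used: start the next x going
            heappush(h, (xs[xi + 1] * ys[0], xi + 1, 0))
        peak = max(peak, len(h))
    return out, peak

xs = sorted(randrange(1000) for _ in range(5))
ys = sorted(randrange(1000) for _ in range(50))
products, peak = upprod_peak(xs, ys)
print(len(products), peak)
```

Each x contributes at most one live heap entry at a time, so the reported peak should never exceed the length of the shorter list.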

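I'd be pleasantly surprised if an essentially more efficient solution existed. If someone thinks they have one, please try it first with this randomized tester: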

from itertools import product
from random import randrange
MAXLEN = 10
UB = 1000
ntest = 0
while True:
    ntest += 1
    lenxs = randrange(MAXLEN + 1)
    lenys = randrange(MAXLEN + 1)
    xs = sorted(randrange(UB) for i in range(lenxs))
    ys = sorted(randrange(UB) for i in range(lenys))
    brute = sorted(a*b for a, b in product(xs, ys))
    got = list(upprod(xs, ys))
    if brute != got:
        print("OUCH!")
        print(xs)
        print(ys)
        print(brute)
        print(got)
        assert False
    if ntest % 10_000 == 0:
        print(f"finished test {ntest:,}")

EDIT - theoretically better in some sense ;-)

The above doesn't fully exploit the partial ordering we can deduce from the indices alone: if

i1 <= i2 and j1 <= j2

then we know

xs[i1] * ys[j1] <= xs[i2] * ys[j2]

because sorting implies xs[i1] <= xs[i2] and ys[j1] <= ys[j2], and with all values non-negative the products inherit that order.
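
The implication can be brute-force checked on small inputs. This is a sketch with made-up data; dominated_pairs_ok is a hypothetical helper name, and it compares every pair of index pairs, so it is only suitable for tiny lists.

```python
from itertools import product

# Hypothetical sample data: sorted and non-negative.
xs = [0, 2, 3, 7]
ys = [1, 1, 4, 9]

def dominated_pairs_ok(xs, ys):
    """Check xs[i1]*ys[j1] <= xs[i2]*ys[j2] whenever i1 <= i2 and j1 <= j2."""
    idx = list(product(range(len(xs)), range(len(ys))))
    return all(
        xs[i1] * ys[j1] <= xs[i2] * ys[j2]
        for (i1, j1) in idx
        for (i2, j2) in idx
        if i1 <= i2 and j1 <= j2
    )

print(dominated_pairs_ok(xs, ys))            # sorted, non-negative input
print(dominated_pairs_ok([-2, 1], [1, 3]))   # sorted, but with a negative value
```

The second call shows that sortedness alone is not enough: with a negative value the index order no longer predicts the product order.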

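So, for example, if the index pairs (0, 1) and (1, 0) are on the heap, and the second one wins, (2, 0) needs to be added to the heap but (1, 1) doesn't: from the indices alone, we know that the product from the (0, 1) still in the heap is no larger than the product from (1, 1). Only when (0, 1) is also removed does (1, 1) need to be added.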

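In general, each pair of the form (i, 0) has a single immediate predecessor (i-1, 0), each (0, j) has the single predecessor (0, j-1), and every other (i, j) has two immediate predecessors: (i-1, j) and (i, j-1). There's no need to put a pair on the heap until all of its predecessors have been taken off the heap.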
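
The predecessor rule can be written down directly. The function name predecessors is hypothetical and is used here only to illustrate the rule; it is not part of the answer's code.

```python
def predecessors(i, j):
    """Immediate predecessors of index pair (i, j) in the partial order."""
    preds = []
    if i > 0:
        preds.append((i - 1, j))   # same y, previous x
    if j > 0:
        preds.append((i, j - 1))   # same x, previous y
    return preds

print(predecessors(0, 0))   # the starting corner has none
print(predecessors(3, 0))   # edge pair: one predecessor
print(predecessors(0, 2))   # edge pair: one predecessor
print(predecessors(2, 5))   # interior pair: two predecessors
```

An interior pair therefore has to "wait" for two pops before it can be pushed, while a pair on either edge can be pushed as soon as its single predecessor is popped.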

That leads to this code, which is seemingly "more elegant" because it's more symmetric:

def upprod(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 3-tuples:
    #     (product, xs index, ys index)
    h = [(xs[0] * ys[0], 0, 0)]

    # interior points for which only one immediate predecessor has
    # been processed; there's no need to put them in the heap
    # until their second predecessor has been processed too
    pending = set()

    def add(xi, yi):
        if xi < lenxs and yi < lenys:
            if xi and yi: # if either is 0, only one predecessor
                p = xi, yi
                if p in pending:
                    pending.remove(p)
                else:
                    pending.add(p)
                    return
            heappush(h, (xs[xi] * ys[yi], xi, yi))

    while h:
        prod, xi, yi = heappop(h)
        yield prod
        # same x with next y; and same y with next x
        add(xi, yi + 1)
        add(xi + 1, yi)
    assert not pending

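Compared to the first code, it keeps the heap smaller in many cases. But heap operations take time logarithmic in the number of heap entries, and the heap can still grow to len(xs) entries, so it's not much of a win. It's probably lost to the overhead of the two new function calls (and inlining those would be too ugly to bear).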

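My solution is to create a list of generators, one generator for each row of the product matrix, and then to use heapq.merge to sort the outputs of those generators. Each generator has a size of 44 bytes on a 32-bit machine, so the whole list of generators consumes only a modest amount of RAM.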

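heapq.merge (when no sorting key function is supplied) works by creating a 3-item list for each of the iterables you pass it. That list contains the next value from the iterable, an index number for the iterable, and a reference to the iterable's __next__ method. It places those entries on a heap to perform a mergesort of the iterables' values. You can see the details in its Python source code.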
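
A quick way to see that heapq.merge is lazy is to hand it infinite sorted streams. This is only an illustrative sketch; the input streams are made up and have nothing to do with products.

```python
from heapq import merge
from itertools import count, islice

evens = count(0, 2)     # infinite sorted stream: 0, 2, 4, ...
threes = count(0, 3)    # infinite sorted stream: 0, 3, 6, ...

# merge() returns an iterator; it only pulls values from its inputs on demand,
# so mixing infinite streams with a finite sorted list is fine.
merged = merge(evens, threes, [1, 5, 7])
first = list(islice(merged, 8))
print(first)
```

Because only the current head of each input lives on the heap, memory use grows with the number of inputs, not with the number of values they produce.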

So my approach isn't quite as frugal as Tim Peters' solutions, but it's not too shabby, IMHO. ;)

from heapq import merge

def sorted_prod_merge(xs, ys):
    ''' mergesort generators of the rows. '''
    # make xs the shorter list, so fewer generators are created
    if len(ys) < len(xs):
        xs, ys = ys, xs
    def gen(x):
        for y in ys:
            yield x * y
    yield from merge(*[gen(x) for x in xs])

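Here's some timeit code, which shows the running times of sorted_prod_merge, Tim Peters' solutions, and a few other versions of mine. I've used Tim's variable names to keep the code uniform. It's interesting to note that Tim's first version is roughly twice as fast as his more advanced solution. My sorted_prod_row runs quite fast, but it's a terrible RAM hog.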

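The timeit code uses a technique given in the itertools recipes to exhaust an iterator: we feed it to a zero-length deque. The time_test code sorts the 3 results from each Timer run. That's because the minimum result is the important one; the other values merely indicate the variance of the system at the time the test was run. See the note in the docs for Timer.repeat for details.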
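
In isolation, the two techniques look roughly like this. It is a minimal sketch: consume and noisy are hypothetical names, and the timed expression is a stand-in rather than one of the product generators.

```python
from collections import deque
from timeit import Timer

def consume(iterator):
    """Exhaust an iterator without storing its items (itertools recipe)."""
    deque(iterator, maxlen=0)

seen = []

def noisy():
    # hypothetical generator that records each value it produces
    for i in range(5):
        seen.append(i)
        yield i

consume(noisy())    # runs the generator to completion, keeps nothing

timer = Timer(lambda: consume(x * x for x in range(1000)))
results = timer.repeat(3, 10)   # 3 runs of 10 loops each
best = min(results)             # the minimum is the figure to compare
print(len(seen), f"{best:.6f}")
```

Sorting the three results, as time_test does, simply puts that minimum first while still showing how much the runs varied.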

from heapq import heappush, heappop, merge
from random import seed, randrange
from timeit import Timer
from collections import deque

seed(163)

# Brute force method, as a generator
def sorted_prod_brute(xs, ys):
    yield from sorted(x * y for x in xs for y in ys)

# By Tim Peters
def upprod1(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 4-tuples:
    #     (product, xs index, ys index, xs[xs index])
    h = [(xs[0] * ys[0], 0, 0, xs[0])]
    while h:
        prod, xi, yi, x = heappop(h)
        yield prod
        # same x with next y
        yi += 1
        if yi < lenys:
            heappush(h, (x * ys[yi], xi, yi, x))
        # if this is the first time we used x, start
        # the next x going
        if yi == 1:
            xi += 1
            if xi < lenxs:
                x = xs[xi]
                heappush(h, (x * ys[0], xi, 0, x))

# By Tim Peters
def upprod2(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 3-tuples:
    #     (product, xs index, ys index)
    h = [(xs[0] * ys[0], 0, 0)]

    # interior points for which only one immediate predecessor has
    # been processed; there's no need to put them in the heap
    # until their second predecessor has been processed too
    pending = set()

    def add(xi, yi):
        if xi < lenxs and yi < lenys:
            doit = True
            if xi and yi: # if either is 0, only one predecessor
                p = xi, yi
                if p in pending:
                    pending.remove(p)
                else:
                    pending.add(p)
                    doit = False
            if doit:
                heappush(h, (xs[xi] * ys[yi], xi, yi))
    while h:
        prod, xi, yi = heappop(h)
        yield prod
        # same x with next y; and same y with next x
        add(xi, yi + 1)
        add(xi + 1, yi)
    assert not pending

def sorted_prod_merge(xs, ys):
    ''' mergesort generators of the rows. '''
    if len(ys) < len(xs):
        xs, ys = ys, xs
    def gen(x):
        for y in ys:
            yield x * y
    yield from merge(*[gen(x) for x in xs])

def sorted_prod_row(xs, ys):
    ''' Heapsort, row by row.
        Fast, but not space-efficient: the maximum 
        heap size grows to almost len(ys) * len(xs)
    '''
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    x, xs = xs[0], xs[1:]
    heap = []
    #big = 0
    for y in ys:
        lo = x * y
        while heap and heap[0] <= lo:
            yield heappop(heap)
        yield lo
        for u in xs:
            heappush(heap, u * y)
        #big = max(big, len(heap))
    #print(big)
    while heap:
        yield heappop(heap)

def sorted_prod_diag(xs, ys):
    ''' Heapsort, going along the diagonals
        50% slower than sorted_prod_row, but more
        space-efficient: the maximum heap size 
        grows to around 0.5 * len(ys) * len(xs)
    '''
    if not (xs and ys):
        return
    lenxs, lenys = len(xs), len(ys)
    heap = []
    #big = 0
    for n in range(lenxs + lenys - 1):
        row = sorted(xs[n - i] * ys[i]
            for i in range(max(0, n + 1 - lenxs), min(lenys, n + 1)))
        lo = row[0]
        while heap and heap[0] <= lo:
            yield heappop(heap)
        yield lo
        for u in row[1:]:
            heappush(heap, u)
        #big = max(big, len(heap))
    #print(big)
    #assert not heap

def sorted_prod_block(xs, ys):
    ''' yield the top left corner, then merge sort
        the top row, the left column and the remaining 
        block. So we end up with max(len(xs), len(ys))
        recursively nested calls to merge(). It's ok
        for small lists, but too slow otherwise.
    '''
    if not (xs and ys):
        return
    x, *xs = xs
    y, *ys = ys
    yield x * y
    row = (y * u for u in xs)
    col = (x * v for v in ys)
    yield from merge(row, col, sorted_prod_block(xs, ys))

def sorted_prod_blockI(xs, ys):
    ''' Similar to sorted_prod_block except we use indexing
        to avoid creating sliced copies of the lists
    '''
    lenxs, lenys = len(xs), len(ys)
    def sorted_block(xi, yi):
        if xi == lenxs or yi == lenys:
            return
        x, y = xs[xi], ys[yi]
        yield x * y
        xi, yi = xi + 1, yi + 1
        row = (xs[i] * y for i in range(xi, lenxs))
        col = (ys[i] * x for i in range(yi, lenys))
        yield from merge(row, col, sorted_block(xi, yi))
    yield from sorted_block(0, 0)

functions = (
    sorted_prod_brute,
    upprod1,
    upprod2,
    sorted_prod_merge,
    #sorted_prod_row,
    sorted_prod_diag,
    #sorted_prod_block,
    #sorted_prod_blockI,
)

UB = 1000

def verify(numtests, maxlen=10):
    print('Verifying. maxlen =', maxlen)
    for k in range(numtests):
        lenxs = randrange(maxlen + 1)
        lenys = randrange(maxlen + 1)
        print(k, ':', lenxs, '*', lenys, '=', lenxs * lenys)
        xs = sorted(randrange(UB) for i in range(lenxs))
        ys = sorted(randrange(UB) for i in range(lenys))
        good = list(sorted_prod_brute(xs, ys))

        for func in functions[1:]:
            result = list(func(xs, ys))
            if result != good:
                print(func.__name__, 'failed!')
    print()

def time_test(loops=20):
    timings = []
    for func in functions:
        # Consume the generator output by feeding it to a zero-length deque
        t = Timer(lambda: deque(func(xs, ys), maxlen=0))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:18} : {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

verify(10, 10)
verify(20, 100)

print('\nTimings')
loops = 8192
minlen = 5
for k in range(6):
    lenxs = randrange(minlen, 2 * minlen)
    lenys = randrange(minlen, 2 * minlen)
    print(k, ':', loops, 'loops.', lenxs, '*', lenys, '=', lenxs * lenys)
    xs = sorted(randrange(UB) for i in range(lenxs))
    ys = sorted(randrange(UB) for i in range(lenys))
    time_test(loops)
    minlen *= 2
    loops //= 4

Here's the output on my ancient 2GHz 32-bit single-core machine, running Python 3.6.0 on an old Debian derivative of Linux. YMMV.

Verifying. maxlen = 10
0 : 8 * 9 = 72
1 : 9 * 0 = 0
2 : 1 * 7 = 7
3 : 8 * 10 = 80
4 : 10 * 5 = 50
5 : 10 * 0 = 0
6 : 5 * 2 = 10
7 : 5 * 10 = 50
8 : 3 * 0 = 0
9 : 0 * 6 = 0

Verifying. maxlen = 100
0 : 64 * 0 = 0
1 : 77 * 96 = 7392
2 : 24 * 13 = 312
3 : 53 * 39 = 2067
4 : 74 * 39 = 2886
5 : 92 * 97 = 8924
6 : 31 * 48 = 1488
7 : 39 * 17 = 663
8 : 42 * 25 = 1050
9 : 94 * 25 = 2350
10 : 82 * 83 = 6806
11 : 2 * 97 = 194
12 : 90 * 30 = 2700
13 : 93 * 24 = 2232
14 : 91 * 37 = 3367
15 : 24 * 86 = 2064
16 : 70 * 15 = 1050
17 : 2 * 4 = 8
18 : 72 * 58 = 4176
19 : 25 * 84 = 2100


Timings
0 : 8192 loops. 7 * 8 = 56
sorted_prod_brute  : 0.659312, 0.665853, 0.710947
upprod1            : 1.695471, 1.705061, 1.739299
sorted_prod_merge  : 1.990161, 1.991129, 2.001242
sorted_prod_diag   : 3.013945, 3.018927, 3.053115
upprod2            : 3.582396, 3.586332, 3.622949

1 : 2048 loops. 18 * 16 = 288
sorted_prod_brute  : 0.826128, 0.840111, 0.863559
upprod1            : 2.240931, 2.241636, 2.244615
sorted_prod_merge  : 2.301838, 2.304075, 2.306918
sorted_prod_diag   : 3.030672, 3.053302, 3.135322
upprod2            : 4.860378, 4.949804, 4.953891

2 : 512 loops. 39 * 32 = 1248
sorted_prod_brute  : 0.907932, 0.918692, 0.942830
sorted_prod_merge  : 2.559567, 2.561709, 2.604387
upprod1            : 2.700482, 2.701147, 2.757695
sorted_prod_diag   : 2.961776, 2.965271, 2.995747
upprod2            : 5.563303, 5.654425, 5.656695

3 : 128 loops. 68 * 70 = 4760
sorted_prod_brute  : 0.823448, 0.827748, 0.835049
sorted_prod_merge  : 2.591373, 2.592134, 2.685534
upprod1            : 2.760466, 2.763615, 2.795082
sorted_prod_diag   : 2.789673, 2.828662, 2.848498
upprod2            : 5.483504, 5.488450, 5.517847

4 : 32 loops. 122 * 156 = 19032
sorted_prod_brute  : 0.873736, 0.880958, 0.892846
sorted_prod_merge  : 2.701089, 2.742456, 2.818822
upprod1            : 2.875358, 2.881793, 2.922569
sorted_prod_diag   : 2.953450, 2.988184, 3.012430
upprod2            : 5.780552, 5.812967, 5.826775

5 : 8 loops. 173 * 309 = 53457
sorted_prod_brute  : 0.711012, 0.711816, 0.721627
sorted_prod_merge  : 1.997386, 1.999774, 2.033489
upprod1            : 2.137337, 2.172369, 3.335119
sorted_prod_diag   : 2.324447, 2.329552, 2.331095
upprod2            : 4.278704, 4.289019, 4.324436

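There doesn't seem to be any other way to order these outputs without creating the list, because the outputs can't be sorted without being stored. Here's how you can do that.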

myList = []

for i in range(len(l1)):
    for j in range(len(l2)):
        output = l1[i] * l2[j]
        myList.append(output)
myList.sort()
print(myList)

Hope that helps.

