An efficient, Pythonic way to compute the difference between the largest and smallest elements of each tuple in a large list of tuples

Question

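With the code below, I am trying to compute the difference between the largest and smallest elements of each tuple in a large list of tuples and store the results in a list. However, the code runs for a long time and is then killed by the operating system because it consumes a huge amount of RAM. The large list is generated by choosing n numbers from a list in every possible way, as shown in the snippet below. I think the problem lies there: itertools.combinations, which tries to hold a huge list in memory.

What I actually need is the sum of the differences over all combinations, which is why I first thought of collecting the differences in a list and then calling sum.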

import itertools

n = 40

lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]

result = [max(x)-min(x) for x in itertools.combinations(lst, n)]

If anyone can offer hints on solving this, it would be a great learning experience for me.
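To get a feel for the scale of the problem, the number of tuples can be counted without generating any of them. The sketch below uses math.comb with the sizes from the question; the 8-bytes-per-entry figure is an assumption (one pointer per list slot on a 64-bit CPython build) and gives only a rough lower bound.

```python
import math

# Same sizes as in the question: choose n = 40 items out of 50.
total_items = 50
n = 40

num_combinations = math.comb(total_items, n)
print(num_combinations)  # 10272278170

# Assumed lower bound: one 8-byte pointer per stored result.
# Real usage is higher, since the stored ints are objects themselves.
approx_bytes = num_combinations * 8
print(f"{approx_bytes / 2**30:.1f} GiB for the result list's pointers alone")
```

With roughly ten billion results, a list of differences cannot fit in the memory of a typical machine, whatever the generator does.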

Answer 1

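Same basic idea as in slothrop's answer, but implemented differently and faster. I use an outer loop over how many list numbers lie between the minimum and the maximum:

from math import comb

def Kelly2(lst, n):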
    lst = sorted(lst)
    total = 0
    for between in range(n-2, len(lst)-1):
        combs = comb(between, n-2)
        diffs = sum(b - a for a, b in zip(lst, lst[between+1:]))
        total += combs * diffs
    return total

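From the between numbers we must choose n-2, because we want n numbers including the minimum and the maximum. This number of combinations is the same for every min/max pair that has between numbers between them, so we compute it only once. And instead of multiplying it by each max-min difference, we add up those differences and multiply the sum by the number of combinations once.

This can be taken further, because combs and diffs each change only slightly when between increases. Here are benchmark times, starting with "50 choose 45" instead of the original "50 choose 40" (so that chepner's brute force is still fast enough) and going up to "200000 choose 160000":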
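The counting argument can be checked by brute force on a tiny input. This is only an illustrative sketch: the list and n below are made-up values, and the values are kept distinct so that a (min, max) pair identifies exactly one pair of sorted indexes.

```python
from collections import Counter
from itertools import combinations
from math import comb

lst = sorted([5, 1, 9, 3, 7, 11])  # small, distinct, illustrative values
n = 4

# Brute force: count how many n-combinations have each (min, max) pair.
observed = Counter((min(c), max(c)) for c in combinations(lst, n))

# Formula: a pair at sorted indexes i < j has j-i-1 numbers in between,
# from which the remaining n-2 members must be chosen.
formula_total = 0
for i in range(len(lst)):
    for j in range(i + 1, len(lst)):
        between = j - i - 1
        count = comb(between, n - 2)  # 0 when too few numbers lie in between
        assert observed[(lst[i], lst[j])] == count
        formula_total += count * (lst[j] - lst[i])

brute_total = sum(max(c) - min(c) for c in combinations(lst, n))
assert formula_total == brute_total
```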

50 choose 45
 5.281 s  chepner
 0.000 s  slothrop_1
 0.000 s  slothrop_2
 0.000 s  Kelly
 0.000 s  Kelly2
 0.000 s  Kelly3
 0.000 s  Kelly4
 0.000 s  Kelly5

2000 choose 1600
 4.292 s  slothrop_1
 0.064 s  slothrop_2
 0.041 s  Kelly
 0.041 s  Kelly2
 0.037 s  Kelly3
 0.034 s  Kelly4
 0.001 s  Kelly5

10000 choose 8000
 5.036 s  slothrop_2
 3.795 s  Kelly
 3.675 s  Kelly2
 3.622 s  Kelly3
 3.533 s  Kelly4
 0.008 s  Kelly5

100000 choose 80000
 0.527 s  Kelly5

200000 choose 160000
 2.130 s  Kelly5

This is the super-fast one:

def Kelly5(lst, n):
    lst = sorted(lst)
    total = 0
    diffs = sum(lst[n-1:]) - sum(lst[:-n+1])
    combs = 1
    for between in range(n-2, len(lst)-1):
        total += combs * diffs
        combs = combs * (between+1) // (between-n+3)
        diffs += lst[~between-1] - lst[between+1]
    return total
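The two in-place updates in Kelly5 can be checked in isolation. The sketch below (with made-up sample values for n and lst) compares each running value against the from-scratch expression used in Kelly3. The combs update relies on the identity comb(b+1, k) = comb(b, k) * (b+1) / (b+1-k) with k = n-2, so the integer division is exact.

```python
from math import comb

n = 5        # combination size (illustrative)
k = n - 2    # members chosen from the numbers in between

# combs: start at comb(n-2, n-2) = 1 and update by the ratio above.
combs = 1
for between in range(n - 2, 30):
    assert combs == comb(between, k)
    combs = combs * (between + 1) // (between - n + 3)  # between-n+3 == between+1-k

# diffs: each step drops one element from the upper tail sum and
# one element from the lower head sum.
lst = sorted([8, 3, 5, 13, 1, 21, 2])  # illustrative data
diffs = sum(lst[n-1:]) - sum(lst[:-n+1])
for between in range(n - 2, len(lst) - 1):
    assert diffs == sum(lst[between+1:]) - sum(lst[:~between])
    diffs += lst[~between-1] - lst[between+1]
```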

Kelly3 and Kelly4 are intermediate optimizations between Kelly2 and Kelly5, which make it easier to see how I got there.

Full code:

from time import time
import itertools, math, random
from math import comb

n = 40
lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]


def chepner(lst, n):
    return sum(max(x) - min(x) for x in itertools.combinations(lst, n))


def slothrop_1(lst, n):
  slst = sorted(lst)
  ans = 0
  for i, j in itertools.combinations(range(len(slst)), 2):
    if j < i+n-1:
      continue
    n_comb = math.comb(j-i-1, n-2)
    ans += n_comb * (slst[j] - slst[i])
  return ans


def slothrop_2(lst, n):
  slst = sorted(lst)

  combs = {p: math.comb(p, n-2) for p in range(n-2, len(slst)-1)}

  ans = 0
  for i in range(len(slst)-n+1):
    for j in range(i+n-1, len(slst)):
      ans += combs[j-i-1] * (slst[j] - slst[i])

  return ans


# My original
def Kelly(lst, n):
    lst = sorted(lst)
    return sum(
        comb(between, n-2) * sum(b - a for a, b in zip(lst, lst[between+1:]))
        for between in range(n-2, len(lst)-1)
    )


# Rewritten with loops for the later optimizations
def Kelly2(lst, n):
    lst = sorted(lst)
    total = 0
    for between in range(n-2, len(lst)-1):
        combs = comb(between, n-2)
        diffs = sum(b - a for a, b in zip(lst, lst[between+1:]))
        total += combs * diffs
    return total


# Compute diffs as diff of sums (instead of sum of diffs)
def Kelly3(lst, n):
    lst = sorted(lst)
    total = 0
    for between in range(n-2, len(lst)-1):
        combs = comb(between, n-2)
        diffs = sum(lst[between+1:]) - sum(lst[:~between])
        total += combs * diffs
    return total


# Compute diffs by updating (instead of from scratch)
def Kelly4(lst, n):
    lst = sorted(lst)
    total = 0
    diffs = sum(lst[n-1:]) - sum(lst[:-n+1])
    for between in range(n-2, len(lst)-1):
        combs = comb(between, n-2)
        total += combs * diffs
        diffs += lst[~between-1] - lst[between+1]
    return total


# Compute combs by updating (instead of from scratch)
def Kelly5(lst, n):
    lst = sorted(lst)
    total = 0
    diffs = sum(lst[n-1:]) - sum(lst[:-n+1])
    combs = 1
    for between in range(n-2, len(lst)-1):
        total += combs * diffs
        combs = combs * (between+1) // (between-n+3)
        diffs += lst[~between-1] - lst[between+1]
    return total


funcs = chepner, slothrop_1, slothrop_2, Kelly, Kelly2, Kelly3, Kelly4, Kelly5

#-- Correctness ------------------------------------------

short = lst[:20]
for m in range(2, len(short)+1):
    expect = funcs[0](short, m)
    for f in funcs[1:]:
        result = f(short, m)
        assert result == expect

#-- Speed ------------------------------------------------

# Generate similar larger input data
def gen(N):
    n = N * 8 // 10
    lst = random.choices(range(20 * N), k=N)
    return lst, n

def test(lst, n, funcs):
    print(len(lst), 'choose', n)
    expect = None
    for f in funcs:
        copy = lst[:]
        t = time()
        result = f(copy, n)
        t = time() - t
        print(f'{t:6.3f} s ', f.__name__)
        if expect is None:
            expect = result
        assert result == expect
    print()

test(lst, 45, funcs)
test(*gen(2000), funcs[1:])
test(*gen(10000), funcs[2:])
test(*gen(100000), funcs[-1:])
test(*gen(200000), funcs[-1:])

Answer 2

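This uses the approach outlined in the comments on @chepner's answer.

The print statement in the loop shows that the code does what I intended, but I have not independently verified that the overall answer is correct.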

import itertools
import math

n = 40
lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]
slst = sorted(lst)

ans = 0
for i, j in itertools.combinations(range(len(slst)), 2):
  # i and j are candidate *indexes* into slst (not values):
  # we'll count how many combinations have minimum slst[i] and maximum slst[j]
  
  if j < i+n-1:
    # In this case there can't be an n-item combination
    # whose minimum is slst[i] and maximum is slst[j]:
    # there aren't enough items with values in between
    continue

  # How many n-item combinations have minimum slst[i] and maximum slst[j]?
  # It's the number of ways we can pick the (n-2) other members of the combination
  # from the (j-i-1) values between i and j in slst.
  n_comb = math.comb(j-i-1, n-2)
  print(f"{n_comb} combinations with minimum {slst[i]} (index {i}) and maximum {slst[j]} (index {j})")

  # Each of these combinations contributes slst[j] - slst[i] to the sum:
  ans += n_comb * (slst[j] - slst[i])

print(f"Overall sum of differences: {ans}")

Result:

[omitted the lines for individual pairs of indices]
Overall sum of differences: 9965200498117

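Another version with some optimizations (it avoids repeated calls to math.comb with the same value and loops explicitly over only the relevant (i, j) pairs):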

import math

n = 40
lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]
slst = sorted(lst)

combs = {p: math.comb(p, n-2) for p in range(n-2, len(slst)-1)}

ans = 0
for i in range(len(slst)-n+1):
  for j in range(i+n-1, len(slst)):
    ans += combs[j-i-1] * (slst[j] - slst[i])

print(f"Overall sum of differences: {ans}")
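One way to gain confidence in the pair-counting approach is to compare it with brute force on an input small enough to enumerate. The sketch below wraps the optimized loop in a function; the function names and the 12-element sample (the first items of lst) are assumptions made for illustration.

```python
import itertools
import math

def sum_of_ranges_pairs(lst, n):
    # Pair-counting approach, same logic as the optimized version above.
    slst = sorted(lst)
    combs = {p: math.comb(p, n - 2) for p in range(n - 2, len(slst) - 1)}
    ans = 0
    for i in range(len(slst) - n + 1):
        for j in range(i + n - 1, len(slst)):
            ans += combs[j - i - 1] * (slst[j] - slst[i])
    return ans

def sum_of_ranges_brute(lst, n):
    # Direct enumeration; only feasible for small inputs.
    return sum(max(c) - min(c) for c in itertools.combinations(lst, n))

sample = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484]
for n in range(2, len(sample) + 1):
    assert sum_of_ranges_pairs(sample, n) == sum_of_ranges_brute(sample, n)
```

The sample deliberately contains repeated values (639 and 679), since the index-based counting has to treat equal values at different positions as separate items.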

Answer 3

If you want the sum, you don't need to hold all the differences in memory at once. Just add them up as you compute them.

s = sum(max(x) - min(x) for x in itertools.combinations(lst, n))

Iterating over all the combinations will still take a while, but this uses only a small, constant amount of memory.
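How long "a while" is can be estimated by timing a small slice of the stream and extrapolating. This is a rough sketch only: the stand-in list, the sample size, and the assumption that every combination costs about the same are all illustrative, and the resulting figure depends on the machine.

```python
import itertools
import math
import time

lst = list(range(50))   # stand-in for the 50-element list in the question
n = 40
sample_size = 200_000   # arbitrary number of combinations to time

start = time.perf_counter()
# islice takes only the first sample_size combinations from the lazy stream.
partial = sum(
    max(x) - min(x)
    for x in itertools.islice(itertools.combinations(lst, n), sample_size)
)
elapsed = time.perf_counter() - start

total = math.comb(len(lst), n)
estimate_hours = elapsed / sample_size * total / 3600
print(f"about {estimate_hours:.1f} hours for all {total} combinations")
```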


Solution 1
3, accepted, 2023-01-31 02:31:53

Solution 2
2, 2023-01-30 23:24:37

Solution 3
0, 2023-01-30 22:33:51
