
Generate n choose 2 combinations in Python on very large data sets

I need to create n choose 2 combinations and am currently using itertools.combinations from Python's itertools module.

For a single list of 30,000 strings, creating the combinations runs for hours and uses many gigabytes of RAM, i.e.

list(itertools.combinations(longlist,2))

Is there a method of generating combinations that is better optimized for large objects in memory? Alternatively, is there a way of using numpy to speed up the process?

You can tell up front how many combinations there are from the binomial coefficient: (30k choose 2) = math.factorial(30000)//(math.factorial(2)*math.factorial(30000-2)) = 449985000 combinations.
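As a small sketch of that count: for pairs, the binomial coefficient reduces to n*(n-1)//2, which avoids building the very large factorials. The helper name n_choose_2 is made up for illustration; math.comb needs Python 3.8 or later.

```python
import math


def n_choose_2(n: int) -> int:
    """Number of unordered pairs from n items, without huge factorials."""
    return n * (n - 1) // 2


n = 30_000
pairs = n_choose_2(n)
print(pairs)  # 449985000

# math.comb (Python 3.8+) gives the same count.
assert pairs == math.comb(n, 2)
```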

That said, itertools.combinations returns a lazy iterator, so you can loop over it without loading all the combinations into memory as one big list.
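One way to use that laziness, sketched here under the assumption that the pairs can be processed in batches: pull a fixed number of pairs at a time with itertools.islice instead of calling list() on the whole iterator. The name pair_batches and the batch size are hypothetical.

```python
from itertools import combinations, islice


def pair_batches(items, batch_size):
    """Yield lists of at most batch_size pairs, never holding all pairs at once."""
    pairs = combinations(items, 2)  # lazy iterator; nothing is computed yet
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


total = 0
for batch in pair_batches(["a", "b", "c", "d"], batch_size=4):
    total += len(batch)  # replace with the real per-batch work
print(total)  # 6
```

Peak memory is then bounded by the batch size rather than by the total number of pairs.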

I'd use a generator based on np.triu_indices.
These are the indices of the upper triangle of an n×n square matrix, where n = len(long_list).

The problem is that the entire set of indices is created first. itertools does not do this; it generates each combination one at a time.

import numpy as np

def combinations_of_2(l):
    # triu_indices(n, 1) gives every (i, j) with i < j, i.e. each pair once
    for i, j in zip(*np.triu_indices(len(l), 1)):
        yield l[i], l[j]

long_list = list('abc')
c = combinations_of_2(long_list)
list(c)

[('a', 'b'), ('a', 'c'), ('b', 'c')]
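Because np.triu_indices builds every index pair before the first one is yielded, a possible variation is to walk the upper triangle one row at a time, so that only O(n) elements are alive at once. This is a sketch of that idea, not part of the answer above; pair_blocks is a made-up name.

```python
import numpy as np


def pair_blocks(items):
    """Yield (left, right) arrays, one block per row of the upper triangle."""
    a = np.asarray(items)
    n = len(a)
    for i in range(n - 1):
        right = a[i + 1:]                   # everything paired with a[i]
        left = np.repeat(a[i], len(right))  # a[i] repeated to match
        yield left, right


for left, right in pair_blocks(list("abc")):
    print(list(zip(left.tolist(), right.tolist())))
# [('a', 'b'), ('a', 'c')]
# [('b', 'c')]
```

Each block can be handled with vectorized numpy operations and then discarded before the next row is produced.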

To get them all at once:

a = np.array(long_list)
i, j = np.triu_indices(len(a), 1)
np.stack([a[i], a[j]]).T

array([['a', 'b'],
       ['a', 'c'],
       ['b', 'c']], 
      dtype='<U1')
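Before getting everything at once for a large list, it may help to estimate how much memory the result needs. The sketch below is plain arithmetic; the 8-byte element size is an assumption (typical for integer indices on a 64-bit platform), and pair_array_bytes is a hypothetical helper.

```python
def pair_array_bytes(n, itemsize):
    """Bytes for n choose 2 pairs when each of the 2 elements takes itemsize bytes."""
    return n * (n - 1) // 2 * 2 * itemsize


# Assumption: 8 bytes per index, as for 64-bit integers.
index_bytes = pair_array_bytes(30_000, 8)
print(index_bytes)  # 7199760000
```

For 30,000 items that is roughly 7.2 GB for the two index arrays alone, before the string pairs themselves are stored.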

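Timing, using a list of 1,000 random three-letter strings (np is numpy, pd is pandas, and ascii_letters comes from the string module):
long_list = pd.DataFrame(np.random.choice(list(ascii_letters), (3, 1000))).sum().tolist()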
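A comparison along these lines can be reproduced with timeit. This is a rough sketch, not the original benchmark: the function names, the seed, and the repeat count are assumptions, and the test list is built with the random module rather than pandas.

```python
import random
import string
import timeit
from itertools import combinations

import numpy as np


def with_itertools(items):
    """All pairs as a list of tuples."""
    return list(combinations(items, 2))


def with_triu(items):
    """All pairs as an (n choose 2, 2) NumPy array."""
    a = np.array(items)
    i, j = np.triu_indices(len(a), 1)
    return np.stack([a[i], a[j]]).T


random.seed(0)
# 1,000 random three-letter strings, similar in shape to the list above.
sample = ["".join(random.choices(string.ascii_letters, k=3)) for _ in range(1000)]

for fn in (with_itertools, with_triu):
    seconds = timeit.timeit(lambda: fn(sample), number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```

The numbers depend on the machine and on the numpy version, so they are best read as relative rather than absolute.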


 