Large dataset with distinct permutations in Python

Question

I have a collection of 50 letters, and I need every distinct permutation of them written to a CSV file. Right now I am using the distinct_permutations function from more_itertools to build the list. Of the 50 letters, 40 are one letter and the remaining 10 are another. I used Mathematica to count the possible arrangements, 50!/(40! * 10!), and there are more than 10 billion of them, so I wonder whether distinct_permutations is the most efficient way to do this. I have been running this code since this morning and it is still running. Thanks.

Answer 1

Are you aware that the data will occupy about a terabyte on your hard disk? ;)
(and writing it will take about 6 hours on a typical hard drive)
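
As a rough check of that size estimate, the count and the byte total can be computed directly. This is only a sketch under an assumed file layout: one arrangement per line, 50 one-byte letters plus a newline, with no separators. A CSV with a comma between letters would be roughly twice as large, which lands near the terabyte figure.

```python
import math

n_letters = 50
n_minority = 10

# Number of distinct arrangements: 50! / (40! * 10!)
count = math.comb(n_letters, n_minority)

# Assumed layout: 50 one-byte letters plus "\n" per line, no commas
bytes_per_line = n_letters + 1
total_bytes = count * bytes_per_line

print(count)                              # 10272278170
print(round(total_bytes / 1e9), "GB")     # 524 GB
```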

This problem is equivalent to generating combinations. You can try the itertools combinations method. If that is too slow as well, consider using bit arithmetic.
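
A minimal sketch of the combinations idea, assuming the two letters are "A" (majority) and "B" (minority): choosing the positions of the minority letter with itertools.combinations fixes the whole arrangement. The function name arrangements and its parameters are illustrative, not part of any library.

```python
from itertools import combinations

def arrangements(length, n_minor, major="A", minor="B"):
    """Yield each distinct arrangement of (length - n_minor) major
    letters and n_minor minor letters, one string at a time."""
    for positions in combinations(range(length), n_minor):
        row = [major] * length
        for p in positions:
            row[p] = minor
        yield "".join(row)

# Small demonstration: 4 letters, 2 of them "B"
for s in arrangements(4, 2):
    print(s)
```

For the real problem the call would be arrangements(50, 10). Because it is a generator, nothing is held in memory beyond the current string.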

With only two types of letters, the problem is similar to generating all 50-bit numbers that contain exactly 10 one bits. There is a fast way to produce these bit patterns. During generation, convert every bit pattern to a letter combination (there are concise ways to map binary digits to your alphabet in Python, but I don't know the fastest one).
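
One concise way to do that mapping, offered as a sketch rather than as the fastest option: format the integer as a zero-padded binary string and translate the digits with str.translate. The helper name to_letters and the letters "A" and "B" are assumptions for illustration.

```python
# Translation table from binary digits to the two letters (assumed "A"/"B")
TABLE = str.maketrans("01", "AB")

def to_letters(v, width=50):
    """Render integer v as `width` letters, most significant bit first."""
    return format(v, "0{}b".format(width)).translate(TABLE)

print(to_letters(0b0011, width=4))   # AABB
print(to_letters(0b1010, width=4))   # BABA
```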

Short example:

def nextperm(v):
    t = (v | (v - 1)) + 1
    w = t | ((((t & -t) // (v & -v)) >> 1) - 1)
    return w

v = 0b0011
print("{0:b}".format(v))
while (v != 0b1100):
    v = nextperm(v)
    print("{0:b}".format(v))

gives the output

11
101
110
1001
1010
1100

which corresponds to

AABB
ABAB    
ABBA
BAAB
BABA
BBAA

In my experiment, generating 10^8 steps (1/100 of your full range) from the initial pattern v = 0b00000000000000000000000000000000000000001111111111, without any output, took 60 seconds.

Edit: one more experiment, this time with partial real output. I am sure that building the string could be done faster, but I don't know the best way in Python. My implementation generates a 50 MB file in 13 seconds (1/10000 of the real size), so the full generation will take about 1.5 days. A good implementation of the string building (and using a faster language instead of Python) might give a speedup of up to 10 times.

def nextperm(v):
    # Next larger integer with the same number of 1 bits as v
    t = (v | (v - 1)) + 1
    w = t | ((((t & -t) // (v & -v)) >> 1) - 1)
    return w

def writeout(v):
    # Convert the low 50 bits of v to letters, most significant bit first
    outs = ""
    t = v
    for i in range(50):
        outs = alphabet[(t & 1)] + outs
        t = t >> 1
    my_file.write(outs + "\n")

v = 0b00000000000000000000000000000000000000001111111111
alphabet = "AB"
my_file = open("out.txt", "w")
# Partial run of 1,000,000 steps; swap in the while line below for the full range
for i in range(1000000):
#while (v != 0b11111111110000000000000000000000000000000000000000):
    writeout(v)
    v = nextperm(v)
writeout(v)
my_file.close()
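
One possible way to reduce the string-building cost in the listing above, as a sketch only and not benchmarked here: let format produce the binary digits, collect lines in batches, and translate and write each batch in a single call. The function name write_patterns, its parameters, and the batch size are assumptions for illustration.

```python
import os
import tempfile

# Translation table from binary digits to the two letters (assumed "A"/"B")
TABLE = str.maketrans("01", "AB")

def nextperm(v):
    # Next larger integer with the same number of 1 bits as v
    t = (v | (v - 1)) + 1
    return t | ((((t & -t) // (v & -v)) >> 1) - 1)

def write_patterns(path, width, ones, batch=100000):
    """Write every `width`-bit pattern with `ones` set bits as a line of
    letters. Returns the number of lines written."""
    v = (1 << ones) - 1              # lowest pattern: all ones on the right
    last = v << (width - ones)       # highest pattern: all ones on the left
    fmt = "0{}b".format(width)
    count = 0
    lines = []
    with open(path, "w") as out:
        while True:
            lines.append(format(v, fmt))
            count += 1
            # Flush a full batch, or the final partial batch
            if v == last or len(lines) >= batch:
                out.write("\n".join(lines).translate(TABLE) + "\n")
                lines.clear()
            if v == last:
                break
            v = nextperm(v)
    return count

# Small demonstration: 4-bit patterns with 2 ones, written to a temp file
demo_path = os.path.join(tempfile.gettempdir(), "patterns_demo.txt")
print(write_patterns(demo_path, width=4, ones=2))   # 6
```

For the real data the call would be write_patterns("out.txt", width=50, ones=10), which still produces the full multi-hundred-gigabyte file.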

You can also try implementing a 'next permutation' algorithm on numpy arrays of letters to get faster output.
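
A minimal sketch of such a 'next permutation' step. Note the substitution: it works on a standard-library bytearray instead of a numpy array, purely to keep the example free of dependencies, and the function name next_permutation is hypothetical. The same index logic would carry over to a one-dimensional numpy array.

```python
def next_permutation(buf):
    """Advance buf in place to its next lexicographic permutation.
    Returns False (leaving buf unchanged) if buf was already the last one."""
    # Find the rightmost position whose value is smaller than its successor
    i = len(buf) - 2
    while i >= 0 and buf[i] >= buf[i + 1]:
        i -= 1
    if i < 0:
        return False
    # Find the rightmost value larger than buf[i] and swap the two
    j = len(buf) - 1
    while buf[j] <= buf[i]:
        j -= 1
    buf[i], buf[j] = buf[j], buf[i]
    # Reverse the tail so it becomes the smallest possible suffix
    buf[i + 1:] = buf[i + 1:][::-1]
    return True

buf = bytearray(b"AABB")
print(buf.decode())
while next_permutation(buf):
    print(buf.decode())
```

Because the buffer already holds the letters as bytes, each arrangement can be written to a file opened in binary mode without building a new string.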

Answer 1
1 2018-06-27 06:22:47
