Randomly sampling from a Python generator

Question

I'm using the loop for pair in itertools.combinations(bug_map.keys(), 2): to generate all pairs of elements in my db. The problem is that the number of elements is around 6.6 K, so the number of combinations is about 21.7 M. Also, combinations are emitted in lexicographic sort order.

Suppose I want to take random pairs from the generator without "yielding" all of them (just a subset of size n). What can I do?

Answer 1

If you can afford to hold all 6K elements in a list, first collect them, then use the standard library's random.choices() to generate a batch of samples with replacement. Sort each tuple (because combinations are sorted), then remove tuples that contain the same element more than once and tuples that duplicate another tuple. Repeat the batch generation until there are n tuples.

In my code, k is the length of each tuple and n is the number of k-length tuples to generate; both can be set freely.

This algorithm produces a probability distribution similar to generating all combinations of length k and then choosing a random subset of size n.


import random

#random.seed(0) # Do this only for testing to have reproducible random results

l = list(range(1000, 1000 + 6600)) # Example list of all input elements

k = 2 # Length of each tuple to generate
n = 30 # Number of tuples to generate

batch = max(1, n // 4) # Number of k-tuples to sample at once

res = []

while True:
    if len(res) >= n:
        res = res[:n]
        break
    a = random.choices(range(len(l)), k = k * batch) # Generate random samples from inputs with replacement
    a = sorted(res + [tuple(sorted(a[i * k + j] for j in range(k))) for i in range(batch)])
    res = []
    for e in a:
        # Keep e only if it has no repeated element and differs from the previously kept tuple
        if all(e0 != e1 for e0, e1 in zip(e[:-1], e[1:])) and (not res or res[-1] != e):
            res.append(e)

print([tuple(l[i] for i in tup) for tup in res])

Answer 2

This may seem trivial, but if your desired number of samples is considerably smaller than the total number of possible combinations (21.8 M), then you could just repeatedly call random.sample until you have sufficiently many. There may be collisions, but (again, if the required number of samples is comparatively small) their probability will be negligible and will not cause a slow-down.

import random

lst = range(6000)
n = 1000000
k = 2

samples = set()
while len(samples) < n:
    samples.add(tuple(random.sample(lst, k)))

Even for 1,000,000 random samples, this produced only about 12k collisions, i.e. about 1% of "wasted" iterations, which is probably not much of a problem.
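As a rough cross-check of that collision rate, here is a minimal sketch that estimates how many draws are expected before n distinct pairs have been seen. It is not part of the original answer: the helper name expected_draws is hypothetical, and it assumes every ordered pair is equally likely on each draw (random.sample returns pairs unsorted, so (a, b) and (b, a) count as different outcomes here).

```python
import math

def expected_draws(total, n):
    """Expected number of uniform draws (with replacement) from `total`
    equally likely outcomes until `n` distinct outcomes have been seen."""
    # Once i distinct outcomes are known, a new one turns up with
    # probability (total - i) / total, so it takes total / (total - i)
    # draws on average to find the next one.
    return sum(total / (total - i) for i in range(n))

items, k, n = 6000, 2, 1_000_000
total = math.perm(items, k)  # number of ordered k-tuples without repetition
draws = expected_draws(total, n)
wasted = draws - n
print(f"expected draws: {draws:.0f}, wasted: {wasted:.0f} ({wasted / n:.2%})")
```

Under these assumptions the wasted share comes out on the order of 1% of n, in the same range as the measurement above.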

Note that, unlike combinations, the pairs returned by random.sample are not ordered (the first element can be larger than the second), so you might want to use tuple(sorted(...)).
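The sorted variant can be sketched as a small helper. This is an illustration under stated assumptions rather than the answer's own code: the function name sample_combinations is hypothetical, it samples index positions so that each tuple follows input order the way itertools.combinations does, and it raises an error when more tuples are requested than exist, since the loop could otherwise never finish.

```python
import math
import random

def sample_combinations(items, k, n):
    """Draw n distinct k-combinations from items by rejection sampling."""
    items = list(items)
    if n > math.comb(len(items), k):
        raise ValueError("n exceeds the number of possible combinations")
    seen = set()
    while len(seen) < n:
        # Sorting the sampled positions makes (a, b) and (b, a) the same
        # key, so each combination can be stored only once.
        seen.add(tuple(sorted(random.sample(range(len(items)), k))))
    # Map positions back to the actual elements.
    return [tuple(items[i] for i in idx) for idx in seen]

pairs = sample_combinations(range(1000, 1000 + 6600), 2, 30)
print(pairs[:5])
```

For the question's data the call would look like sample_combinations(bug_map.keys(), 2, n). The returned list is in no particular order; wrap it in sorted(...) if lexicographic order matters.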


Answer 1
0 2020-09-25 08:51:59

Answer 2
0 2020-09-25 10:14:55
