
Taking random samples from a Python generator

I'm using the loop for pair in itertools.combinations(bug_map.keys(), 2): to generate all pairs of elements in my db. The problem is that there are around 6.6 K elements, so the number of combinations is about 21.7 M. Also, combinations are emitted in lexicographic sort order.
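As a quick check of the scale involved, the number of unordered pairs can be computed directly. This is only a sketch: 6600 is an assumed round figure standing in for "around 6.6 K", and the variable names are illustrative.

```python
import math

n_elements = 6600  # assumed round figure for "around 6.6 K" elements
n_pairs = math.comb(n_elements, 2)  # n * (n - 1) / 2 unordered pairs
print(n_pairs)  # 21776700
```

For exactly 6600 elements that is roughly 21.8 M pairs, the same order of magnitude as the figure quoted in the question.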

Suppose I want to take random pairs from the generator without "yielding" all of them (just a subset of size n). What can I do?

If you can afford to hold all 6K elements in a list, first collect them, then use the standard library's random.choices() to generate a batch of samples with replacement. Sort each sampled tuple (because combinations are sorted), then remove tuples that contain the same element more than once and tuples that duplicate one already kept. Repeat the batch generation until there are at least n tuples.

In my code, k is the length of each tuple to generate and n is the number of k-length tuples to generate; you may set both to any value.

This algorithm produces a probability distribution similar to that of creating all combinations of length k and then choosing a random subset of size n.


import random

#random.seed(0) # Do this only for testing to have reproducible random results

l = list(range(1000, 1000 + 6600)) # Example list of all input elements

k = 2 # Length of each tuple to generate
n = 30 # Number of tuples to generate

batch = max(1, n // 4) # Number of k-tuples to sample at once

res = []

while True:
    if len(res) >= n:
        res = sorted(random.sample(res, n)) # Drop surplus tuples at random; res[:n] would favor the smallest tuples
        break
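    a = random.choices(range(len(l)), k = k * batch) # Sample indices with replacement
    # Split into sorted k-tuples, merge with the tuples kept so far, and sort the whole list
    a = sorted(res + [tuple(sorted(a[i * k : (i + 1) * k])) for i in range(batch)])
    res = []
    for e in a:
        # Skip tuples with a repeated index (adjacent after sorting) and tuples equal to the previous one
        if all(e0 != e1 for e0, e1 in zip(e[:-1], e[1:])) and (not res or res[-1] != e):
            res.append(e)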

print([tuple(l[i] for i in tup) for tup in res])
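For comparison, the brute-force reference that the batch approach is meant to approximate can be sketched as follows. It is only practical for small inputs, because it materializes every combination; the helper name sample_by_enumeration and the 10-element input are illustrative assumptions, not part of the original answer.

```python
import itertools
import random

def sample_by_enumeration(items, k, n):
    """Reference approach: list every k-combination, then pick n of them."""
    all_combos = list(itertools.combinations(items, k))
    # random.sample draws without replacement, so the result has no duplicates
    return sorted(random.sample(all_combos, n))

small = list(range(10))  # tiny input so that full enumeration stays cheap
picked = sample_by_enumeration(small, 2, 5)
print(picked)
```

Running both approaches many times on a small input and comparing how often each tuple appears is one way to check the distribution empirically.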

This may seem trivial, but if the desired number of samples is considerably smaller than the total number of possible combinations (21.8 M), you could just repeatedly generate a random.sample until you have sufficiently many. There may be collisions, but (again, if the required number of samples is comparatively small) their probability will be negligible and they will not cause a slow-down.

import random

lst = range(6000)
n = 1000000
k = 2

samples = set()
while len(samples) < n:
    samples.add(tuple(random.sample(lst, k)))

Even for 1,000,000 random samples, this produced only about 12k collisions, i.e. about 1% "wasted" iterations, which is probably not much of a problem.
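The observed collision count can be sanity-checked with a rough estimate. The sketch below assumes every ordered pair is equally likely (random.sample returns ordered pairs, so (a, b) and (b, a) count as different outcomes) and uses a logarithmic approximation for the expected number of draws; the variable names are illustrative.

```python
import math

population = 6000
k = 2
n = 1_000_000

# Number of equally likely outcomes: ordered k-tuples of distinct elements
outcomes = math.perm(population, k)

# Approximate expected draws until n distinct outcomes have been seen:
# sum of outcomes / (outcomes - i) for i < n, approximated by an integral
expected_draws = outcomes * math.log(outcomes / (outcomes - n))
wasted = expected_draws - n
print(f"about {wasted:.0f} wasted draws ({wasted / n:.1%} of n)")
```

Under these assumptions the estimate lands on the order of 1% of n, in line with the count reported above.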

Note that, unlike combinations, the pairs returned by random.sample are not ordered (the first element can be larger than the second), so you might want to use tuple(sorted(...)).
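A variant with the tuple(sorted(...)) adjustment might look like the sketch below. The function name sample_combinations and the size guard are assumptions added for illustration; the population is assumed to be a sequence (for a dict such as bug_map, pass list(bug_map)) with mutually comparable elements.

```python
import math
import random

def sample_combinations(population, k, n):
    """Draw n distinct k-combinations by rejection sampling (hypothetical helper)."""
    if n > math.comb(len(population), k):
        raise ValueError("n exceeds the number of possible combinations")
    samples = set()
    while len(samples) < n:
        # Sorting makes (a, b) and (b, a) the same tuple, as in itertools.combinations
        samples.add(tuple(sorted(random.sample(population, k))))
    return samples

pairs = sample_combinations(range(6000), 2, 1000)
print(len(pairs))  # 1000
```

Because sorted tuples halve the number of distinct outcomes for k = 2, collisions become somewhat more frequent than in the unsorted version, though still rare while n is small relative to the total.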

