
Taking random samples from a Python generator

I'm using for pair in itertools.combinations(bug_map.keys(), 2): to generate all pairs of elements in my database. The problem is that there are around 6.6 K elements, so the number of combinations is about 21.7 M. Also, combinations are emitted in lexicographic sort order.

Supposing I want to take random pairs from the generator without yielding all of them (just a subset of size n), what can I do?

If you're allowed to load all 6.6 K elements into a list, you can first collect them and then use Python's standard random.choices() to generate a batch of samples with replacement. Then sort each sample (since combinations are sorted), discard tuples that contain the same element twice or more, and discard duplicate tuples. Repeat the batch generation until you have the desired number n of tuples.

In my code you may specify any k as the length of the tuples to generate, and n as the number of k-length tuples to generate.

This algorithm produces a probability distribution similar to generating all combinations of length k and then choosing a random subset of size n.


import random

#random.seed(0) # Do this only for testing to have reproducible random results

l = list(range(1000, 1000 + 6600)) # Example list of all input elements

k = 2 # Length of each tuple to generate
n = 30 # Number of tuples to generate

batch = max(1, n // 4) # Number of k-tuples to sample at once

res = []

while True:
    if len(res) >= n:
        res = res[:n]
        break
    a = random.choices(range(len(l)), k = k * batch) # Generate random samples from inputs with replacement
    # Form k-tuples from the flat draw, sort within each tuple, and merge with res
    a = sorted(res + [tuple(sorted(a[i * k + j] for j in range(k))) for i in range(batch)])
    res = []
    for e in a:
        # Keep only tuples with pairwise-distinct elements, skipping duplicates
        # (a is sorted, so equal tuples are adjacent)
        if all(e0 != e1 for e0, e1 in zip(e[:-1], e[1:])) and (not res or res[-1] != e):
            res.append(e)

print([tuple(l[i] for i in tup) for tup in res])
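The same batch idea can be wrapped into a reusable function (a sketch; the name sample_combinations and the set-based dedup are my additions, not from the code above):

```python
import random

def sample_combinations(items, k, n):
    """Sample n distinct sorted k-tuples of distinct elements from items."""
    res = []
    batch = max(1, n // 4)  # how many candidate tuples to draw per round
    while len(res) < n:
        # Draw k * batch indices with replacement, cut them into k-tuples,
        # sort within each tuple, and merge with what we already have
        draws = random.choices(range(len(items)), k=k * batch)
        cand = sorted(set(res) | {
            tuple(sorted(draws[i * k:(i + 1) * k])) for i in range(batch)
        })
        # Keep only tuples whose k indices are pairwise distinct
        res = [t for t in cand if len(set(t)) == k]
    return [tuple(items[i] for i in t) for t in res[:n]]

pairs = sample_combinations(list(range(1000, 7600)), 2, 30)
```

Using a set for the merge makes the duplicate check implicit, at the cost of losing the incremental sorted-scan of the original loop.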

This may seem trivial, but if your desired number of samples is considerably smaller than the total number of possible combinations (~21.8 M), then you could just repeatedly generate a random.sample until you have sufficiently many. There may be collisions, but (again, if the required number of samples is comparatively small) the probability of those is negligible and will not cause a slow-down.

import random

lst = range(6000)
n = 1000000
k = 2

samples = set()  # the set silently discards colliding (duplicate) pairs
while len(samples) < n:
    samples.add(tuple(random.sample(lst, k)))

Even for 1,000,000 random samples, this produced only about 12 k collisions, i.e. about 1% of "wasted" iterations, which is probably not much of a problem.
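That figure can be sanity-checked with a birthday-style estimate (a sketch, assuming random.sample draws uniformly from the N = 6000 * 5999 equally likely ordered pairs):

```python
N = 6000 * 5999   # ordered pairs of two distinct elements from lst
n = 1_000_000     # number of draws

# Expected number of distinct pairs after n uniform draws with replacement,
# so the difference is the expected number of "wasted" (colliding) draws
expected_unique = N * (1 - (1 - 1 / N) ** n)
expected_collisions = n - expected_unique   # roughly n**2 / (2 * N)
```

This lands in the low tens of thousands, the same order as the ~12 k observed above.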

Note that unlike combinations, the pairs returned by random.sample are not ordered (the first element can be larger than the second), so you might want to use tuple(sorted(...)).
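For example, sorting each sampled pair puts it into the same canonical ascending order that itertools.combinations emits, so (a, b) and (b, a) collapse to a single set key:

```python
import random

lst = range(6000)
pair = tuple(sorted(random.sample(lst, 2)))  # canonical (smaller, larger) order
```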
