
Yield a random permutation from a list of generators of known length

I have a sequence of generators whose items each take a non-trivial amount of memory (the generators are ipaddress.IPv4Network instances, and iterating one yields full ipaddress.IPv4Address objects).

gens = [a, b, c, ...]

Each generator has a deterministic number of elements it will yield, eg:

gen_lens = [17000000, 1024, 8192, ...]
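
As a side note, if every entry in gens really is an ipaddress.IPv4Network, the lengths probably do not need to be hard-coded. A minimal sketch, assuming two made-up example networks (they are not from the question), reads the counts from num_addresses and relies on the fact that networks support indexing:

```python
import ipaddress

# Hypothetical example networks; the real list would be whatever `gens` holds.
gens = [
    ipaddress.IPv4Network("10.0.0.0/22"),     # 1024 addresses
    ipaddress.IPv4Network("192.168.0.0/19"),  # 8192 addresses
]

# num_addresses gives the element count without iterating the network.
gen_lens = [net.num_addresses for net in gens]

# Networks can be indexed directly, so no address is generated up front.
first = gens[0][0]
last = gens[0][-1]
```

This is why gens[x][y] can work at all: a plain Python generator is not subscriptable, but an IPv4Network is.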

I would like to take batches of n yielded values, in random order. Each item from any of the generators must be selected only once.

My current idea is to take the total number of elements that can be yielded (equal to the maximum flat index + 1), then visit those indexes in random order using something like the Fisher-Yates-Knuth algorithm, yielding the item at each random index:

random_indexes = list(range(sum(gen_lens)))
random.shuffle(random_indexes)  # shuffles in place and returns None
for i in random_indexes:
    # some windowing logic here to check which generator we should get from and set index appropriately, x = generator index, y = i - sum(gen_lens[0:x])
    yield gens[x][y]
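
The windowing step described in the comment above could be sketched with a table of cumulative lengths and a binary search. The names locate and cumulative_lens are hypothetical, and the lengths are tiny made-up values so the mapping is easy to check by hand:

```python
import bisect
import itertools


def locate(i, cumulative_lens):
    """Map a flat index i to (generator index x, offset y inside that generator)."""
    # cumulative_lens[k] is the total length of generators 0..k inclusive,
    # so the first entry strictly greater than i identifies the generator.
    x = bisect.bisect_right(cumulative_lens, i)
    y = i - (cumulative_lens[x - 1] if x > 0 else 0)
    return x, y


gen_lens = [3, 2, 4]  # hypothetical small lengths
cumulative_lens = list(itertools.accumulate(gen_lens))  # [3, 5, 9]

pairs = [locate(i, cumulative_lens) for i in range(sum(gen_lens))]
```

Here y matches the i - sum(gen_lens[0:x]) formula from the comment, but each lookup costs a binary search rather than a fresh sum.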

So the end result is, I have a new generator which will yield a random permutation of all elements from my input generators, without having to store all the results of what my sub-generators are yielding.

This still requires building a list of indexes, which is expensive when there are millions of them. Is there a way around that? Can anyone suggest a better approach?

Proposal: use two-dimensional indexes. Because generating the second-dimension indexes for every generator up front is expensive, I do it for only one generator at a time:

gens = [a, b, c, ...]
gen_lens = [17000000, 1024, 8192, ...]
shuffled_gens_indexes = list(range(len(gens)))
random.shuffle(shuffled_gens_indexes)
for gens_index in shuffled_gens_indexes:
    shuffled_gen_items_indexes = list(range(gen_lens[gens_index]))
    random.shuffle(shuffled_gen_items_indexes) 
    for gen_items_index in shuffled_gen_items_indexes:
        yield gens[gens_index][gen_items_index]

This is straightforward: it picks the generators in random order and yields all the items of one generator, shuffled, before moving on to the next, so items from different generators are never interleaved.

This is what I did in the end. I believe this solution is better than https://stackoverflow.com/a/65240594/1014237 because it avoids random.shuffle, so it never stores too many elements (i.e. it stores only up to batch_length random indexes, instead of up to max(gen_lens)). The random indexes are generated only when needed.

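import random


def get_random_element(data, data_length):
    # Incremental Fisher-Yates: each step swaps one random remaining element into slot pos and yields it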
    pos = data_length
    while pos > 0:
        idx = random.randrange(start=0, stop=pos)
        pos -= 1
        if idx != pos:
            data[pos], data[idx] = data[idx], data[pos]
        yield data[pos]


def get_random_idx_generator(n):
    # Create a generator of random indexes, n long
    return get_random_element(list(range(n)), n)
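
One thing to note is that list(range(n)) still materializes all n indexes before the first one is yielded. A possible variation, sketched here as an assumption rather than as part of the answer above, keeps only the slots that have actually been swapped in a dict; lazy_random_indexes is a hypothetical name:

```python
import random


def lazy_random_indexes(n, rng=random):
    """Yield 0..n-1 in random order, storing only the slots swapped so far."""
    swapped = {}  # slot -> value held there, for slots that differ from their own index
    for pos in range(n - 1, -1, -1):
        idx = rng.randrange(pos + 1)
        # Untouched slots implicitly hold their own index.
        value = swapped.get(idx, idx)
        # Move whatever sits in slot pos into slot idx (the Fisher-Yates swap).
        swapped[idx] = swapped.get(pos, pos)
        # Slot pos is never read again, so its entry can be dropped.
        swapped.pop(pos, None)
        yield value
```

Memory then grows with the number of indexes drawn so far instead of with n up front. If the whole permutation is consumed, the dict can still get large partway through, so this mostly helps when only some batches are ever requested.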

I consume the generator returned by get_random_idx_generator with itertools.islice, so that I only hold as many random indexes as I need at a given moment. The function below also uses each index and the lengths of the data lists to work out which list it needs to read from.

import ipaddress
import itertools
from typing import Iterator, Union


# Yield up to batch_size random IPs per call, using the random idx generator
def get_randomized_ips_batch(ipnetworks_list, ipnetworks_list_lens,
                             random_idx_generator, batch_size=1024,
                             as_int=False) -> Iterator[Union[ipaddress.IPv4Address, int]]:
    random_indexes_batch = list(itertools.islice(random_idx_generator, batch_size))
    # Figure out which ipnetwork_list our index is pointing to and yield it
    for idx in random_indexes_batch:
        cumulative_len = 0
        gen_idx = 0
        for ipnetwork_len in ipnetworks_list_lens:
            if idx - cumulative_len >= ipnetwork_len:
                cumulative_len += ipnetwork_len
                gen_idx += 1
                continue
            else:
                addr = ipnetworks_list[gen_idx][idx - cumulative_len]
                yield int(addr) if as_int else addr
                break
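
To illustrate how the pieces might be driven, here is a small self-contained sketch. The two networks are tiny made-up examples, and random_indexes, lookup and iter_batches are hypothetical stand-ins for the functions above, not the exact code from this answer. The point is that a single index generator is shared across all batches, which is what guarantees that no address is yielded twice:

```python
import ipaddress
import itertools
import random

# Hypothetical small networks so the whole permutation can be checked quickly.
networks = [ipaddress.IPv4Network("10.0.0.0/29"), ipaddress.IPv4Network("10.0.1.0/30")]
lens = [net.num_addresses for net in networks]  # [8, 4]


def random_indexes(n):
    # Stand-in for get_random_idx_generator: incremental Fisher-Yates over list(range(n)).
    data = list(range(n))
    for pos in range(n - 1, -1, -1):
        idx = random.randrange(pos + 1)
        data[pos], data[idx] = data[idx], data[pos]
        yield data[pos]


def lookup(idx):
    # Walk the lengths until idx falls inside one network, then index into it.
    for net, length in zip(networks, lens):
        if idx < length:
            return net[idx]
        idx -= length
    raise IndexError(idx)


def iter_batches(batch_size):
    # One shared index stream; each islice call resumes where the last one stopped.
    index_stream = random_indexes(sum(lens))
    while True:
        batch = [lookup(i) for i in itertools.islice(index_stream, batch_size)]
        if not batch:
            return
        yield batch


all_batches = list(iter_batches(batch_size=5))
```

With 12 addresses and a batch size of 5, this produces batches of 5, 5 and 2 addresses, and every address appears exactly once across them.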
