
Generating a random (equal probability) combination with replacement

I want to generate one random combination out of all possible combinations_with_replacement. The tricky bit is that I want each of the possible outcomes to have the same probability, without generating (not even implicitly) all the possible outcomes.

For example:

import itertools
import random

random.choice(list(itertools.combinations_with_replacement(range(4), 2)))

That approach is way too slow (and memory-expensive) because it needs to create all possible combinations, whereas I only want one.

It's not so bad if I first compute how many combinations_with_replacement there will be and use random.randrange together with next and itertools.islice on the itertools.combinations_with_replacement iterator. That doesn't need to generate all possible combinations (except in the worst case), but it's still too slow.
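A minimal sketch of that index-then-islice idea (my own illustration; it assumes Python 3.8+ for math.comb, and uses the fact that there are C(n + k - 1, k) combinations with replacement of k items from n):

import itertools
import random
from math import comb  # Python 3.8+

def random_comb_islice(n, k):
    # Total number of combinations with replacement: C(n + k - 1, k).
    total = comb(n + k - 1, k)
    # Draw a uniform index, then advance the iterator that far.
    index = random.randrange(total)
    return next(itertools.islice(
        itertools.combinations_with_replacement(range(n), k),
        index, None))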

On the other hand, the recipe mentioned in the itertools documentation is fast, but not every combination has the same probability.
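For reference, that recipe is essentially the following (paraphrased from the itertools recipes section, so treat it as an approximation of the real thing). It draws k independent indices and sorts them, which makes it uniform over sequences, not over sorted combinations: multisets with repeated elements have fewer distinct orderings and therefore come up less often.

import random

def random_combination_with_replacement(iterable, r):
    # Uniform over index *sequences*; sorting collapses several
    # sequences onto one combination, hence the bias.
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.randrange(n) for _ in range(r))
    return tuple(pool[i] for i in indices)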

Well, I'm in a bit of a dilemma, because I've found an algorithm that works, but I don't know why. So do what you want with it; maybe some mathematician in the room can work out the probabilities, but it does empirically work. The idea is to pick one element at a time, increasing the probability of the already-selected elements. I suspect the reasoning must be similar to that of reservoir sampling, but I didn't work it out.

from random import choice
from itertools import combinations_with_replacement

population = ["A", "B", "C", "D"]
k = 3

def random_comb(population, k):
    idx = []
    indices = list(range(len(population)))
    for _ in range(k):
        # Pick an index uniformly from the current pool ...
        idx.append(choice(indices))
        # ... and put an extra copy of it back, so elements that have
        # already been picked become more likely on later draws.
        indices.append(idx[-1])
    return tuple(population[i] for i in sorted(idx))

combs = list(combinations_with_replacement(population, k))
counts = {c: 0 for c in combs}

for _ in range(100000):
    counts[random_comb(population, k)] += 1

for comb, count in sorted(counts.items()):
    print("".join(comb), count)

The output is the number of times each possibility has appeared after 100,000 runs:

AAA 4913
AAB 4917
AAC 5132
AAD 4966
ABB 5027
ABC 4956
ABD 4959
ACC 5022
ACD 5088
ADD 4985
BBB 5060
BBC 5070
BBD 5056
BCC 4897
BCD 5049
BDD 5059
CCC 5024
CCD 5032
CDD 4859
DDD 4929
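A quick back-of-the-envelope check of why this is uniform (my own note, for the mathematician in the room): the pick-and-reinforce rule is exactly Pólya's urn scheme, starting with one ball per element and adding a copy of each drawn ball. With n = 4 and k = 3, P(AAA) = (1/4)(2/5)(3/6) = 1/20, and P(ABC) = 3! * (1/4)(1/5)(1/6) = 1/20 as well (six orderings, each with probability 1/120). Since there are C(4 + 3 - 1, 3) = 20 multisets, ~5,000 hits per outcome out of 100,000 runs is exactly what uniformity predicts.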

As you did not provide any estimates for the parameters in your task, here is an approach for small k.

The basic idea: acceptance-rejection sampling with a full restart whenever a partial solution becomes infeasible (i.e. violates the sorted-order characteristic). Of course the probability of finishing without a restart shrinks roughly like 1/k! (compare with bogosort). No extra memory is used.

The following code compares this approach with the original, a naive non-uniform one, and a non-uniform one based on the other (now deleted) answer (which had an upvote). The code is pretty much garbage and just for demo purposes:

Code:

import itertools
import random
from time import perf_counter
from collections import deque
n = 30
k = 4
its = 100000  # monte-carlo analysis -> will take some time with these values!

sample_space = itertools.combinations_with_replacement(range(n), k)
flat_map = {}  # for easier counting / analysis
for ind, i in enumerate(sample_space):
    flat_map[i] = ind

def a(n, k):
    """ Original slow approach """
    return random.choice(list(itertools.combinations_with_replacement(range(n), k)))

def b(n, k):
    """ Naive attempt -> non-uniform """
    chosen = [random.randrange(n) for _ in range(k)]
    return tuple(sorted(chosen))

def c(population, k):
    """ jdehesa solution (hopefully not broken by my modifications) """
    choices = [i for i in range(population) for _ in range(k)]
    return tuple(sorted(random.sample(choices, k)))

def d(n, k):
    """ Acceptance-rejection sampling with restart using python's list """
    chosen = []
    while True:
        if len(chosen) == k:
            return tuple(chosen)
        new_element = random.randint(0, n-1)
        if len(chosen) > 0:
            if new_element >= chosen[-1]:
                chosen.append(new_element)
            else:
                chosen = []  # partial solution violates sorted order -> full restart
        else:
            chosen.append(new_element)

def d2(n, k):
    """ Acceptance-rejection sampling with restart using deque """
    chosen = deque()
    while True:
        if len(chosen) == k:
            return tuple(chosen)
        new_element = random.randint(0, n-1)
        if len(chosen) > 0:
            if new_element >= chosen[-1]:
                chosen.append(new_element)
            else:
                chosen.clear()  # restart; keep reusing the same deque
        else:
            chosen.append(new_element)

start = perf_counter()
a_result = [flat_map[a(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

start = perf_counter()
b_result = [flat_map[b(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

start = perf_counter()
c_result = [flat_map[c(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

start = perf_counter()
d_result = [flat_map[d(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

start = perf_counter()
d2_result = [flat_map[d2(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

import matplotlib.pyplot as plt

f, arr = plt.subplots(5, sharex=True, sharey=True)
arr[0].hist(a_result, label='original')
arr[1].hist(b_result, label='naive (non-uniform)')
arr[2].hist(c_result, label='jdehesa (non-uniform)')
arr[3].hist(d_result, label='Acceptance-rejection restart -> list')
arr[4].hist(d2_result, label='Acceptance-rejection restart  -> deque')

for i in range(5):
    arr[i].legend()

plt.show()

Output:

s:  546.1523445801055
s:  1.272424016672062
s:  3.058098026099742
s:  12.665841491509354
s:  13.14264200539003

[Figure: one histogram per method over the flattened combination indices — original, naive (non-uniform), jdehesa (non-uniform), acceptance-rejection restart with list, acceptance-rejection restart with deque.]

Yes, I put those labels in a sub-optimal position.

Alternative timings:

Only comparing the original approach with the deque-based AR sampling. Only the relative timings matter here.

n=100, k=3 :

s:  22.6498539618067
s:  0.038274503506364965

n=100, k=4 :

s:  7.047153613584993
s:  0.0009363589822841689

Remark: one might argue that the original approach should re-use the sample space across draws (which would shift those benchmarks), if memory allows that storage at all.
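A minimal sketch of that amortized variant (my own illustration, not part of the benchmark above): materialize the sample space once, then every subsequent draw is a cheap random.choice, trading memory for per-draw speed.

import itertools
import random

n, k, its = 30, 4, 100000

# Pay the generation cost once ...
space = list(itertools.combinations_with_replacement(range(n), k))
# ... then each draw is O(1) and still uniform.
samples = [random.choice(space) for _ in range(its)]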
