Iterating over a subset of a Cartesian product in which all elements are selected (almost) equally

Question

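I have a large set of samples that can be described by three parameters (let's call them a, b and c), for example in a tuple (a, b, c), and each parameter can take a finite number of values. For example, a has 25 possible values (indexed 0..24), b has 20 possible values, and c has 3 possible values. Every combination of a, b and c is represented in this dataset, so in this example my dataset has 1500 samples (25 × 20 × 3).

I want to randomly select a subset of n samples from this dataset (without duplicates). However, this random sample must have the following property: all possible values of a, b and c are equally represented (or as close to equally as possible, if the number of selected samples is not divisible by the number of possible values of a parameter).

For example, if I select 100 samples, I want each value of a to be represented 4 times, each value of b 5 times, and each value of c 33 times (one value may be represented 34 times to reach the total number of selected samples; it does not matter which value that is). I don't care about the exact combinations of (a, b, c), as long as the total number of times each parameter value appears is correct.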
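
The target counts in this example follow from integer division: each value of a parameter appears n // n_i times, and n % n_i of its values appear once more. A minimal sketch of that arithmetic (the helper name target_counts is hypothetical and only used for illustration):

```python
def target_counts(n_desired, n_values):
    """Return (base, extra): every value appears `base` times,
    and `extra` of the values appear one additional time."""
    return divmod(n_desired, n_values)

for name, n_values in (("a", 25), ("b", 20), ("c", 3)):
    base, extra = target_counts(100, n_values)
    print(f"{name}: {base} times each, {extra} value(s) appear {base + 1} times")
```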

My current implementation is as follows:

import random

n_a = 25
n_b = 20
n_c = 3

n_desired = 100

# generate random ordering for selections
order_a = random.sample(range(n_a), k=n_a)
order_b = random.sample(range(n_b), k=n_b)
order_c = random.sample(range(n_c), k=n_c)

# select random samples
samples = []
for i in range(n_desired):
    idx_a = order_a[i % n_a]
    idx_b = order_b[i % n_b]
    idx_c = order_c[i % n_c]

    samples.append((idx_a, idx_b, idx_c))

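(I know this code could be written a bit differently, e.g. using list comprehensions, using itertools.cycle instead of i % n indexing, or combining all the operations on a, b and c, but I find this version more readable, also because a, b and c have meaningful names in the original code that are irrelevant to this question.)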

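By generating a random ordering of the possible values of a, b and c and cycling through them, we ensure that the numbers of occurrences of the parameter values differ by at most 1 (first all parameter values are selected once, then twice, then three times, etc.).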

We can verify that this code achieves the desired result (equal representation of all possible parameter values, ±1):

from collections import Counter

count_a = Counter()
count_b = Counter()
count_c = Counter()

count_a.update(sample[0] for sample in samples)
count_b.update(sample[1] for sample in samples)
count_c.update(sample[2] for sample in samples)

print(f'a values are represented between {min(count_a.values())} and {max(count_a.values())} times')
print(f'b values are represented between {min(count_b.values())} and {max(count_b.values())} times')
print(f'c values are represented between {min(count_c.values())} and {max(count_c.values())} times')

This prints the following result:

a values are represented between 4 and 4 times
b values are represented between 5 and 5 times
c values are represented between 33 and 34 times

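We can also verify that this code does not select duplicate combinations of a, b and c, using the property of sets that they do not allow duplicate values: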

print(len(set(samples)))

This prints 100, which matches the value of n_desired.

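However, one problem with this implementation is that it only works when n_desired ≤ lcm(n_a, n_b, n_c), where lcm() is the least common multiple (the smallest positive integer divisible by n_a, n_b and n_c). In our example, lcm(n_a, n_b, n_c) = lcm(25, 20, 3) = 300. If we run the above implementation with n_desired > 300, the selected samples repeat with a period of 300. This is undesirable, because it ignores 80% of the original dataset (1200 of the 1500 samples) and does not allow us to select more than 300 unique samples.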
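
The period can be checked directly. The sketch below restates the cyclic scheme in a compact form (the function names lcm and cyclic_samples are chosen here for illustration) and requests more samples than the period allows:

```python
import random
from functools import reduce
from math import gcd

def lcm(*values):
    """Least common multiple of one or more positive integers."""
    return reduce(lambda x, y: x * y // gcd(x, y), values)

def cyclic_samples(sizes, n_desired):
    """Cycle through one random ordering per parameter, as in the code above."""
    orders = [random.sample(range(size), k=size) for size in sizes]
    return [tuple(order[i % len(order)] for order in orders)
            for i in range(n_desired)]

sizes = (25, 20, 3)
period = lcm(*sizes)
samples = cyclic_samples(sizes, 1000)
print(period)             # 300
print(len(set(samples)))  # 300: only the first `period` samples are unique
```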

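A simple solution would be to ensure that lcm(n_a, n_b, n_c) = n_a × n_b × n_c, which is the case if all three are distinct primes (more generally, if they are pairwise coprime). However, I want the algorithm to work for any values, partly because I cannot ensure that all values are prime (for example, in my application n_a is always the square of an integer).

Simply using itertools.product(range(n_a), range(n_b), range(n_c)) to generate a list gives me all possible combinations, but they are in order, and if we shuffle this complete list and select the first n_desired samples, we lose the property of equal representation of all possible parameter values.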
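
A small experiment illustrates the loss of balance. This is only a sketch using the example sizes from above; the printed range varies from run to run, but it is typically wider than the 4 to 4 obtained with the cyclic scheme:

```python
import itertools
import random
from collections import Counter

n_a, n_b, n_c = 25, 20, 3
n_desired = 100

# Shuffle the full Cartesian product and keep the first n_desired tuples.
all_samples = list(itertools.product(range(n_a), range(n_b), range(n_c)))
random.shuffle(all_samples)
subset = all_samples[:n_desired]

# Count how often each value of a occurs; values that never occur count as 0.
count_a = Counter(sample[0] for sample in subset)
occurrences = [count_a[value] for value in range(n_a)]
print(f"a values are represented between {min(occurrences)} and {max(occurrences)} times")
```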

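This is where I'm stuck, because my knowledge of combinatorics is insufficient to solve this, and I don't know which terms to search for to find a solution. How would I go about solving this?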

Answer 1

You can generate all the random values for a, b and c separately (a list of n_desired values for each), and then combine them into a list of tuples.

import random

# n is the number of distinct values; values are drawn from range(n)
# k is the number of samples, i.e. length of the resulting list
def generate(n, k):
    # the values that are evenly distributed
    l1 = list(range(n)) * (k // n)
    # remaining values, each of which appears one more time than the others
    l2 = random.sample(range(n), k % n)
    l = l1 + l2
    random.shuffle(l)
    return l
    
n_a = 25
n_b = 20
n_c = 3
n_desired = 100
l = list(zip(generate(n_a, n_desired), generate(n_b, n_desired), generate(n_c, n_desired)))
print(l)
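
As a quick check, the sketch below restates generate in condensed form and confirms that each column is balanced. It also counts how many tuples occur more than once, since three independently shuffled columns can pair the same values repeatedly; that number varies between runs.

```python
import random
from collections import Counter

def generate(n, k):
    # Same construction as above: k // n full copies plus k % n extra values.
    values = list(range(n)) * (k // n) + random.sample(range(n), k % n)
    random.shuffle(values)
    return values

n_desired = 100
columns = [generate(n, n_desired) for n in (25, 20, 3)]
samples = list(zip(*columns))

for name, column in zip("abc", columns):
    counts = Counter(column).values()
    print(f"{name}: between {min(counts)} and {max(counts)} times")

# Independent shuffles can produce the same tuple more than once.
print(f"{n_desired - len(set(samples))} duplicate tuples")
```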

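Duplicates can be removed and then resampled with the following function. First, it splits the list into unique and duplicate samples. Then it tries to rearrange the values of the duplicates so that new unique samples are produced. If it fails to reduce the number of duplicates, it removes some of the unique samples and uses them to generate new ones.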

def remove_duplicates(l):
    unique = set()
    duplicates = []
    for t in l:
        if t in unique:
            duplicates.append(t)
        else:
            unique.add(t)
    n_duplicates = len(duplicates)
    
    # iterations = 0
    # n_retries = 0
    while n_duplicates > 0:
        while n_duplicates > 0:
            # iterations += 1
            # print(n_duplicates)
            a, b, c = map(list, zip(*(duplicates)))
            for x in a, b, c:
                random.shuffle(x)
            duplicates = []
            for t in zip(a, b, c):
                if t in unique:
                    duplicates.append(t)
                else:
                    unique.add(t)
            if len(duplicates) == n_duplicates:
                break
            n_duplicates = len(duplicates)
        if n_duplicates > 0:
            # n_retries += 1
            n_recycled = min(n_duplicates, len(unique))
            recycled = random.sample(list(unique), n_recycled)
            unique = unique - set(recycled)
            duplicates += recycled
    # print(iterations, n_retries)
    return unique

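This works well if n_desired is less than half of all possible samples (n_a * n_b * n_c), but otherwise it takes many iterations to finish. This problem can be solved by generating the samples that are not included in the final set: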

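import itertools

# Assumption: generate_samples is not defined above; it is taken here to chain generate() and remove_duplicates().
def generate_samples(n_a, n_b, n_c, n):
    return remove_duplicates(list(zip(generate(n_a, n), generate(n_b, n), generate(n_c, n))))

if n_desired <= n_a * n_b * n_c // 2:
    result = generate_samples(n_a, n_b, n_c, n_desired)
else:
    excluded = generate_samples(n_a, n_b, n_c, n_a * n_b * n_c - n_desired)
    all_samples = set(itertools.product(range(n_a), range(n_b), range(n_c)))
    result = all_samples - set(excluded)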
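
Taking the complement keeps the counts balanced because every value of a parameter occurs equally often in the full product, so removing a balanced set leaves a balanced set. The sketch below illustrates this with a deterministic excluded set built by plain cyclic indexing (an assumption made only to keep the example short; it is not the resampling procedure above):

```python
import itertools
from collections import Counter

n_a, n_b, n_c = 25, 20, 3
sizes = (n_a, n_b, n_c)
n_excluded = 100

# A balanced, duplicate-free set to exclude, built by cyclic indexing
# (duplicate-free because 100 <= lcm(25, 20, 3) = 300).
excluded = {tuple(i % size for size in sizes) for i in range(n_excluded)}

all_samples = set(itertools.product(*(range(size) for size in sizes)))
result = all_samples - excluded

for axis, name in enumerate("abc"):
    counts = Counter(sample[axis] for sample in result).values()
    print(f"{name}: between {min(counts)} and {max(counts)} times")
# a: between 56 and 56 times
# b: between 70 and 70 times
# c: between 466 and 467 times
```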

Answer 2

This turned out to be a lot more complicated than I expected, and I ended up with quite a bit of code:

from math import prod
from itertools import product
from random import shuffle

def sample(n, ns):
    # make sure parameters are valid
    if n > prod(ns):
        raise ValueError("more values requested than unique combinations", n, ns)

    # "remain" keeps track of the remaining counts for each item
    remain = []
    for n_i in ns:
        k, m = divmod(n, n_i)
        # start with the whole number
        d = {i: k for i in range(n_i)}
        # add in the remainders
        if m:
            r = list(range(n_i))
            shuffle(r)
            for i in r[:m]:
                d[i] += 1
        # sanity check
        assert(sum(d.values()) == n)

        remain.append(d)

    # generate list of all available options in random order
    opts = list(product(*(range(n_i) for n_i in ns)))
    shuffle(opts)

    result = []
    for _ in range(n):
        # get next random item, fails if we've been unlucky
        tup = opts.pop()
        result.append(tup)
        
        # keep track of remaining counts
        for i, (rem, a) in enumerate(zip(remain, tup)):
            j = rem[a]
            if j > 1:
                rem[a] = j - 1
            else:
                del rem[a]
                # remove options that involve a number that's been used up
                opts[:] = filter(lambda t: t[i] != a, opts)

    # we're done
    return result

It can be used as:

x = sample(100, (25, 20, 3))

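Note that this generates all possible options first. For your parameters this seems a reasonable trade-off, but you should not use this algorithm if there are billions of possible options.

Also note that large values of n cause this algorithm to fail; see the plot below.

[Plot: success rate as a function of n]

Feel free to suggest improvements, or just put it in a loop that retries on IndexError!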
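
A retry loop of that kind might look like the sketch below. The wrapper name retry_on_index_error and the max_attempts limit are assumptions added for illustration, and flaky_sampler is only a stand-in so the example runs on its own:

```python
def retry_on_index_error(sampler, max_attempts=100):
    """Call sampler() until it returns without raising IndexError."""
    for _ in range(max_attempts):
        try:
            return sampler()
        except IndexError:
            continue
    raise RuntimeError(f"no valid sample after {max_attempts} attempts")

# Stand-in for sample(): fails twice, then succeeds (for demonstration only).
attempts = []

def flaky_sampler():
    attempts.append(1)
    if len(attempts) < 3:
        raise IndexError("pop from empty list")
    return [(0, 0, 0)]

print(retry_on_index_error(flaky_sampler))  # [(0, 0, 0)]
print(len(attempts))                        # 3
```

With the function from this answer, the call would be retry_on_index_error(lambda: sample(100, (25, 20, 3))).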

Answer 1
1 accepted 2020-12-15 12:38:50

Answer 2
1 2020-12-15 17:52:17
