Fastest way to sample most numbers with minimum difference larger than a value from a Python list

Question

Given a list of 20 floats, I want to find a largest subset in which any two elements differ from each other by more than mindiff = 1. Right now I am using a brute-force method that searches from the largest to the smallest subsets using itertools.combinations. As shown below, the code takes about 4 s to find a subset for a list of 20 numbers.

from itertools import combinations
import random
from time import time

mindiff = 1.
length = 20
random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(length)]

t0 = time()
n = len(lst)
sample = []
found = False
while not found:
    # get all subsets with size n
    subsets = list(combinations(lst, n))
    # shuffle to ensure randomness
    random.shuffle(subsets)
    for subset in subsets:
        # sort the subset numbers
        ss = sorted(subset)
        # check that every two adjacent numbers differ by more than mindiff
        # (all() also accepts a single-element subset, where min() of an empty list would raise)
        if all(j - i > mindiff for i, j in zip(ss[:-1], ss[1:])):
            sample = set(subset)
            found = True
            break
    # no subset of this size works, so try subsets one element smaller
    n -= 1

print(sample)
print(time()-t0)

Output:

{2.3704888087015568, 4.365818049020534, 5.403474619948962, 6.518944556233767, 7.8388969285727015, 9.117993839791751}
4.182451486587524

However, in reality I have a list of 200 numbers, for which brute-force enumeration is infeasible. I want a fast algorithm that samples just one random largest subset whose minimum difference is larger than 1. Note that each sample should be both random and of maximum size. Any suggestions?
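To get a feel for why enumeration stops being an option, here is a rough back-of-the-envelope sketch (not part of the original question) that counts how many subsets of half the list size exist for 20 and for 200 numbers. It assumes Python 3.8+ for math.comb; the variable names are only illustrative.

```python
import math

# Count the subsets of size n // 2, the largest single layer that a
# size-by-size search over combinations may have to walk through.
for n in (20, 200):
    half = n // 2
    layer = math.comb(n, half)
    print(f"n={n}: C({n}, {half}) = {layer:.3e} subsets of size {half}")
    print(f"n={n}: 2**{n} = {2 ** n:.3e} subsets in total")
```

The search in the question also materializes each layer with list(...) before shuffling it, so memory becomes a problem long before running time does.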

Answer 1

My previous answer assumed you simply wanted a single optimal solution, not a uniform random sample from all optimal solutions. This answer assumes you want to sample uniformly from all such optimal solutions.

1. Construct a directed acyclic graph G with one node for each point, and an edge from node a to node b when b - a > mindist (mindist is the mindiff from the question). Also add two virtual nodes, s and t, with edges s -> x and x -> t for every point x.

2. For each node x in G, calculate how many paths of length k lead from x to t. You can do this efficiently in O(n^2 k) time using dynamic programming with a table P[x][k]: initially P[x][0] = 0 for all x except P[t][0] = 1, and then P[x][k] = sum(P[y][k-1] for y in neighbors(x)).
   Keep doing this until you reach the maximum k for which P[s][k] > 0. You now know the size of the optimal subset: a path of length k from s to t passes through k - 1 points.

3. Uniformly sample a path of length k from s to t, using P to weight your choices.
   Start at s. Look at each neighbor y of s and choose one at random, weighted by P[y][k-1], the number of ways the path can be completed from y. This gives the first element of the optimal set.
   Then repeat this step. At the i-th step (counting from 1) we are at some node x; look at the neighbors y of x and pick one at random using the weights P[y][k-i]. Stop when t is reached.

4. Use the nodes you sampled in step 3 as your random subset.
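To make the path-counting step concrete, here is a small worked sketch on a hand-picked four-element list. It is written as a memoized recursion rather than a table, purely for brevity; the names S, T, neighbors and paths are illustrative and not part of the answer's code.

```python
from functools import lru_cache

xs = [1.0, 1.5, 3.0, 3.2]
mindist = 1.0
n = len(xs)
S, T = n, n + 1  # virtual source and sink nodes

def neighbors(x):
    """Nodes reachable from x by one edge of the graph G."""
    if x == T:
        return []
    if x == S:
        return [T] + list(range(n))
    return [T] + [j for j in range(n) if xs[j] - xs[x] > mindist]

@lru_cache(maxsize=None)
def paths(x, k):
    """Number of paths with exactly k edges from node x to T."""
    if k == 0:
        return 1 if x == T else 0
    return sum(paths(y, k - 1) for y in neighbors(x))

counts = [paths(S, k) for k in range(n + 2)]
best_k = max(k for k, c in enumerate(counts) if c > 0)
print(counts)      # number of s-to-t paths for each length k
print(best_k - 1)  # size of a largest valid subset
```

For this list the longest paths have 3 edges, so the largest valid subsets contain 2 numbers, and there are 4 of them: each of 1.0 and 1.5 can be paired with each of 3.0 and 3.2.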

An implementation of the above in pure Python:

import random

def sample_mindist_subset(xs, mindist):
    # Construct directed graph G.
    n = len(xs)
    s = n; t = n + 1  # Two virtual nodes, source and sink.
    neighbors = {
        i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
        for i in range(n)}
    neighbors[s] = [t] + list(range(n))
    neighbors[t] = []

    # Compute number of paths P[x][k] from x to t of length k.
    P = [[0 for _ in range(n+2)] for _ in range(n+2)]
    P[t][0] = 1
    for k in range(1, n+2):
        for x in range(n+2):
            P[x][k] = sum(P[y][k-1] for y in neighbors[x])

    # Sample maximum length path uniformly at random.
    maxk = max(k for k in range(n+2) if P[s][k] > 0)
    path = [s]
    while path[-1] != t:
        candidates = neighbors[path[-1]]
        weights = [P[cn][maxk-len(path)] for cn in candidates]
        path.append(random.choices(candidates, weights)[0])

    return [xs[i] for i in path[1:-1]]

Note that if you want to sample from the same set of numbers many times, you don't have to recompute P every single time; you can compute it once and re-use it.
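One way to re-use P is to wrap the precomputation in a small class, sketched below. The class name MindistSampler and its sample method are hypothetical, not from the answer; the graph construction and weighting follow the same logic as the function above.

```python
import random

class MindistSampler:
    """Precompute the path-count table once, then draw many samples."""

    def __init__(self, xs, mindist):
        self.xs = list(xs)
        n = len(self.xs)
        self.s, self.t = n, n + 1  # virtual source and sink
        self.neighbors = {
            i: [self.t] + [j for j in range(n)
                           if self.xs[j] - self.xs[i] > mindist]
            for i in range(n)}
        self.neighbors[self.s] = [self.t] + list(range(n))
        self.neighbors[self.t] = []

        # P[x][k] = number of paths with k edges from x to t.
        P = [[0] * (n + 2) for _ in range(n + 2)]
        P[self.t][0] = 1
        for k in range(1, n + 2):
            for x in range(n + 2):
                P[x][k] = sum(P[y][k - 1] for y in self.neighbors[x])
        self.P = P
        self.maxk = max(k for k in range(n + 2) if P[self.s][k] > 0)

    def sample(self, rng=random):
        """Draw one maximum-size subset; only this part runs per sample."""
        path = [self.s]
        while path[-1] != self.t:
            candidates = self.neighbors[path[-1]]
            weights = [self.P[c][self.maxk - len(path)] for c in candidates]
            path.append(rng.choices(candidates, weights)[0])
        return [self.xs[i] for i in path[1:-1]]

sampler = MindistSampler([1.0, 1.5, 3.0, 3.2], mindist=1.0)
for _ in range(3):
    print(sampler.sample())
```

Passing a seeded random.Random instance as rng makes the draws reproducible, which is convenient when checking that every optimal subset actually shows up.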

Answer 2

I probably don't fully understand the question, because right now the solution is quite trivial. EDIT: yes, I misunderstood after all. The OP does not just want an optimal solution, but wishes to randomly sample from the set of optimal solutions. This answer is not incorrect, but it answers a different question from the one the OP is interested in.

Simply sort the numbers and greedily construct the subset:

def mindist_subset(xs, mindist):
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result

Sketch of proof of correctness.

Suppose we have a solution S of optimal size for the input array A. If it does not contain min(A), note that we can remove min(S) from S and add min(A) instead, since this only increases the distance between the smallest element of S and the second smallest number in S. Conclusion: we can, without loss of generality, assume that min(A) is part of an optimal solution.

Now we can apply this argument recursively. We add min(A) to the solution and remove all elements too close to min(A), giving the remaining elements A'. Then we're left with a subproblem where exactly the same argument applies: we can choose min(A') as the next element of the solution, and so on.
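The proof sketch can also be checked empirically on small inputs. The snippet below is only a sanity check under assumed test parameters (list sizes up to 8, values in [1, 10], mindist = 1); brute_force_size is a hypothetical helper, and mindist_subset is repeated so the snippet runs on its own.

```python
import random
from itertools import combinations

def mindist_subset(xs, mindist):
    """Greedy construction from the answer: keep x if it is far enough."""
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result

def brute_force_size(xs, mindist):
    """Size of a largest valid subset, found by trying every combination."""
    xs = sorted(xs)
    for size in range(len(xs), 0, -1):
        for combo in combinations(xs, size):
            # combo is sorted, so checking adjacent pairs is sufficient
            if all(b - a > mindist for a, b in zip(combo, combo[1:])):
                return size
    return 0

rng = random.Random(0)
mismatches = 0
for _ in range(200):
    xs = [rng.uniform(1.0, 10.0) for _ in range(rng.randint(0, 8))]
    if len(mindist_subset(xs, 1.0)) != brute_force_size(xs, 1.0):
        mismatches += 1
print("mismatches:", mismatches)
```

Agreement on random small cases is not a proof, but it is a cheap guard against off-by-one mistakes such as using >= where > is intended.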

Answer 1
11, accepted, 2021-06-18 19:08:15

Answer 2
7, 2021-06-18 18:25:31
