Fastest way to sample most numbers with minimum difference larger than a value from a Python list
Given a list of 20 float numbers, I want to find a largest subset in which any two elements differ from each other by more than mindiff = 1. Right now I am using a brute-force method that searches from the largest to the smallest subsets using itertools.combinations. As shown below, the code finds a subset after about 4 s for a list of 20 numbers.
from itertools import combinations
import random
from time import time

mindiff = 1.
length = 20
random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(length)]
t0 = time()
n = len(lst)
sample = []
found = False
while not found:
    # get all subsets with size n
    subsets = list(combinations(lst, n))
    # shuffle to ensure randomness
    random.shuffle(subsets)
    for subset in subsets:
        # sort the subset numbers
        ss = sorted(subset)
        # calculate the differences between every two adjacent numbers
        diffs = [j-i for i, j in zip(ss[:-1], ss[1:])]
        if min(diffs) > mindiff:
            sample = set(subset)
            found = True
            break
    # check subsets with size -1
    n -= 1
print(sample)
print(time()-t0)
Output:
{2.3704888087015568, 4.365818049020534, 5.403474619948962, 6.518944556233767, 7.8388969285727015, 9.117993839791751}
4.182451486587524
However, in reality I have a list of 200 numbers, which makes brute-force enumeration infeasible. I want a fast algorithm to sample just one random largest subset with a minimum difference larger than 1. Note that I want each sample to be random and of maximum size. Any suggestions?
My previous answer assumed you simply wanted a single optimal solution, not a uniform random sample of all solutions. This answer assumes you want one that samples uniformly from all such optimal solutions.
1. Construct a directed acyclic graph G with one node for each point, where there is an edge from node a to node b when b - a > mindist. Also add two virtual nodes, s and t, with edges s -> x and x -> t for all x.
2. Calculate for each node in G how many paths of length k exist to t. You can do this efficiently in O(n^2 k) time using dynamic programming with a table P[x][k], initially filling P[x][0] = 0 except P[t][0] = 1, and then P[x][k] = sum(P[y][k-1] for y in neighbors(x)). Keep doing this until you reach the maximum k, that is, the largest k for which P[s][k] > 0. You now know the size of the optimal subset: k - 1 points, since a path of length k from s to t passes through k - 1 real nodes.
3. Uniformly sample a path of length k from s to t, using P to weight your choices.
   This is done by starting at s. We then look at each neighbor of s and choose one randomly, weighting neighbor y by P[y][k-1]. This gives us our first element of the optimal set.
   We then repeatedly perform this step. We are at x, look at the neighbors of x and pick one randomly, weighting neighbor y by P[y][k-i], where i is the step we're at.
4. Use the nodes you sampled in 3 as your random subset.
An implementation of the above in pure Python:
import random

def sample_mindist_subset(xs, mindist):
    # Construct directed graph G.
    n = len(xs)
    s = n; t = n + 1  # Two virtual nodes, source and sink.
    neighbors = {
        i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
        for i in range(n)}
    neighbors[s] = [t] + list(range(n))
    neighbors[t] = []

    # Compute number of paths P[x][k] from x to t of length k.
    P = [[0 for _ in range(n+2)] for _ in range(n+2)]
    P[t][0] = 1
    for k in range(1, n+2):
        for x in range(n+2):
            P[x][k] = sum(P[y][k-1] for y in neighbors[x])

    # Sample maximum length path uniformly at random.
    maxk = max(k for k in range(n+2) if P[s][k] > 0)
    path = [s]
    while path[-1] != t:
        candidates = neighbors[path[-1]]
        weights = [P[cn][maxk-len(path)] for cn in candidates]
        path.append(random.choices(candidates, weights)[0])

    return [xs[i] for i in path[1:-1]]
Note that if you want to sample from the same set of numbers many times, you don't have to recompute P every single time and can re-use it.
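A minimal sketch of that reuse, assuming the same graph and table layout as the implementation above. The names build_path_counts and sample_from_counts are hypothetical, introduced here only for illustration: the table is built once, and each draw then costs only the random walk from s to t.

```python
import random


def build_path_counts(xs, mindist):
    """Precompute the graph and path-count table once (hypothetical helper)."""
    n = len(xs)
    s, t = n, n + 1  # virtual source and sink
    neighbors = {
        i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
        for i in range(n)
    }
    neighbors[s] = [t] + list(range(n))
    neighbors[t] = []
    # P[x][k] = number of paths of length k from x to t
    P = [[0] * (n + 2) for _ in range(n + 2)]
    P[t][0] = 1
    for k in range(1, n + 2):
        for x in range(n + 2):
            P[x][k] = sum(P[y][k - 1] for y in neighbors[x])
    maxk = max(k for k in range(n + 2) if P[s][k] > 0)
    return neighbors, P, maxk


def sample_from_counts(xs, tables, rng=random):
    """Draw one maximum-size subset using the precomputed tables."""
    neighbors, P, maxk = tables
    n = len(xs)
    s, t = n, n + 1
    path = [s]
    while path[-1] != t:
        candidates = neighbors[path[-1]]
        # Weight each neighbor by the number of ways to finish the path from it.
        weights = [P[c][maxk - len(path)] for c in candidates]
        path.append(rng.choices(candidates, weights)[0])
    return [xs[i] for i in path[1:-1]]


xs = [1.0, 1.5, 2.6, 3.0, 4.2]
tables = build_path_counts(xs, 1.0)                            # done once
samples = [sample_from_counts(xs, tables) for _ in range(5)]   # cheap repeated draws
```

The tables are only valid for the exact xs and mindist they were built from, so they need rebuilding whenever either changes.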
I probably don't fully understand the question, because right now the solution is quite trivial. EDIT: yes, I misunderstood after all: the OP does not just want an optimal solution, but wishes to randomly sample from the set of optimal solutions. This answer is not incorrect, but it answers a different question from the one the OP is interested in.
Simply sort the numbers and greedily construct the subset:
def mindist_subset(xs, mindist):
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result
Sketch of proof of correctness.
Suppose we have a solution S of optimal size for the input array A. If it does not contain min(A), note that we could remove min(S) from S and add min(A) instead, since this keeps the size of S unchanged and can only increase the distance between the smallest and the second smallest number in S. Conclusion: we can without loss of generality assume that min(A) is part of an optimal solution.
Now we can apply this argument recursively. We add min(A) to a solution and remove all elements too close to min(A), giving remaining elements A'. Then we're left with a subproblem where exactly the same argument applies: we can choose min(A') as our next element of the solution, etc.
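The proof sketch above can also be spot-checked empirically. The following is only a sanity check on small random inputs, not a proof: brute_force_max_size is a hypothetical helper written for this illustration, and the input ranges (up to 10 values drawn from [1, 10], mindist = 1) are arbitrary assumptions.

```python
import random
from itertools import combinations


def mindist_subset(xs, mindist):
    """Greedy from the answer: sort, keep each value far enough from the last kept one."""
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result


def brute_force_max_size(xs, mindist):
    """Size of the largest valid subset by exhaustive search (small inputs only)."""
    for size in range(len(xs), 0, -1):
        for subset in combinations(sorted(xs), size):
            # subset is sorted, so checking adjacent gaps is sufficient
            if all(b - a > mindist for a, b in zip(subset, subset[1:])):
                return size
    return 0


rng = random.Random(0)
for _ in range(200):
    xs = [rng.uniform(1.0, 10.0) for _ in range(rng.randint(0, 10))]
    # The greedy result should always match the exhaustive optimum in size.
    assert len(mindist_subset(xs, 1.0)) == brute_force_max_size(xs, 1.0)
```

The exhaustive search is exponential in the list length, so it is only usable as a check for short lists.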