Generating random integers with a difference constraint

I have the following problem:

Generate M uniformly random integers from the range 0-N, where N >> M, such that no pair has a difference less than K, where M >> K.

At the moment, the best method I can think of is to maintain a sorted list: determine the lower bound of each newly generated integer, test it against the neighbouring lower and upper elements, and, if the gaps are acceptable, insert the element in between. This has complexity O(n log n).
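The sorted-list idea described above can be sketched roughly as follows. This is only an illustration under assumptions: the name `rejection_sample`, the inclusive range [0, n], and the `max_tries` guard are choices made here, not part of the question. The guard is there because rejection can get stuck when the range is tightly packed; with N much larger than M*K that should be rare.

```python
import bisect
import random


def rejection_sample(n, m, k, rng=random, max_tries=1_000_000):
    """Draw m integers from [0, n] with pairwise differences >= k.

    Hypothetical helper: candidates are drawn uniformly and rejected
    when they fall closer than k to an already accepted neighbour.
    """
    if m > 0 and n < (m - 1) * k:
        raise ValueError("no valid solution exists")
    chosen = []  # accepted values, kept sorted
    tries = 0
    while len(chosen) < m:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("gave up: the range is probably too tight")
        x = rng.randint(0, n)
        pos = bisect.bisect_left(chosen, x)  # lower bound of x
        ok_left = pos == 0 or x - chosen[pos - 1] >= k
        ok_right = pos == len(chosen) or chosen[pos] - x >= k
        if ok_left and ok_right:
            # list.insert shifts elements, so this step is O(m); a balanced
            # tree would be needed for a true O(log m) insert.
            chosen.insert(pos, x)
    return chosen
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.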

Would there happen to be a more efficient algorithm?

An example of the problem:

Generate 1000 uniformly random integers between zero and 100 million where the difference between any two integers is no less than 1000.

A comprehensive way to solve this would be to:

  1. Determine all the n-choose-m combinations that satisfy the constraint; let's call this set X.
  2. Select a uniformly random integer i in the range [0,|X|).
  3. Select the i'th combination from X as the result.

This solution is problematic when n-choose-m is large, as enumerating and storing all possible combinations will be extremely costly. Hence an efficient online generating solution is sought.
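For very small inputs, the three enumeration steps above can be written down directly, which is handy as a reference when checking a faster method. This is a minimal sketch; the function names and the inclusive range [0, n] are assumptions made for illustration.

```python
import itertools
import random


def enumerate_valid(n, m, k):
    """All sorted m-tuples from [0, n] whose neighbours differ by >= k."""
    return [
        combo
        for combo in itertools.combinations(range(n + 1), m)
        if all(b - a >= k for a, b in zip(combo, combo[1:]))
    ]


def pick_by_enumeration(n, m, k, rng=random):
    """Steps 1-3: build X, draw a uniform index, return that combination."""
    valid = enumerate_valid(n, m, k)
    if not valid:
        raise ValueError("no valid solution exists")
    return list(valid[rng.randrange(len(valid))])
```

The list grows combinatorially, so this is only practical for toy sizes such as n = 10, m = 3.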

Note: The following is a C++ implementation of the solution provided by pentadecagon.

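#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

std::vector<int> generate_random(const int n, const int m, const int k)
{
   // A solution exists only if m values spaced k apart fit into [0, n].
   if ((m <= 0) || (k < 0) || (static_cast<long long>(m - 1) * k > n))
      return std::vector<int>();

   std::random_device source;
   std::mt19937 generator(source());
   // Draw from the range that is left once the mandatory gaps are removed.
   std::uniform_int_distribution<> distribution(0, n - (m - 1) * k);

   std::vector<int> result_list;
   result_list.reserve(m);

   for (int i = 0; i < m; ++i)
   {
      result_list.push_back(distribution(generator));
   }

   std::sort(std::begin(result_list),std::end(result_list));

   // Re-insert the gaps: the i-th smallest value is shifted by i * k.
   for (int i = 0; i < m; ++i)
   {
      result_list[i] += (i * k);
   }

   return result_list;
}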

http://ideone.com/KOeR4R


EDIT: I adapted the text for the requirement to create ordered sequences, each with the same probability.

Create random numbers a_i for i=0..M-1 without duplicates. Sort them. Then create numbers

b_i=a_i + i*(K-1)

Given the construction, those numbers b_i have the required gaps: the sorted a_i already have gaps of at least 1, so consecutive b_i differ by at least 1 + (K-1) = K. In order to make sure those b values cover exactly the required range [1..N], you must ensure the a_i are picked from the range [1..N-(M-1)*(K-1)]. This way you get truly independent numbers. Well, as independent as possible given the required gap. Because of the sorting you get O(M log M) performance again, but this shouldn't be too bad. Sorting is typically very fast. In Python it looks like this:

import random
def random_list( N, M, K ):
    s = set()
    while len(s) < M:
        s.add( random.randint( 1, N-(M-1)*(K-1) ) )

    res = sorted( s )

    for i in range(M):
        res[i] += i * (K-1)

    return res
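As a possible variation, the retry loop over a set can be swapped for `random.sample` on a `range`, which draws M distinct values in one call without materialising the range. This is a sketch of the same idea, not the answer's own code; the name `random_list_sample` and the `rng` parameter are assumptions, and K >= 1 is assumed.

```python
import random


def random_list_sample(N, M, K, rng=random):
    """Same shift-by-i*(K-1) construction, with random.sample for the a_i."""
    reduced = N - (M - 1) * (K - 1)  # size of the range the a_i come from
    if M > reduced:
        raise ValueError("no valid solution exists")
    # M distinct values from [1, reduced], sorted ascending
    a = sorted(rng.sample(range(1, reduced + 1), M))
    # Shift the i-th smallest value by i*(K-1) to open up the gaps
    return [a_i + i * (K - 1) for i, a_i in enumerate(a)]
```

For the example in the question this would be called as `random_list_sample(100_000_000, 1000, 1000)`.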

First off: this will be an attempt to show that there's a bijection between the (M+1)-compositions (with the slight modification that we will allow addends to be 0) of the value N - (M-1)*K and the valid solutions to your problem. After that, we only have to pick one of those compositions uniformly at random and apply the bijection.


Bijection:

Let

N - (M-1)*K = x_0 + x_1 + ... + x_M,  with every x_i ≥ 0.

Then the x_i form an (M+1)-composition (with 0 addends allowed) of the value on the left (notice that the x_i do not have to be monotonically increasing!).

From this we get a valid solution

{m_1, m_2, ..., m_M}

by setting the values m_i as follows:

m_1 = x_0,   m_{i+1} = m_i + K + x_i   for i = 1, ..., M-1.

We see that the distance between m_i and m_{i+1} is at least K, and m_M = x_0 + ... + x_{M-1} + (M-1)*K = N - x_M is at most N (compare the choice of the composition we started out with). This means that every (M+1)-composition that fulfills the conditions above defines exactly one valid solution to your problem. (You'll notice that we only use x_M as a way to make the sum turn out right; we don't use it for the construction of the m_i.)

To see that this gives a bijection, we need to see that the construction can be reversed; for this purpose, let

{m_1, m_2, ..., m_M}

be a given solution fulfilling your conditions. To get the composition this is constructed from, define the x_i as follows:

x_0 = m_1,   x_i = m_{i+1} - m_i - K   for i = 1, ..., M-1,   x_M = N - m_M.

Now first, all x_i are at least 0, so that's alright. To see that they form a valid composition (again, every x_i is allowed to be 0) of the value given above, consider:

x_0 + x_1 + ... + x_M
  = x_0 + (x_1 + ... + x_{M-1}) + x_M
  = m_1 + ((m_2 - m_1 - K) + ... + (m_M - m_{M-1} - K)) + (N - m_M)
  = m_1 + (m_M - m_1) - (M-1)*K + N - m_M
  = N - (M-1)*K

The third equality follows since we have a telescoping sum that cancels out almost all m_i.

So we've seen that the described construction gives a bijection between the described compositions of N - (M-1)*K and the valid solutions to your problem. All we have to do now is pick one of those compositions uniformly at random and apply the construction to get a solution.
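Both directions of the bijection can be sketched in a few lines. The sketch assumes the reading m_1 = x_0 and m_{i+1} = m_i + K + x_i (with x_M = N - m_M only padding the sum), which is how the description above is understood here; the function names are made up for illustration.

```python
def composition_to_solution(x, K):
    """Map a composition [x_0, ..., x_M] to the solution [m_1, ..., m_M]."""
    m = [x[0]]  # m_1 = x_0
    for x_i in x[1:-1]:  # x_M is skipped: it only pads the sum up to the total
        m.append(m[-1] + K + x_i)  # m_{i+1} = m_i + K + x_i
    return m


def solution_to_composition(m, N, K):
    """Inverse map: recover [x_0, ..., x_M] from a sorted solution."""
    x = [m[0]]  # x_0 = m_1
    x += [b - a - K for a, b in zip(m, m[1:])]  # x_i = m_{i+1} - m_i - K
    x.append(N - m[-1])  # x_M = N - m_M
    return x
```

With N = 20, K = 3 and M = 3, the composition [2, 0, 4, 8] of 20 - 2*3 = 14 maps to the solution [2, 5, 12], and the inverse map returns the same composition.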


Picking a composition uniformly at random

Each of the described compositions can be uniquely identified in the following way (this is the stars-and-bars picture): reserve N - (M-1)*K spaces for the unary notation of that value, and another M spaces for M commas. We get an (M+1)-composition of N - (M-1)*K by choosing M of the N - (M-1)*K + M spaces, putting commas there, and filling the rest with |. Then let x_0 be the number of | before the first comma, x_M the number of | after the last comma, and every other x_i the number of | between commas i and i+1.

So all we have to do is pick an M-element subset of the integer interval [1; N - (M-1)*K + M] uniformly at random, which we can do, for example, with the Fisher-Yates shuffle in O(N + M log M) (we need to sort the M delimiters to build the composition), since M*K needs to be in O(N) for any solutions to exist. So if N is bigger than M by at least a logarithmic factor, then this is linear in N.
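The comma-placement idea can be sketched as below. Instead of a full Fisher-Yates shuffle, this sketch uses Python's `random.sample` on a `range`, which picks the subset without building the whole interval; that substitution, the 0-based slot numbering, and the function names are assumptions made for illustration.

```python
import random


def random_composition(total, parts, rng=random):
    """Uniform composition of `total` into `parts` addends, zeros allowed."""
    slots = total + parts - 1  # unary digits plus (parts - 1) commas
    commas = sorted(rng.sample(range(slots), parts - 1))
    # Sentinels before the first slot and after the last one
    bounds = [-1] + commas + [slots]
    # Each addend is the number of | between two neighbouring commas
    return [b - a - 1 for a, b in zip(bounds, bounds[1:])]


def random_solution(N, M, K, rng=random):
    """Draw a composition of N - (M-1)*K and turn it into a solution."""
    x = random_composition(N - (M - 1) * K, M + 1, rng)
    m = [x[0]]
    for x_i in x[1:-1]:  # the last addend only pads the sum
        m.append(m[-1] + K + x_i)
    return m
```

Every M-element subset of slots is equally likely, so every composition, and hence every solution it maps to, should be equally likely as well.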


Note: @DavidEisenstat suggested that there are more space-efficient ways of picking the M-element subset of that interval; I'm not aware of any, I'm afraid.


You can get an error-proof algorithm out of this by doing the simple input validation that the construction above gives us: N ≥ (M-1) * K, and all three values are at least 1 (or 0, if you define the empty set as a valid solution for that case).
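The validation described above might look like this. It is a minimal sketch; the function name and the `allow_zero` switch (for treating the empty set as a valid solution) are hypothetical.

```python
def validate_inputs(N, M, K, allow_zero=False):
    """Raise ValueError unless the construction above can produce a solution."""
    lowest = 0 if allow_zero else 1
    if min(N, M, K) < lowest:
        raise ValueError(f"N, M and K must all be at least {lowest}")
    if N < (M - 1) * K:
        raise ValueError("N must be at least (M-1)*K for a solution to exist")
```

Calling it before sampling keeps the range N - (M-1)*K from going negative.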

Why not do this:

for (int i = 0; i < M; ++i) {
  pick a random number between K and N/M
  add this number to (N/M) * i
}

Now you have M random numbers, distributed evenly along N, all of which differ by at least K. It runs in O(M) time. As an added bonus, the result is already sorted. :-)

EDIT:

Actually, the "pick a random number" part shouldn't be between K and N/M, but between min(K, [K - (N/M * i - previous value)]) and N/M. That would ensure that the differences are still at least K, and not exclude values that should not be missed.

Second EDIT:

Well, the first case shouldn't be between K and N/M - it should be between 0 and N/M. Just like you need special casing for when you get close to the N/M*i border, we need special initial casing.

Aside from that, the issue you brought up in your comments was fair representation, and you're right. As my pseudocode is presented, it currently completely misses the excess between N/M*M and N. It's another edge case; simply change the random values of your last range.

Now, in this case, your distribution will be different for the last range. Since you have more numbers, you have slightly less chance for each number than you do for all the other ranges. My understanding is that because you're using ">>", this shouldn't really impact the distribution, i.e. the difference in size in the sample set should be nominal. But if you want to make it more fair, you divide the excess equally among each range. This makes your initial range calculation more complex: you'll have to augment each range based on how much remainder there is, divided by M.

There are lots of special cases to look out for, but they can all be handled. I kept the pseudocode very basic just to make sure that the general concept came through clearly. If nothing else, it should be a good starting point.
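One possible reading of the segment idea, with the edits folded in, is sketched below. It is an interpretation, not the answer's exact algorithm: each of the M segments of width N // M contributes one value, the lower bound of a segment is raised when the previous pick is too close, and the last segment simply absorbs the remainder up to N. The function name and the requirement N // M >= K are assumptions.

```python
import random


def stratified_sample(N, M, K, rng=random):
    """One value per segment of [0, N], neighbours at least K apart."""
    if M < 1:
        raise ValueError("M must be at least 1")
    width = N // M
    if width < K:
        raise ValueError("each segment (N // M) must be at least K wide")
    result = []
    prev = None
    for i in range(M):
        lo = i * width
        # The last segment absorbs the excess between (N // M) * M and N
        hi = N if i == M - 1 else (i + 1) * width - 1
        if prev is not None:
            lo = max(lo, prev + K)  # keep the gap to the previous pick
        prev = rng.randint(lo, hi)
        result.append(prev)
    return result
```

Because exactly one value lands in each segment, the output is sorted, but it is spread more evenly than a uniform draw over all valid sets would be.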

Third and Final EDIT:

For those worried that the distribution has a forced evenness, I still claim that there's nothing saying it can't. The selection is uniformly distributed in each segment. There is a linear way to keep it uneven, but that also has a trade-off: if one value is selected extremely high (which should be unlikely given a very large N), then all the other values are constrained:

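int prevValue = 0;  // smallest value the next number may take
int maxRange;
for (int i = 0; i < M; ++i) {
    // leave room for the (M - 1 - i) numbers still to come, each K apart
    maxRange = N - (((M - 1) - i) * K) - prevValue;
    int nextValue = random(0, maxRange);  // pseudocode: uniform integer in [0, maxRange]
    prevValue += nextValue;
    store prevValue;                      // pseudocode: record the chosen number
    prevValue += K;
}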

This is still linear and random and allows unevenness, but the bigger prevValue gets, the more constrained the other numbers become. Personally, I prefer my second edit answer, but this is an available option that, given a large enough N, is likely to satisfy all the posted requirements.
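A direct Python transcription of the loop above could look like this. It is a sketch: the function name is made up, `random(0, maxRange)` is taken to mean a uniform integer with both ends included, and N >= (M-1)*K is assumed.

```python
import random


def sequential_sample(N, M, K, rng=random):
    """Pick values left to right, reserving K per value still to come."""
    if N < (M - 1) * K:
        raise ValueError("no valid solution exists")
    values = []
    prev_value = 0  # smallest value the next number may take
    for i in range(M):
        # Room left once the remaining (M - 1 - i) gaps are reserved
        max_range = N - ((M - 1) - i) * K - prev_value
        prev_value += rng.randint(0, max_range)
        values.append(prev_value)
        prev_value += K  # the next number must be at least K further on
    return values
```

The first value is uniform over its whole allowed range, so an early high pick squeezes all later values towards N, which is the trade-off mentioned above.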

Come to think of it, here's one other idea. It requires a lot more data maintenance, but is still O(M) and is probably the most fair distribution:

What you need to do is maintain a vector of your valid data ranges and a vector of probability scales. A valid data range is just the list of high-low values where K is still valid. The idea is you first use the scaled probability to pick a random data range, then you randomly pick a value within that range. You remove the old valid data range and replace it with 0, 1 or 2 new data ranges in the same position, depending on how many are still valid. All of these actions are constant time other than handling the weighted probability, which is O(M), done in a loop M times, so the total should be O(M^2), which should be much better than O(N log N) because N >> M.

Rather than pseudocode, let me work an example using OP's original example:

  • 0th iteration: valid data ranges are from [0...100Mill], and the weight for this range is 1.0.
  • 1st iteration: Randomly pick one element in the one-element vector, then randomly pick one element in that range.
    • If the element is, e.g., 12345678, then we remove the [0...100Mill] and replace it with [0...12344678] and [12346678...100Mill].
    • If the element is, e.g., 500, then we remove the [0...100Mill] and replace it with just [1500...100Mill], since [0...500] is no longer a valid range. The only time we will replace it with 0 ranges is in the unlikely event that you have a range with only one number in it and it gets picked. (In that case, you'll have 3 numbers in a row that are exactly K apart from each other.)
    • The weights for the ranges are their lengths over the total length, e.g. 12344678/(12344678 + (100Mill - 12346678)) and (100Mill - 12346678)/(12344678 + (100Mill - 12346678)).

In the next iterations, you do the same thing: randomly pick a number between 0 and 1 and determine which of the ranges that scale falls into. Then randomly pick a number in that range, and replace your ranges and scales.

By the time it's done, we're no longer acting in O(M), but we're still only dependent on M instead of N. And this actually is both a uniform and a fair distribution.
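The range-splitting procedure worked through above can be sketched as follows. The function name is hypothetical, ranges are stored as inclusive (low, high) pairs, K >= 1 is assumed, and the weights count the integers in each range (length plus one), a small deviation from the lengths used in the example.

```python
import random


def range_split_sample(N, M, K, rng=random):
    """Pick M values from [0, N], splitting the valid ranges after each pick."""
    ranges = [(0, N)]  # inclusive ranges that are still valid
    picked = []
    for _ in range(M):
        if not ranges:
            raise RuntimeError("ran out of valid ranges before picking M values")
        # Weight each range by how many integers it contains
        weights = [hi - lo + 1 for lo, hi in ranges]
        idx = rng.choices(range(len(ranges)), weights=weights)[0]
        lo, hi = ranges[idx]
        value = rng.randint(lo, hi)
        picked.append(value)
        # Replace the old range with the 0, 1 or 2 pieces that stay valid
        pieces = []
        if value - K >= lo:
            pieces.append((lo, value - K))
        if value + K <= hi:
            pieces.append((value + K, hi))
        ranges[idx:idx + 1] = pieces
    return sorted(picked)
```

Recomputing the weights on every iteration costs O(M), which gives the O(M^2) total mentioned above.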

Hope one of these ideas works for you!
