
Generating random integers with a difference constraint

I have the following problem:

Generate M uniformly random integers from the range 0 to N, where N >> M and M >> K, such that no two of them differ by less than K.

At the moment, the best method I can think of is to maintain a sorted list: for each newly generated integer, find its lower bound in the list and test it against the neighbouring lower and upper elements; if both gaps are fine, insert it between them. This has complexity O(M log M).
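For reference, the sorted-list approach described above could be sketched roughly like this in Python. The function name and the `max_tries` guard are assumptions added for illustration; the guard is there because rejection can stall when the range is tightly packed, which should not happen when N >> M*K. Note that inserting into a Python list is itself O(M), so this sketch only reaches O(M log M) with a balanced tree or similar structure.

```python
import bisect
import random

def generate_by_rejection(n, m, k, rng=random, max_tries=1_000_000):
    """Draw m integers from [0, n] whose pairwise differences are all >= k."""
    if m > 0 and (m - 1) * k > n:
        raise ValueError("no valid solution exists")
    chosen = []  # kept sorted at all times
    tries = 0
    while len(chosen) < m:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("too many rejected draws")
        x = rng.randint(0, n)
        pos = bisect.bisect_left(chosen, x)  # index of first element >= x
        # Only the two immediate neighbours can violate the gap constraint.
        if pos > 0 and x - chosen[pos - 1] < k:
            continue
        if pos < len(chosen) and chosen[pos] - x < k:
            continue
        chosen.insert(pos, x)
    return chosen
```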

Would there happen to be a more efficient algorithm?

An example of the problem:

Generate 1000 uniformly random integers between zero and 100 million, where the difference between any two integers is no less than 1000.

A comprehensive way to solve this would be to:

  1. Determine all the n-choose-m combinations that satisfy the constraint; let's call this set X.
  2. Select a uniformly random integer i in the range [0,|X|).
  3. Select the i'th combination from X as the result.

This solution is problematic when the n-choose-m is large, as enumerating and storing all possible combinations will be extremely costly. Hence an efficient online generating solution is sought.
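To make the enumeration idea concrete, here is a minimal Python sketch that is only usable for tiny inputs. The function names are hypothetical; it simply filters all combinations and picks one index uniformly.

```python
import random
from itertools import combinations

def all_valid_sets(n, m, k):
    """Every sorted m-tuple from [0, n] whose consecutive gaps are all >= k."""
    return [c for c in combinations(range(n + 1), m)
            if all(b - a >= k for a, b in zip(c, c[1:]))]

def pick_by_enumeration(n, m, k, rng=random):
    """Steps 1 to 3 above: build X, draw i in [0, |X|), return the i'th entry."""
    x = all_valid_sets(n, m, k)
    return x[rng.randrange(len(x))]
```

For n = 6, m = 2, k = 3 the set X has 10 elements, which already hints at how quickly it grows for realistic inputs.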

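Note: The following is a C++ implementation of the solution provided by pentadecagon:

#include <algorithm>
#include <random>
#include <vector>

std::vector<int> generate_random(const int n, const int m, const int k)
{
   // A solution exists only if m values spaced k apart fit into [0, n].
   if ((m <= 0) || (k < 0) || (static_cast<long long>(m - 1) * k > n))
      return std::vector<int>();

   std::random_device source;
   std::mt19937 generator(source());
   // Draw from the "compressed" range; the gaps are added back below.
   std::uniform_int_distribution<> distribution(0, n - (m - 1) * k);

   std::vector<int> result_list;
   result_list.reserve(m);

   // Duplicates may occur here; after the shift they end up exactly k apart.
   for (int i = 0; i < m; ++i)
   {
      result_list.push_back(distribution(generator));
   }

   std::sort(std::begin(result_list),std::end(result_list));

   // Shift the i-th smallest value by i*k so neighbours differ by at least k.
   for (int i = 0; i < m; ++i)
   {
      result_list[i] += (i * k);
   }

   return result_list;
}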

http://ideone.com/KOeR4R


EDIT: I adapted the text for the requirement to create ordered sequences, each with the same probability.

Create random numbers a_i for i=0..M-1 without duplicates. Sort them. Then create numbers

b_i=a_i + i*(K-1)

Given the construction, the numbers b_i have the required gaps, because the a_i already have gaps of at least 1. To make sure the b values cover exactly the required range [1..N], the a_i must be picked from the range [1..N-(M-1)*(K-1)]. This way you get truly independent numbers, or at least as independent as possible given the required gap. Because of the sorting you get O(M log M) performance again, but this shouldn't be too bad, since sorting is typically very fast. In Python it looks like this:

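import random

def random_list(N, M, K):
    # Upper end of the compressed range the a_i are drawn from.
    top = N - (M - 1) * (K - 1)
    if top < M:
        raise ValueError("no valid list exists for these N, M, K")

    # Collect M distinct values a_i; the set discards duplicates.
    s = set()
    while len(s) < M:
        s.add(random.randint(1, top))

    res = sorted(s)

    # Shift the i-th smallest value by i*(K-1) to open gaps of at least K.
    for i in range(M):
        res[i] += i * (K - 1)

    return res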
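As a sanity check on why the shift b_i = a_i + i*(K-1) is sound, the following sketch brute-forces tiny cases and compares the shifted images of all sorted duplicate-free tuples with the set of all valid answers in [1..N]. The helper names are made up for this illustration.

```python
from itertools import combinations

def shift(a, K):
    """Apply b_i = a_i + i*(K-1) to a sorted tuple a."""
    return tuple(x + i * (K - 1) for i, x in enumerate(a))

def check_shift_bijection(N, M, K):
    """True if shifting hits every valid M-subset of [1..N] exactly once."""
    top = N - (M - 1) * (K - 1)
    sources = list(combinations(range(1, top + 1), M))
    images = {shift(a, K) for a in sources}
    valid = {c for c in combinations(range(1, N + 1), M)
             if all(y - x >= K for x, y in zip(c, c[1:]))}
    # Same set, and no two sources collapsed onto the same image.
    return images == valid and len(images) == len(sources)
```

Since every valid result corresponds to exactly one duplicate-free sorted tuple, picking that tuple uniformly picks the result uniformly.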

First off: this will be an attempt to show that there's a bijection between the (M+1)-compositions (with the slight modification that we allow addends to be 0) of the value N - (M-1)*K and the valid solutions to your problem. After that, we only have to pick one of those compositions uniformly at random and apply the bijection.


Bijection:

Let

$$N - (M-1) \cdot K = x_0 + x_1 + \dots + x_M, \qquad x_i \ge 0 \text{ for all } i.$$

Then the x_i form an (M+1)-composition (with 0 addends allowed) of the value on the left (notice that the x_i do not have to be monotonically increasing!).

From this we get a valid solution

$$\{m_1, m_2, \dots, m_M\} \subseteq \{0, 1, \dots, N\}$$

by setting the values m i as follows:

$$m_1 = x_0, \qquad m_{i+1} = m_i + K + x_i \quad \text{for } 1 \le i \le M-1.$$

We see that the distance between m_i and m_{i+1} is at least K, and m_M is at most N (compare the choice of the composition we started out with). This means that every (M+1)-composition that fulfills the conditions above defines exactly one valid solution to your problem. (You'll notice that we only use x_M as a way to make the sum turn out right; we don't use it for the construction of the m_i.)

To see that this gives a bijection, we need to see that the construction can be reversed; for this purpose, let

$$\{m_1, m_2, \dots, m_M\} \subseteq \{0, 1, \dots, N\}$$

be a given solution fulfilling your conditions. To get the composition this is constructed from, define the x i as follows:

$$x_0 = m_1, \qquad x_i = m_{i+1} - m_i - K \quad \text{for } 1 \le i \le M-1, \qquad x_M = N - m_M.$$

Now first, all x_i are at least 0, so that's alright. To see that they form a valid composition (again, every x_i is allowed to be 0) of the value given above, consider:

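$$\begin{aligned}
\sum_{i=0}^{M} x_i &= m_1 + \sum_{i=1}^{M-1} (m_{i+1} - m_i - K) + (N - m_M) \\
&= m_1 + \sum_{i=1}^{M-1} (m_{i+1} - m_i) - (M-1) \cdot K + N - m_M \\
&= m_1 + (m_M - m_1) - (M-1) \cdot K + N - m_M \\
&= N - (M-1) \cdot K.
\end{aligned}$$

The third equality follows since we have this telescoping sum that cancels out almost all m_i.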

So we've seen that the described construction gives a bijection between the described compositions of N - (M-1)*K and the valid solutions to your problem. All we have to do now is pick one of those compositions uniformly at random and apply the construction to get a solution.
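A small Python sketch of the two directions of this bijection may help. It assumes the indexing m_1 = x_0 and m_{i+1} = m_i + K + x_i, with x_M = N - m_M acting only as padding, which is my reading of the construction; the function names are hypothetical.

```python
def composition_to_solution(x, K):
    """Map (x_0, ..., x_M) to (m_1, ..., m_M); x_M is not used."""
    m = [x[0]]
    for xi in x[1:-1]:
        m.append(m[-1] + K + xi)
    return m

def solution_to_composition(m, N, K):
    """Inverse map: recover (x_0, ..., x_M) from a valid solution."""
    x = [m[0]]
    x += [b - a - K for a, b in zip(m, m[1:])]
    x.append(N - m[-1])
    return x
```

For example, with N = 20 and K = 3 the composition (2, 0, 5, 7) of 14 maps to the solution (2, 5, 13), and mapping that back returns the same composition.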


Picking a composition uniformly at random

Each of the described compositions can be uniquely identified in the following way (this is the classic stars-and-bars encoding): reserve N - (M-1)*K spaces for the unary notation of that value, and another M spaces for M commas. We get an (M+1)-composition of N - (M-1)*K by choosing M of the N - (M-1)*K + M spaces, putting commas there, and filling the rest with |. Then let x_0 be the number of | before the first comma, x_M the number of | after the last comma, and every other x_i the number of | between commas i and i+1. So all we have to do is pick an M-element subset of the integer interval [1; N - (M-1)*K + M] uniformly at random. We can do that, for example, with the Fisher-Yates shuffle in O(N + M log M) (we need to sort the M delimiters to build the composition), since M*K needs to be in O(N) for any solutions to exist. So if N is bigger than M by at least a logarithmic factor, this is linear in N.
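Putting the comma-picking step and the construction together, a hedged Python sketch might look like this. The function name is an assumption, and instead of a full Fisher-Yates shuffle it uses `random.sample` on a `range` object, which Python documents as a space-efficient way to sample from a large integer range.

```python
import random

def random_solution(N, M, K, rng=random):
    """Uniformly pick M integers in [0, N] with pairwise differences >= K."""
    total = N - (M - 1) * K  # the value being composed
    if M < 1 or K < 1 or total < 0:
        raise ValueError("need M >= 1, K >= 1 and N >= (M-1)*K")
    # Choose the positions of the M commas among total + M slots.
    commas = sorted(rng.sample(range(1, total + M + 1), M))
    # x_0 = bars before the first comma, x_i = bars between commas i and i+1.
    x = [commas[0] - 1]
    x += [b - a - 1 for a, b in zip(commas, commas[1:])]
    # x_M (bars after the last comma) is never needed for the m_i.
    m = [x[0]]
    for xi in x[1:]:
        m.append(m[-1] + K + xi)
    return m
```

With the numbers from the question, `random_solution(100_000_000, 1000, 1000)` returns a sorted list of 1000 values.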


Note: @DavidEisenstat suggested that there are more space-efficient ways of picking the M-element subset of that interval; I'm not aware of any, I'm afraid.


You can get an error-proof algorithm out of this by doing the simple input validation we get from the construction above: check that N ≥ (M-1)*K and that all three values are at least 1 (or 0, if you define the empty set as a valid solution for that case).

Why not do this:

for (int i = 0; i < M; ++i) {
  pick a random number between K and N/M;
  add this number to (N/M) * i;
}

Now you have M random numbers, distributed evenly along N, all of which have a difference of at least K. It runs in O(M) time. As an added bonus, it's already sorted. :-)

EDIT:

Actually, the "pick a random number" part shouldn't be between K and N/M, but between max(0, K - (N/M * i - previous value)) and N/M. That ensures that the differences are still at least K without excluding values that should remain possible.

Second EDIT:

Well, the first case shouldn't be between K and N/M - it should be between 0 and N/M. Just like you need special casing for when you get close to the N/M*i border, we need special initial casing.

Aside from that, the issue you brought up in your comments was fair representation, and you're right. As my pseudocode is presented, it currently completely misses the excess between N/M*M and N. It's another edge case; simply change the random values of your last range.

Now, in this case, your distribution will be different for the last range. Since it contains more numbers, each number in it has a slightly smaller chance than the numbers in the other ranges. My understanding is that because you're using ">>", this shouldn't really impact the distribution, i.e. the difference in size in the sample set should be nominal. But if you want to make it more fair, divide the excess equally among the ranges. This makes your initial range calculation more complex: you'll have to augment each range based on how much remainder there is divided by M.

There are lots of special cases to look out for, but they're all able to be handled. I kept the pseudocode very basic just to make sure that the general concept came through clearly. If nothing else, it should be a good starting point.
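For what it's worth, the segment idea with the special cases from the edits could be sketched in Python as below. This is only one possible reading: the function name is made up, the last segment simply absorbs the leftover up to N, and the result is deliberately not uniform over all valid sets because exactly one value lands in each segment.

```python
import random

def one_per_segment(N, M, K, rng=random):
    """Pick one value per segment of width N // M, keeping gaps >= K."""
    seg = N // M
    if M < 1 or seg < K:
        raise ValueError("each segment must be at least K wide")
    result = []
    for i in range(M):
        start = i * seg
        # The last segment is stretched to cover the excess up to N.
        end = N if i == M - 1 else start + seg - 1
        # First segment starts at 0; later ones must respect the previous pick.
        lo = start if i == 0 else max(start, result[-1] + K)
        result.append(rng.randint(lo, end))
    return result
```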

Third and Final EDIT:

For those worried that the distribution has a forced evenness, I still claim that there's nothing saying it can't. The selection is uniformly distributed in each segment. There is a linear way to keep it uneven, but that also has a trade-off: if one value is selected extremely high (which should be unlikely given a very large N), then all the other values are constrained:

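int prevValue = 0;   // smallest value the next pick may take
int maxRange;        // how far above prevValue the next pick may go
for (int i = 0; i < M; ++i) {
    // leave room for the (M - 1) - i values that still have to follow
    maxRange = N - (((M - 1) - i) * K) - prevValue;
    int nextValue = random(0, maxRange);  // pseudocode: uniform integer in [0, maxRange]
    prevValue += nextValue;
    store prevValue;                      // pseudocode: this is the i-th result
    prevValue += K;                       // enforce the minimum gap for the next pick
}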

This is still linear and random and allows unevenness, but the bigger prevValue gets, the more constrained the other numbers become. Personally, I prefer my second edit answer, but this is an available option that given a large enough N is likely to satisfy all the posted requirements.
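A runnable Python rendering of that sequential idea, under the assumption that `random(0, maxRange)` means a uniform integer in the closed interval; the function name is hypothetical. As noted above, early large picks squeeze the later ones, so this is not uniform over all valid sets.

```python
import random

def sequential_pick(N, M, K, rng=random):
    """Pick values left to right, always leaving room for the rest."""
    if M < 1 or (M - 1) * K > N:
        raise ValueError("no valid solution exists")
    result = []
    low = 0  # smallest value the next pick may take
    for i in range(M):
        # Highest value that still leaves room for the remaining picks.
        high = N - (M - 1 - i) * K
        value = rng.randint(low, high)
        result.append(value)
        low = value + K  # enforce the minimum gap
    return result
```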

Come to think of it, here's one other idea. It requires a lot more data maintenance, but its running time still depends only on M rather than N, and it is probably the most fair distribution:

What you need to do is maintain a vector of your valid data ranges and a vector of probability scales. A valid data range is just the list of high-low values where K is still valid. The idea is you first use the scaled probability to pick a random data range, then you randomly pick a value within that range. You remove the old valid data range and replace it with 0, 1 or 2 new data ranges in the same position, depending on how many are still valid. All of these actions are constant time other than handling the weighted probability, which is O(M), done in a loop M times, so the total should be O(M^2), which should be much better than O(NlogN) because N >> M.

Rather than pseudocode, let me work an example using OP's original example:

  • 0th iteration: valid data ranges are from [0...100Mill], and the weight for this range is 1.0.
  • 1st iteration: Randomly pick one element in the one element vector, then randomly pick one element in that range.
    • If the element is, e.g., 12345678, then we remove the [0...100Mill] and replace it with [0...12344678] and [12346678...100Mill].
    • If the element is, e.g., 500, then we remove the [0...100Mill] and replace it with just [1500...100Mill], since [0...500] is no longer a valid range. The only time we will replace it with 0 ranges is in the unlikely event that you have a range with only one number in it and it gets picked. (In that case, you'll have 3 numbers in a row that are exactly K apart from each other.)
    • The weights of the ranges are their lengths over the total length, e.g. 12344678/(12344678 + (100Mill - 12346678)) and (100Mill - 12346678)/(12344678 + (100Mill - 12346678)).

In the next iterations, you do the same thing: randomly pick a number between 0 and 1 and determine which of the ranges that scale falls into. Then randomly pick a number in that range, and replace your ranges and scales.

By the time it's done, we're no longer in O(M), but the running time still depends only on M instead of N. And this actually is both a uniform and a fair distribution.
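Since the worked example stands in for pseudocode, here is a rough Python sketch of the same range-splitting procedure. The names are invented for illustration, ranges are inclusive, and each range is weighted by how many integers it contains.

```python
import random

def split_range(lo, hi, v, K):
    """Replace [lo, hi] by the 0, 1 or 2 sub-ranges still valid after picking v."""
    parts = []
    if v - K >= lo:
        parts.append((lo, v - K))
    if v + K <= hi:
        parts.append((v + K, hi))
    return parts

def pick_by_splitting(N, M, K, rng=random):
    ranges = [(0, N)]  # valid data ranges, kept in ascending order
    picked = []
    for _ in range(M):
        if not ranges:
            raise RuntimeError("ran out of valid ranges")
        # Weighted choice of a range: the O(M) step in each iteration.
        weights = [hi - lo + 1 for lo, hi in ranges]
        idx = rng.choices(range(len(ranges)), weights=weights)[0]
        lo, hi = ranges[idx]
        v = rng.randint(lo, hi)
        picked.append(v)
        ranges[idx:idx + 1] = split_range(lo, hi, v, K)
    return sorted(picked)
```

Calling `split_range(0, 100_000_000, 12345678, 1000)` gives the two ranges from the example, and picking 500 leaves only `(1500, 100000000)`.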

Hope one of these ideas works for you!
