
How to efficiently generate a set of unique random numbers with a predefined distribution?

I have a map of items with some probability distribution:

Map<SingleObjectiveItem, Double> itemsDistribution;

Given a certain m, I have to generate a Set of m elements sampled from the above distribution.

So far I have been using the naive way of doing it:

while (mySet.size() < m)
    mySet.add(getNextSample(itemsDistribution));

The getNextSample(...) method fetches an object from the distribution according to its probability. Now, as m increases, the performance suffers severely. For m = 500 and itemsDistribution.size() = 1000 elements, there is too much thrashing and the function stays in the while loop for too long. Generate 1000 such sets and you have an application that crawls.

Is there a more efficient way to generate a unique set of random numbers with a "predefined" distribution? Most collection-shuffling techniques and the like are only uniformly random. What would be a good way to address this?

UPDATE: The loop will call getNextSample(...) "at least" 1 + 2 + 3 + ... + m = m(m+1)/2 times. That is, on the first run we will definitely get a sample for the set; on the 2nd iteration it may be called at least twice, and so on. If getNextSample is sequential in nature, i.e., it goes through the entire cumulative distribution to find the sample, then the run-time complexity of the loop is at least n*m(m+1)/2, where n is the number of elements in the distribution. If m = cn with 0 < c <= 1, the loop is at least Ω(n^3). And that is only a lower bound!

If we replace the sequential search with a binary search, the complexity would still be at least Ω(n^2 log n). More efficient, but perhaps not by a large margin.

Also, removing items from the distribution is not possible, since I call the above loop k times to generate k such sets. These sets are part of a randomized 'schedule' of items. Hence a 'set' of items.

Start out by generating a number of random points in two dimensions.

(image: uniformly random points scattered in two dimensions)

Then apply your distribution

(image: the distribution curve drawn over the random points)

Now find all points that fall within the distribution (below the curve) and pick their x coordinates, and you have your random numbers with the requested distribution, like this:

(image: the accepted points; their x coordinates follow the requested distribution)
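
This technique is rejection sampling (as another answer below notes). A minimal Java sketch of it, assuming a density p(x) on [x0, x1] whose values never exceed some bound pMax; all names here are illustrative, not taken from the answer above:

import java.util.Random;

// Rejection sampling: throw uniform points at the bounding box of the
// density curve and keep the x coordinates of points that land under it.
class RejectionSampler {
    static double sample(double x0, double x1, double pMax, Random rng) {
        while (true) {
            double x = x0 + rng.nextDouble() * (x1 - x0); // random x in [x0, x1)
            double y = rng.nextDouble() * pMax;           // random y in [0, pMax)
            if (y < p(x)) return x;                       // under the curve: accept
        }
    }

    // Illustrative density (standard normal); replace with your own,
    // keeping the invariant p(x) <= pMax.
    static double p(double x) {
        return Math.exp(-x * x / 2) / Math.sqrt(2 * Math.PI);
    }
}

The expected number of iterations per sample is the ratio of the bounding-box area to the area under the curve, so a tight pMax keeps the rejection rate low.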

The problem is unlikely to be the loop you show:

Let n be the size of the distribution, and I the number of invocations of getNextSample. We have I = sum_i(C_i), where C_i is the number of invocations of getNextSample while the set has size i. To find E[C_i], observe that C_i is the inter-arrival time of a Poisson process with λ = 1 - i/n, and is therefore exponentially distributed with rate λ. Hence E[C_i] = 1/λ = 1/(1 - i/n) <= 1/(1 - m/n), and therefore E[I] < m / (1 - m/n).

That is, sampling a set of size m = n/2 will take, on average, fewer than 2m = n invocations of getNextSample. If that is "slow" and "crawls", it is likely because getNextSample itself is slow. This is actually unsurprising, given the unsuitable way the distribution is passed to the method (the method must, of necessity, iterate over the entire distribution to find a random element).

The following should be faster (if m < 0.8 n)

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Distribution<T> {
    private final double[] cumulativeWeight;
    private final T[] item;
    private double totalWeight;

    @SuppressWarnings("unchecked")
    Distribution(Map<T, Double> probabilityMap) {
        int i = 0;

        cumulativeWeight = new double[probabilityMap.size()];
        item = (T[]) new Object[probabilityMap.size()];

        for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
            item[i] = entry.getKey();
            totalWeight += entry.getValue();
            cumulativeWeight[i] = totalWeight; // running sum of weights
            i++;
        }
    }

    T randomItem() {
        double weight = Math.random() * totalWeight;
        int index = Arrays.binarySearch(cumulativeWeight, weight);
        if (index < 0) {
            // binarySearch returns -(insertionPoint) - 1 when the key is absent
            index = -index - 1;
        }
        return item[index];
    }

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) {
            set.add(randomItem()); // duplicates are simply rejected by the set
        }
        return set;
    }
}



import java.util.HashMap;
import java.util.Set;

public class Test {

    public static void main(String[] args) {
        int max = 1_000_000;
        HashMap<Integer, Double> probabilities = new HashMap<>();
        for (int i = 0; i < max; i++) {
            probabilities.put(i, (double) i);
        }

        Distribution<Integer> d = new Distribution<>(probabilities);
        Set<Integer> set = d.randomSubset(max / 2);
        //System.out.println(set);
    }
}

The expected runtime is O(m / (1 - m/n) * log n). On my computer, a subset of size 500_000 out of a set of 1_000_000 is computed in about 3 seconds.

As we can see, the expected runtime approaches infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more complex approach should work better:

Set<T> randomSubset(int size) {
    Set<T> set = new HashSet<>();
    while (set.size() < size) {
        T randomItem = randomItem();
        remove(randomItem); // removes the item from the distribution
        set.add(randomItem);
    }
    return set;
}

To implement remove efficiently requires a different representation of the distribution, for instance a binary tree where each node stores the total weight of the subtree it roots.

But that is rather complicated, so I wouldn't go that route if m is known to be significantly smaller than n. (A sketch of the tree idea follows below.)
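
For the m close to n case, here is a minimal sketch of that idea using a Fenwick (binary indexed) tree over the item weights. The class and method names are illustrative assumptions, not part of the answer above; drawing and removing are both O(log n):

import java.util.Random;

// Weighted sampling without replacement: a Fenwick tree keeps partial
// weight sums, so we can both sample and remove an item in O(log n).
class RemovalSampler {
    private final double[] tree;   // Fenwick tree over item weights (1-based)
    private final double[] weight; // current weight of each item
    private double totalWeight;
    private final int n;
    private final Random rng = new Random();

    RemovalSampler(double[] weights) {
        n = weights.length;
        weight = weights.clone();
        tree = new double[n + 1];
        for (int i = 0; i < n; i++) {
            update(i, weights[i]);
            totalWeight += weights[i];
        }
    }

    // Add delta to item i's weight in the tree.
    private void update(int i, double delta) {
        for (int j = i + 1; j <= n; j += j & -j) tree[j] += delta;
    }

    // Descend the implicit tree to find the item whose cumulative-weight
    // interval contains target.
    private int find(double target) {
        int idx = 0;
        for (int bit = Integer.highestOneBit(n); bit != 0; bit >>= 1) {
            int next = idx + bit;
            if (next <= n && tree[next] <= target) {
                target -= tree[next];
                idx = next;
            }
        }
        return idx; // 0-based index of the selected item
    }

    // Sample one index proportionally to its weight, then zero that
    // weight so the same index can never be drawn again.
    int drawAndRemove() {
        int i = find(rng.nextDouble() * totalWeight);
        totalWeight -= weight[i];
        update(i, -weight[i]);
        weight[i] = 0;
        return i;
    }
}

Calling drawAndRemove() m times yields exactly m distinct indices, so the run time no longer degrades as m approaches n.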

You should implement your own random number generator (using a Monte Carlo method or any good uniform generator such as the Mersenne Twister) based on the inversion method (here).

For example, for the exponential distribution: generate a uniform random number u in [0,1]; your exponentially distributed random variable is then ln(1-u)/(-lambda), where lambda is the rate parameter of the exponential distribution and ln is the natural logarithm.
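
A minimal sketch of that inversion step in Java (the value lambda = 2.0 is an arbitrary example, not from the answer):

import java.util.Random;

// Inverse-transform sampling for the exponential distribution.
public class ExponentialSampler {
    public static void main(String[] args) {
        Random rng = new Random();
        double lambda = 2.0; // example rate parameter

        for (int i = 0; i < 5; i++) {
            double u = rng.nextDouble();          // uniform in [0, 1)
            double x = Math.log(1 - u) / -lambda; // inverse CDF of Exp(lambda)
            System.out.println(x);
        }
    }
}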

Hope it'll help ;).

If you are not too concerned about the randomness properties, then I do it like this:

  1. create a buffer for pseudo-random numbers

    double buff[MAX]; // [edit1] double pseudo-random numbers

    • MAX is the size and should be big enough ... 1024*128 for example
    • the type can be anything (float, int, DWORD ...)
  2. fill the buffer with numbers

    you have a range of numbers x = < x0,x1 > and a probability function probability(x) defined by your probability distribution, so do this:

     for (i=0,x=x0;x<=x1;x+=stepx)
      for (j=0,n=probability(x)*MAX,q=0.1*stepx/n;j<n;j++,i++)
       buff[i]=x+(double(j)*q); // [edit1] unique pseudo-random numbers

    The stepx is your accuracy for the items (for integral types it is 1). Now the buff[] array has the distribution you need, but it is not pseudo-random. You should also add a check that i is not >= MAX, to avoid array overruns; and at the end, the real size of buff[] is i (it can be less than MAX due to rounding).

  3. shuffle buff[]

    do just a few loops of swapping buff[i] and buff[j], where i is the loop variable and j is pseudo-random in <0, MAX)

  4. write your pseudo-random function

    it just returns numbers from the buffer: the first call returns buff[0], the second buff[1], and so on. With standard generators, when you hit the end of buff[] you shuffle buff[] again and start from buff[0] again. But as you need unique numbers, you must never reach the end of the buffer, so set MAX big enough for your task, otherwise uniqueness will not be assured.

[Notes]

MAX should be big enough to store the whole distribution you want. If it is not big enough, then items with low probability may be missing completely.

[edit1] - tweaked the answer a little to match the question's needs (pointed out by meriton, thanks)

PS. The complexity of initialization is O(N), and getting a number is O(1).
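
A minimal Java sketch of this buffer approach (the names buff, MAX, probability, and stepx mirror the pseudocode above; the placeholder probability function is an assumption for illustration):

import java.util.Random;

// Precomputed-buffer sampling: fill buff[] to match the distribution,
// shuffle it once, then hand out entries one by one.
public class BufferSampler {
    static final int MAX = 1024 * 128;
    final double[] buff = new double[MAX];
    int size = 0; // how much of buff[] was actually filled
    int next = 0; // read position of the next sample
    final Random rng = new Random();

    BufferSampler(double x0, double x1, double stepx) {
        // Each value x appears roughly probability(x)*MAX times,
        // offset by j*q so that all stored numbers are distinct.
        for (double x = x0; x <= x1 && size < MAX; x += stepx) {
            int n = (int) (probability(x) * MAX);
            if (n <= 0) continue;
            double q = 0.1 * stepx / n;
            for (int j = 0; j < n && size < MAX; j++)
                buff[size++] = x + j * q;
        }
        shuffle();
    }

    // Placeholder density (triangle on [-1, 1]); plug in your own here.
    static double probability(double x) {
        return Math.max(0, 1 - Math.abs(x));
    }

    void shuffle() { // Fisher-Yates over the filled part of the buffer
        for (int i = size - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            double tmp = buff[i]; buff[i] = buff[j]; buff[j] = tmp;
        }
    }

    // O(1) per sample; unique as long as no more than `size` are requested.
    double nextSample() {
        return buff[next++];
    }
}

With this setup, initialization is O(MAX) and each sample is O(1), matching the complexity stated in the PS above.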

I think you have two problems:

  1. Your itemDistribution doesn't know you need a set, so when the set you are building gets large, you will pick many elements that are already in the set. If you start with a full set and remove elements instead, you will run into the same problem for very small sets.

    Is there a reason why you don't remove the element from the itemDistribution after you have picked it? Then you wouldn't pick the same element twice.

  2. The choice of data structure for itemDistribution looks suspicious to me. You want the getNextSample operation to be fast. Doesn't the map from values to probabilities force you to iterate through large parts of the map for each getNextSample? I'm no good at statistics, but couldn't you represent the itemDistribution the other way around, e.g. as a map from the probability (or from the sum of all smaller probabilities plus the element's own probability) to an element of the set?

Your performance depends on how your getNextSample function works. If you have to iterate over all probabilities when you pick the next item, it might be slow.

A good way to pick several unique random items from a list is to first shuffle the list and then pop items off it. You can shuffle the list once with the given distribution. From then on, picking your m items is just popping them off the list.

Here's an implementation of a probabilistic shuffle:

List<Item> prob_shuffle(Map<Item, int> dist)
{
    int n = dist.length;
    List<Item> a = dist.keys();
    int psum = 0;
    int i, j;

    for (i in dist) psum += dist[i];

    for (i = 0; i < n; i++) {
        int ip = rand(psum);    // 0 <= ip < psum
        int jp = 0;

        for (j = i; j < n; j++) {
            jp += dist[a[j]];
            if (ip < jp) break;
        }

        psum -= dist[a[j]];

        Item tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
    return a;
}

This is not Java but pseudocode after an implementation in C, so please take it with a grain of salt. The idea is to append items to the shuffled area by continuously picking items from the unshuffled area.

Here, I used integer probabilities. (The probabilities don't have to add up to any special value; it's just "bigger is better".) You can use floating-point numbers, but because of inaccuracies you might end up going beyond the array when picking an item; you should then use item n - 1. If you add that safety net, you could even have items with zero probability that always get picked last. A Java adaptation of the pseudocode is sketched below.
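
For reference, a direct Java adaptation of the pseudocode above (a sketch: a generic element type T replaces Item, and a guard is added for the case where only zero-weight items remain):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

class ProbShuffle {
    // Shuffle the keys of dist so that heavier items tend to come first.
    static <T> List<T> probShuffle(Map<T, Integer> dist, Random rng) {
        List<T> a = new ArrayList<>(dist.keySet());
        int n = a.size();
        int psum = 0;
        for (int w : dist.values()) psum += w; // total weight of unshuffled area

        for (int i = 0; i < n; i++) {
            if (psum <= 0) break;               // only zero-weight items remain
            int ip = rng.nextInt(psum);         // 0 <= ip < psum
            int jp = 0;
            int j = i;
            for (; j < n; j++) {                // walk the unshuffled area
                jp += dist.get(a.get(j));
                if (ip < jp) break;             // found the picked item
            }
            psum -= dist.get(a.get(j));         // shrink the unshuffled weight
            Collections.swap(a, i, j);          // append to the shuffled area
        }
        return a;
    }
}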

There might be a method to speed up the picking loop, but I don't really see how: the swapping renders any precalculation useless.

Accumulate your probabilities in a table

               Probability
Item       Actual  Accumulated
Item1       0.10      0.10
Item2       0.30      0.40
Item3       0.15      0.55
Item4       0.20      0.75
Item5       0.25      1.00

Generate a random number between 0.0 and 1.0 and do a binary search for the first item whose accumulated sum is greater than the generated number; that item is then chosen with the desired probability. For example, a draw of 0.62 falls between 0.55 and 0.75 in the table above, so Item4 is selected.

Ebbe's method is called rejection sampling.

I sometimes use a simple method using an inverse cumulative distribution function, which is a function that maps a number X between 0 and 1 onto the Y axis. You just generate a uniformly distributed random number between 0 and 1 and apply the function to it. That function is also called the "quantile function".

For example, suppose you want to generate a normally distributed random number. Its cumulative distribution function is called Phi, and the inverse of that is called the probit function. There are many ways to generate normal variates, and this is just one example.

You can easily construct an approximate cumulative distribution function for any univariate distribution you like, in the form of a table. You can then invert it by table lookup and interpolation.
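
A minimal sketch of that table-based inversion in Java, assuming xs and cdf are parallel arrays with cdf strictly increasing from 0.0 to 1.0 (the names are illustrative):

import java.util.Random;

// Approximate inverse-CDF sampling by table lookup + linear interpolation.
// xs[i] are sample points of the variable, cdf[i] the CDF values at xs[i];
// assumes cdf[0] == 0.0, cdf[cdf.length - 1] == 1.0, strictly increasing.
class TableQuantile {
    static double sample(double[] xs, double[] cdf, Random rng) {
        double u = rng.nextDouble(); // uniform in [0, 1)
        int lo = 0, hi = cdf.length - 1;
        while (lo + 1 < hi) {        // binary search for the bracketing interval
            int mid = (lo + hi) / 2;
            if (cdf[mid] <= u) lo = mid; else hi = mid;
        }
        // Linear interpolation between the two table entries around u.
        double t = (u - cdf[lo]) / (cdf[hi] - cdf[lo]);
        return xs[lo] + t * (xs[hi] - xs[lo]);
    }
}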
