
How to weight a random number based on an array

I've been thinking about how to implement something that, frankly, is beyond my mathematical skills. So here goes: feel free to point me in the right direction rather than give complete code solutions. I'd be grateful for any help.

So, imagine I've done an analysis of text and generated a table of the frequencies of different two-character combinations. I've stored these in a 26x26 array, e.g.:

  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 15 (frequency of AA, then frequency of AB etc.)
B 12 0 (freq of BA, BB etc..)
... etc.

So I want to randomly choose these two-character combinations, but I'd like to 'weight' my choice based on the frequency, i.e. the AB from above should be 15 times 'more likely' than AA. And, obviously, the selection should never return something like BB (i.e. a frequency of 0; this is only an example, since BB obviously does occur in words like Bubble!! :-) ). For the 0 case I realise I could loop until I get a non-0 frequency, but that's just not elegant, because I have a feeling/intuition that there is a way to skew my average.

I was thinking that to choose the first char of my pair, i.e. the row (I'm ultimately generating a 4-pair sequence), I could just use the system random function (Random class.Next) and then use the 'weighted' random algorithm to pick the second char.

Any ideas?

Given your example sample, I would first create a cumulative series of all of the numbers (1, 15, 12, 0 => 1, 16, 28, 28).

Then I would produce a random number between 0 and 27 inclusive (let's say 19).

Then I would calculate that 19 was >=16 but <28, giving me bucket 3 (BA).
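The cumulative-series lookup above can be sketched in Python roughly as follows. The function names `pick_bucket` and `weighted_index` are made up for illustration, and indices are 0-based, so "bucket 3" in the answer corresponds to index 2 here.

```python
import bisect
import itertools
import random

def pick_bucket(weights, r):
    """Return the 0-based index of the bucket that the integer r falls into."""
    # Running totals, e.g. [1, 15, 12, 0] -> [1, 16, 28, 28]
    cumulative = list(itertools.accumulate(weights))
    # bisect_right finds the first running total strictly greater than r,
    # so a zero-weight bucket (which adds nothing to the total) is never hit.
    return bisect.bisect_right(cumulative, r)

def weighted_index(weights, rng=random):
    """Draw r uniformly from 0 .. sum(weights) - 1 and look up its bucket."""
    return pick_bucket(weights, rng.randrange(sum(weights)))

print(pick_bucket([1, 15, 12, 0], 19))  # 16 <= 19 < 28, so index 2 (BA)
```

Because the running totals are sorted, the binary search keeps each lookup at O(log n) if the cumulative list is computed once and reused.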

There are some good suggestions in the other answers for your specific problem. To solve the general problem of "I have a source of random numbers conforming to a uniform probability distribution, but I would like it to conform to a given nonuniform probability distribution", you can work out the quantile function, which is the function that performs that transformation. I give a gentle introduction that explains why the quantile function is the function you want here:

Generating Random Non-Uniform Data In C#
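As a small illustration of the quantile-function idea, here is a sketch for the exponential distribution, whose inverse CDF has a simple closed form. The choice of distribution and the names `exponential_quantile` and `sample_exponential` are assumptions made for this example; they are not taken from the linked article.

```python
import math
import random

def exponential_quantile(u, rate):
    """Inverse CDF of the exponential distribution: solves u = 1 - exp(-rate * x) for x."""
    return -math.log(1.0 - u) / rate

def sample_exponential(rate, rng=random):
    """Turn one uniform draw into one exponentially distributed draw."""
    # rng.random() is uniform on [0, 1), so 1 - u lies in (0, 1] and the log is defined.
    return exponential_quantile(rng.random(), rate)

rng = random.Random(42)  # seeded only so the run is repeatable
samples = [sample_exponential(2.0, rng) for _ in range(10_000)]
print(sum(samples) / len(samples))  # should land near the true mean 1 / rate = 0.5
```

For a discrete table such as the letter-pair frequencies, the quantile function reduces to a lookup in the cumulative totals.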

How about summing all the frequencies and using that total, running from AA to ZZ, to generate your pair?

Let's say you have a total frequency of pairs: if the rnd returns 0 you get AA, if it returns 1-15 then it's AB, etc.
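One way to read this suggestion is to walk the table from AA to ZZ, subtracting each frequency from the random number until it falls inside a pair's range. The sketch below assumes the table is a list of rows of integer counts; `pair_for` and `random_pair` are illustrative names, and the 2x2 table is just the corner of the question's example.

```python
import random
import string

LETTERS = string.ascii_uppercase

def pair_for(freq, r):
    """Walk the table from AA towards ZZ and return the pair whose range contains r."""
    for row, row_freqs in enumerate(freq):
        for col, f in enumerate(row_freqs):
            if r < f:
                return LETTERS[row] + LETTERS[col]
            r -= f  # skip past this pair's range
    raise ValueError("r must be smaller than the total frequency")

def random_pair(freq, rng=random):
    """Draw r from 0 .. total - 1 and map it to a pair."""
    total = sum(sum(row) for row in freq)
    return pair_for(freq, rng.randrange(total))

corner = [[1, 15],   # AA, AB
          [12, 0]]   # BA, BB
print(pair_for(corner, 0), pair_for(corner, 15), pair_for(corner, 16))  # AA AB BA
```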

Use your frequency matrix to generate a complete set of values, with each pair repeated as many times as its frequency. Order the set by Random.Next(). Store the randomized set in an array. Then you can just select an element out of that array using Random.Next(randomarray.Length).

If there is a mathematical way to calculate the frequency you could do that as well. But creating a precomputed and cached set will reduce the calculation time if this is called repeatedly.

As a note, depending on the max frequency this could require a good amount of storage. You would also want to create the instance of Random before you loop to build the set, so that you don't reseed the random generator.
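A rough Python sketch of this precomputed-pool idea, under the assumption that each pair is repeated once per unit of frequency; `build_pool` is a hypothetical helper and the small table stands in for the full 26x26 one.

```python
import random

def build_pool(freq_by_pair):
    """Repeat each pair once per unit of frequency; zero-frequency pairs drop out."""
    pool = []
    for pair, f in freq_by_pair.items():
        pool.extend([pair] * f)
    return pool

rng = random.Random()  # created once and reused, as the answer recommends
pool = build_pool({"AA": 1, "AB": 15, "BA": 12, "BB": 0})
rng.shuffle(pool)      # mirrors the "order by Random.Next()" step
choice = pool[rng.randrange(len(pool))]
print(len(pool), choice)
```

Since the final index is already uniform, the shuffle should not change the probabilities; it is kept here only to follow the answer's steps. The pool holds one entry per unit of frequency, which is where the storage cost comes from.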

...

Another way (similar to what you suggested at the end of your question) would be to do this in two passes, with the first selecting the row and the second using your weighted frequency to select the column. The range for the second pass would just be the sum of the frequencies in that row. The first suggestion should give a more even distribution based on weight.
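A sketch of the two-pass variant in Python, using the standard library's `random.choices`. One assumption is added here: the first pass weights each row by its row total. With that weighting the overall odds of a pair match the single-pass approach; picking the row uniformly would instead over-represent pairs from rarely used rows. `two_pass_pair` is an illustrative name.

```python
import random
import string
from collections import Counter

LETTERS = string.ascii_uppercase

def two_pass_pair(freq, rng=random):
    """Pick a row weighted by its total, then a column weighted within that row."""
    row_totals = [sum(row) for row in freq]
    row = rng.choices(range(len(freq)), weights=row_totals)[0]
    col = rng.choices(range(len(freq[row])), weights=freq[row])[0]
    return LETTERS[row] + LETTERS[col]

corner = [[1, 15],   # AA, AB
          [12, 0]]   # BA, BB
rng = random.Random(7)  # seeded only so the run is repeatable
print(Counter(two_pass_pair(corner, rng) for _ in range(5000)))
```

A row whose total is 0 is never chosen in the first pass, so the second pass never sees an all-zero weight list.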

Take the sum of the probabilities. Take a random number from zero up to, but not including, that sum. Add up the probabilities until the running total is greater than your random number. Then use the item you're on.

E.g. pseudocode:

b = getProbabilities()
s = sum(b)
r = randomInt() % s    // 0 <= r < s
i = 0
acc = b[0]
// advance until the running total exceeds r; zero-weight items are skipped
while (acc <= r) {
    i++
    acc += b[i]
}

return i
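For comparison, a runnable Python sketch of the same running-total scan. It stops at the first item whose running total exceeds the random number, which is what keeps zero-weight items from being returned. It also draws the number with `randrange` rather than a modulo, which sidesteps the slight bias a modulo can introduce when the generator's range is not a multiple of the sum. `scan_for_index` and `weighted_pick` are names chosen for this example.

```python
import random

def scan_for_index(weights, r):
    """Return the first index whose running total exceeds r (0 <= r < sum(weights))."""
    acc = 0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    raise ValueError("r must be in range(sum(weights))")

def weighted_pick(weights, rng=random):
    """Draw r uniformly from 0 .. sum(weights) - 1 and scan for its index."""
    return scan_for_index(weights, rng.randrange(sum(weights)))

weights = [1, 15, 12, 0]  # AA, AB, BA, BB from the question
print([scan_for_index(weights, r) for r in (0, 1, 15, 16, 27)])  # [0, 1, 1, 2, 2]
```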

If efficiency is not a problem, you could create a key->value hash instead of an array. An upside of this would be that (if you format it well in the text) it would be very easy to update the values should the need arise. Something like:

{
    AA => 5, AB => 2, AC => 4,
    BA => 6, BB => 5, BC => 9,
    CA => 2, CB => 7, CC => 8
}

With this, you could easily retrieve the value for the sequence you want, and quickly find the entry to update. If the table is automatically generated and extremely large, it could help to get familiar with vim's use of regular expressions.
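In Python the hash maps naturally onto a dict, and a weighted draw over its keys can lean on the standard library's `random.choices`. This is only a sketch: `pick_pair` is an illustrative name, and the frequencies are the made-up ones from the hash above.

```python
import random

# Frequencies copied from the hash above.
freq = {
    "AA": 5, "AB": 2, "AC": 4,
    "BA": 6, "BB": 5, "BC": 9,
    "CA": 2, "CB": 7, "CC": 8,
}

def pick_pair(freq, rng=random):
    """Draw one key with probability proportional to its value."""
    pairs = list(freq)
    weights = [freq[p] for p in pairs]
    return rng.choices(pairs, weights=weights)[0]

freq["BB"] = 0          # updating an entry is a single assignment
print(pick_pair(freq))  # BB has weight 0 now, so it is never drawn
```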
