np.random.choice with a big probabilities array

I know that we can use a probability array for the choice function, but my question is how it works for big arrays. Let's assume that I want 1,000 random numbers between 0 and 65535. How can we define the probability array to have p=0.4 for numbers less than 1000 and p=0.6 for the rest?

I tried to pass the range of numbers to the choice function, but apparently, it doesn't work like that.

From the docs, each element of the argument p gives the probability for the corresponding element in a.
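As a quick illustration (a toy example, not from the original post), with a three-element a, p[i] is the probability of drawing a[i]:

import numpy as np

# Toy example: 30 is drawn roughly twice as often as 10 or 20.
a = np.array([10, 20, 30])
p = np.array([0.25, 0.25, 0.5])
print(np.random.choice(a, size=10, p=p))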

Since p and a need to have the same size, create a p of the same size as a:

import numpy as np

a = np.arange(65536)   # candidate values 0..65535
n_elem = len(a)

p = np.zeros_like(a, dtype=float)   # per-element probabilities, filled in below

Now, find all the elements of a that are less than 1000, and set p for those indices to 0.4 divided by the number of elements less than 1000. For this case, you can hardcode that calculation, since you know which elements of an arange are less than 1000:

p[:1000] = 0.4 / 1000    # the 1000 values below 1000 share probability 0.4
p[1000:] = 0.6 / 64536   # the remaining 65536 - 1000 = 64536 values share 0.6

For the general case where a is not derived from an arange, you could do:

lt1k = a < 1000        # boolean mask: True where a value is below 1000
n_lt1k = lt1k.sum()    # count of values below 1000

p[lt1k] = 0.4 / n_lt1k
p[~lt1k] = 0.6 / (n_elem - n_lt1k)

Note that p must sum to 1:

assert np.allclose(p.sum(), 1.0)

Now use a and p in choice:

selection = np.random.choice(a, size=(1000,), p=p)

To verify that the probability of selecting a value < 1000 is 40%, we can check how many are less than 1000:

print((selection < 1000).sum() / len(selection)) # should print a number close to 0.4

An alternative would be to treat this as a mixture of two distributions: one that draws uniformly from {0..999} with probability = 0.4, and another that draws uniformly from {1000..65535} with probability = 0.6.

Using choice for the mixture component makes sense, but then I'd use something else to draw the values, because when probabilities are passed to choice it does O(len(p)) work on every call to transform them. Generator.integers should be more efficient, as it can sample your uniform values directly.
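As a side note (an illustrative snippet, not part of the original answer): Generator.integers also accepts array-valued bounds and broadcasts them, which the combined example below relies on. Each output element i is drawn uniformly from [low[i], high[i]):

import numpy as np

rng = np.random.default_rng()

# Array-valued bounds are broadcast elementwise.
low  = np.array([0, 1000, 1000])
high = np.array([1000, 65536, 65536])
print(rng.integers(low, high))  # e.g. [417 23881 60102]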

Putting this together, I'd suggest using something like:

import numpy as np

rng = np.random.default_rng()

n = 1000
splits = np.array([0, 1000, 65536])

# draw weighted mixture components
s = rng.choice(2, n, p=[0.4, 0.6])
# draw uniform values according to component
result = rng.integers(splits[s], splits[s+1])

You can verify this is drawing from the correct distribution by evaluating np.mean(result < 1000) and checking that it's close to 0.4. The variance of that estimate is approximately 0.4*0.6 / n, so for n=1000, values in [0.37, 0.43] should be seen about 95% of the time.
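For reference, that interval worked out explicitly (a small check using the normal approximation, with 2 standing in for the usual 1.96 at 95% coverage):

import numpy as np

n = 1000
std = np.sqrt(0.4 * 0.6 / n)          # std dev of the sample proportion, ~0.0155
print(0.4 - 2 * std, 0.4 + 2 * std)   # ~0.369 to ~0.431, i.e. roughly [0.37, 0.43]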

This method should remain fast as max(splits) - min(splits) grows, while Pranav's solution of using choice directly will slow down.
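A rough benchmark sketch of that claim (absolute timings depend on hardware, and the helper names here are mine, not from the original answer):

import numpy as np
from timeit import timeit

rng = np.random.default_rng()

def direct_choice(hi):
    # choice with an explicit p processes the whole probability array per call
    a = np.arange(hi)
    p = np.where(a < 1000, 0.4 / 1000, 0.6 / (hi - 1000))
    return lambda: rng.choice(a, size=1000, p=p)

def mixture(hi):
    # the mixture approach does constant work per sample, whatever the range
    splits = np.array([0, 1000, hi])
    def draw():
        s = rng.choice(2, 1000, p=[0.4, 0.6])
        return rng.integers(splits[s], splits[s + 1])
    return draw

for hi in (2**16, 2**20, 2**24):
    print(hi, timeit(direct_choice(hi), number=10),
              timeit(mixture(hi), number=10))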
