
np.random.choice with a big probabilities array

I know that we can pass a probability array to the choice function, but my question is how that works for big arrays. Let's assume I want 1,000 random numbers between 0 and 65535. How can we define the probability array so that the numbers less than 1000 have a total probability of 0.4 and the rest have a total probability of 0.6?

I tried to pass the range of numbers to the choice function, but apparently, it doesn't work like that.

From the docs, each element of the argument p gives the probability for the corresponding element in a.
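To illustrate that correspondence, here is a minimal (hypothetical) three-element example, where each entry of p is the probability of drawing the element of a at the same index:

```python
import numpy as np

a = np.array([10, 20, 30])
p = np.array([0.5, 0.3, 0.2])  # one probability per element of a; must sum to 1

sample = np.random.choice(a, size=10_000, p=p)

# about half of the draws should be 10
print(np.mean(sample == 10))
```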

Since p and a need to have the same size, create a p of the same size as a:

a = np.arange(65536)
n_elem = len(a)

p = np.zeros_like(a, dtype=float)

Now, find all the elements of a less than 1000, and set p at those indices to 0.4 divided by the number of elements less than 1000. In this case you can hardcode the calculation, since you know exactly which elements of an arange are less than 1000:

p[:1000] = 0.4 / 1000
p[1000:] = 0.6 / 64536

For the general case, where a is not derived from an arange, you could do:

lt1k = a < 1000
n_lt1k = lt1k.sum()

p[lt1k] = 0.4 / n_lt1k
p[~lt1k] = 0.6 / (n_elem - n_lt1k)

Note that p must sum to 1:

assert np.allclose(p.sum(), 1.0)

Now use a and p in choice :

selection = np.random.choice(a, size=(1000,), p=p)

To verify that the probability of selecting a value < 1000 is 40%, we can check how many are less than 1000:

print((selection < 1000).sum() / len(selection)) # should print a number close to 0.4

An alternative would be to treat this as a mixture of two distributions: one that draws uniformly from {0..999} with probability = 0.4, and another that draws uniformly from {1000..65535} with probability = 0.6.

Using choice to pick the mixture component makes sense, but I'd draw the actual values with something else: when probabilities are passed to choice, it does O(len(p)) work on every call to transform them. Generator.integers is more efficient here, since it can sample your uniform values directly.

Putting this together, I'd suggest using something like:

import numpy as np

rng = np.random.default_rng()

n = 1000
splits = np.array([0, 1000, 65536])

# draw weighted mixture components
s = rng.choice(2, n, p=[0.4, 0.6])
# draw uniform values according to component
result = rng.integers(splits[s], splits[s+1])

You can verify this is drawing from the correct distribution by evaluating np.mean(result < 1000) and checking it's "close" to 0.4. The variance of that is approximately 0.4*0.6 / n , so, for n=1000 , values in [0.37, 0.43] should be seen 95% of the time.
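As a sketch of that check, reusing the snippet above and printing the observed fraction (which should land in [0.37, 0.43] about 95% of the time for n=1000):

```python
import numpy as np

rng = np.random.default_rng()

n = 1000
splits = np.array([0, 1000, 65536])

s = rng.choice(2, n, p=[0.4, 0.6])
result = rng.integers(splits[s], splits[s + 1])

frac = np.mean(result < 1000)
print(frac)  # a number close to 0.4

# two-sigma band: 0.4 +/- 2 * sqrt(0.4 * 0.6 / n) ~= [0.37, 0.43]
```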

This method stays fast as max(splits) - min(splits) grows, whereas Pranav's solution of passing p directly to choice slows down.
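A rough timing comparison of the two approaches can be sketched with timeit (exact numbers depend on hardware; this is an illustration, not a rigorous benchmark):

```python
import timeit

import numpy as np

rng = np.random.default_rng()

hi = 65536
a = np.arange(hi)
p = np.where(a < 1000, 0.4 / 1000, 0.6 / (hi - 1000))
splits = np.array([0, 1000, hi])

def direct():
    # choice re-processes the full 65536-element p on every call
    return rng.choice(a, size=1000, p=p)

def mixture():
    # only a 2-element p, plus direct uniform integer sampling
    s = rng.choice(2, 1000, p=[0.4, 0.6])
    return rng.integers(splits[s], splits[s + 1])

print("direct: ", timeit.timeit(direct, number=100))
print("mixture:", timeit.timeit(mixture, number=100))
```

The gap widens as the range grows, because direct's per-call cost scales with len(p) while mixture's does not.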

