Vectorised np.random.choice with varying probabilities
I've trained a machine learning model using sklearn and want to simulate the result by sampling the predictions according to the predict_proba probabilities. So I want to do something like
samples = np.random.choice(a = possible_outcomes, size = (n_data, n_samples), p = probabilities)
where probabilities is an (n_data, n_possible_outcomes) array.
But np.random.choice only allows 1-D arrays for the p argument. I've currently gotten around this with a for-loop like the following:
from tqdm import trange  # assuming trange is tqdm's progress-bar version of range

sample_outcomes = np.zeros((len(probs), n_samples))
for i in trange(len(probs)):
    sample_outcomes[i, :] = np.random.choice(outcomes, size=n_samples, p=probs[i])
but that's relatively slow. Any suggestions to speed this up would be much appreciated!
Here is an example of what you can do, if I understand your question correctly:
import numpy as np
#create a list of indices
index_list = np.arange(len(possible_outcomes))
# sample indices based on the probabilities
choice = np.random.choice(a = index_list, size = n_samples, p = probabilities)
# get samples based on randomly chosen indices
samples = possible_outcomes[choice]
If I understood correctly, you want a vectorized way of applying choice several times, each time with a different probability vector. You could implement this by hand as follows:
import numpy as np
# for reproducibility
np.random.seed(42)
# number of samples
k = 5
# possible outcomes
outcomes = np.arange(10)
# generate a random probability matrix for 15 runs
probabilities = np.random.random((15, 10))
probs = probabilities / probabilities.sum(1)[:, None]
# generate the choices by picking those probabilities above a random generated number
# the higher the value in probs the higher the probability to pick it
choices = probs - np.random.random((15, 10))
# to pick the top k using argpartition need to multiply by -1
choices = -1 * choices
# pick the top k values
res = outcomes[np.argpartition(choices, k, axis=1)][:, :k]
# flatten to match the expected output
print(res.flatten())
Output
[1 8 2 5 3 6 4 8 7 0 1 5 9 3 7 1 4 9 0 8 5 0 4 3 6 8 5 1 2 6 5 3 2 0 6 5 4
2 3 7 7 9 4 6 1 3 6 4 2 1 4 9 3 0 1 6 9 2 3 8 5 4 7 6 1 5 3 8 2 1 1 0 9 7
4]
In the above example, the code samples 5 (k) elements from a population of 10 (outcomes) 15 times, each time with a different probability vector (one row of probs, which has shape 15 by 10).
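One caveat on the snippet above: taking the top k of probs minus uniform noise picks k distinct outcomes per row (so it samples without replacement), and the selection frequencies will not in general match probs exactly. If what you need is independent draws with replacement, as in the loop from the question, a possible alternative is row-wise inverse-CDF sampling. The following is only a sketch under that assumption; sample_rows is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def sample_rows(probs, outcomes, n_samples, rng=None):
    """Draw n_samples outcomes (with replacement) for each row of probs.

    Hypothetical helper: probs has shape (n_data, n_outcomes) and
    outcomes has shape (n_outcomes,).
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes)
    # Row-wise cumulative distribution, shape (n_data, n_outcomes).
    cdf = np.cumsum(probs, axis=1)
    cdf /= cdf[:, -1:]  # renormalise so the last column is exactly 1
    # One uniform draw per requested sample, shape (n_data, n_samples).
    u = rng.random((probs.shape[0], n_samples))
    # Number of CDF bins at or below each draw = index of the chosen outcome.
    idx = (u[:, :, None] >= cdf[:, None, :]).sum(axis=2)
    # Defensive clip so the index can never run past the last outcome.
    idx = np.minimum(idx, probs.shape[1] - 1)
    return outcomes[idx]

# Small usage example with made-up probabilities.
example_probs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0],
                          [0.5, 0.5, 0.0]])
example_outcomes = np.array([10, 20, 30])
example_samples = sample_rows(example_probs, example_outcomes, 1000,
                              rng=np.random.default_rng(0))
print(example_samples.shape)
```

The comparison builds a temporary boolean array of shape (n_data, n_samples, n_outcomes), so for very large inputs it may be worth processing the rows in chunks.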
I want to make sure I understand your problem correctly. Can you just create samples as an array of size n_data * n_samples and then use the resize method to get it to the right size?
samples = np.random.choice(a = possible_outcomes, size = n_data * n_samples, p = probabilities)
samples.resize((n_data, n_samples))
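One thing to check before relying on this: as the question notes, np.random.choice expects a 1-D p, so passing the full (n_data, n_possible_outcomes) matrix should fail before resize is ever reached. A minimal check, using a made-up 2 by 2 matrix (demo_probs and demo_outcomes are illustrative names only):

```python
import numpy as np

# Hypothetical (n_data, n_outcomes) probability matrix, one row per data point.
demo_probs = np.array([[0.2, 0.8],
                       [0.5, 0.5]])
demo_outcomes = np.array([0, 1])

try:
    np.random.choice(demo_outcomes, size=4, p=demo_probs)
    raised = False
except ValueError as err:
    # The legacy interface rejects a p that is not 1-D.
    raised = True
    print("ValueError:", err)
```

Reshaping the output afterwards only changes its layout; it would not make each row use its own probability vector.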
If you use NumPy's newer interface to random number generation, what you want should be simple: https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html
See here for an example:
rng = np.random.default_rng()  # newer Generator interface; it accepts a 2-D first argument
samples = rng.choice(array_of_samps, n_samples, p=probs)
Note, though, that len(probs) would equal array_of_samps.shape[0] (i.e. the number of rows in array_of_samps), not samples.shape[0]. Each row of samples would be a randomly chosen row of array_of_samps.
Judging from the shape of your sample_outcomes array, sample_outcomes is probably samples.T.
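To make the row-selection behaviour concrete, here is a small sketch with made-up data (the values of array_of_samps and probs below are purely illustrative). It uses Generator.choice, which samples along axis 0 by default when given a 2-D array:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three candidate rows, each filled with its own row index.
array_of_samps = np.array([[0, 0, 0],
                           [1, 1, 1],
                           [2, 2, 2]])
# One weight per row of array_of_samps; row 0 can never be picked.
probs = np.array([0.0, 0.5, 0.5])

# Draw 4 whole rows with replacement.
samples = rng.choice(array_of_samps, 4, p=probs)
print(samples.shape)  # (4, 3)
```

Keep in mind that this applies a single probability vector across the rows; it does not by itself give every row of the output its own probability vector, so whether it fits depends on how array_of_samps is constructed.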