np.random.choice not producing expected histogram
I'm looking to generate random, normally distributed numbers between 0 and 1, but as the mean moves closer to 1 or 0, the right or left side respectively becomes "squished".
After modifying the normal distribution and playing around with sliders in GeoGebra, I came up with the following:
Next I needed to create a method in Python that would generate random samples distributed according to this PDF.
Originally I thought the only way to do this was to derive a new equation for generating random numbers, as in the Box-Muller proof (which I got by following along with this tutorial).
However, I thought there might be an easier way to do this using the numpy library's np.random.choice() method.
After all, I should be able to integrate the PDF at a very small step size and get the (approximate) probability for each step.
So with that I wrote the following script:
# Standard libs
import math

# Third party libs
import numpy as np
from alive_progress import alive_bar
from matplotlib import pyplot as plt


class RandomNumberGenerator:

    def __init__(self):
        pass

    def clamped_normal_distribution(self, mu: float,
                                    stddev: float, x: float):
        """ Computes a value from the clamped normal distribution """
        divideByZeroAvoider = 1e-5
        if x < 0 or x > 1:
            return 0
        elif x >= 0 and x <= mu:
            return math.exp(-0.5*( (x - mu) / (stddev) )**2 \
                * (1/(x**2 + divideByZeroAvoider)))
        elif x <= 1 and x > mu:
            return math.exp(-0.5*( (x - mu) / (stddev) )**2 \
                * (1/((1-x)**2 + divideByZeroAvoider)))
        else:
            print("This shouldn't happen!: {}".format(x))
            return 0


if __name__ == '__main__':
    rng = RandomNumberGenerator()
    mu = 0.7
    stddev = 1
    stepSize = 1e-3
    x = np.linspace(stepSize,1, int(1/stepSize) - 1)

    # Determine the total area under the curve
    samples = []
    print("Generating samples...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            samples.append(rng.clamped_normal_distribution(
                mu, stddev, i))
            bar()
    area = np.trapz(samples, dx=stepSize)
    print("Area = {}".format(area))

    # Determine the probability of x falling in a specific interval
    probabilities = []
    print("Generating probabilities...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            lead = rng.clamped_normal_distribution(mu,
                stddev, i)
            lag = rng.clamped_normal_distribution(mu,
                stddev, i - stepSize)
            probability = np.trapz(
                np.array([lag, lead]),
                dx=stepSize)
            # Divide by the area because this isn't a standard normal
            probabilities.append(probability / area)
            bar()

    # Should be approximately 1
    print("Probability: {}".format(sum(probabilities)))
    plt.plot(x, probabilities)
    plt.show()

    y = []
    print("Performing distribution test...")
    testSize = int(10e3)
    with alive_bar(testSize) as bar:
        for _ in range(testSize):
            randSamp = np.random.choice(samples, p=probabilities)
            y.append(randSamp)
            bar()
    plt.hist(y,300)
    plt.show()
The first plot, of the probabilities against the linearly spaced samples, looks promising, giving me the following graph:
However, if we use these samples as choices with the given probabilities, we get the following histogram:
I have no idea why this isn't working correctly.
I've tried other (smaller) examples like the ones listed on the numpy website, and they produce histograms that match the given probabilities array.
I'd really appreciate some advice/intuition if at all possible :).
It looks like there is a problem with the first argument in the call np.random.choice(samples, p=probabilities). The first argument should be x, not samples.
ADDITION BY AUTHOR:
The reason for this is that samples holds the values of the curve (i.e. the y-axis and NOT the x-axis).
Thus the values with the highest probabilities (i.e. the samples around the mean) all have a value of ~1, which is why we see such a massive spike around the value 1.
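To make the mix-up concrete, here is a small sketch with made-up numbers (the five grid points and curve heights below are assumptions for illustration, not values from the script above). It draws once from the grid positions and once from the curve heights, using the same weights both times:

```python
import numpy as np

# Seeded only so the sketch is repeatable
rng = np.random.default_rng(0)

# Toy grid positions and toy curve heights (illustrative values only)
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
heights = np.array([0.05, 0.2, 0.6, 1.0, 0.4])

# Normalise the heights so they can serve as probabilities
probabilities = heights / heights.sum()

# Intended: draw positions on the x-axis, weighted by the curve
draws_x = rng.choice(x, size=10_000, p=probabilities)

# The mistake: draw the curve heights themselves with the same weights
draws_heights = rng.choice(heights, size=10_000, p=probabilities)

print("most common position:", np.bincount(np.searchsorted(x, draws_x)).argmax())
print("largest height drawn:", draws_heights.max())
```

In the first case the draws land on x-axis positions, most often at the position where the curve peaks. In the second case the draws are heights, and the most likely height is the peak height itself, so a histogram of them piles up at the top of the curve's range rather than at the mean.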
Changing this to x gives us the following graphs (for 10e3 samples):
Working as expected, very nice.
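For reference, the same idea can be written without the Python loops. This is only a sketch under a few assumptions: the helper names clamped_normal_pdf and sample_clamped_normal are hypothetical, the grid uses bin midpoints, and the weights are simply the curve values normalised to sum to 1 (a midpoint rule rather than the trapezoid rule used in the question):

```python
import numpy as np


def clamped_normal_pdf(x, mu, stddev, eps=1e-5):
    """Unnormalised curve from the question, evaluated on a whole array."""
    x = np.asarray(x, dtype=float)
    # Left of mu the exponent is divided by x^2, right of mu by (1 - x)^2;
    # eps plays the role of divideByZeroAvoider
    scale = np.where(x <= mu, x**2, (1.0 - x)**2) + eps
    values = np.exp(-0.5 * ((x - mu) / stddev) ** 2 / scale)
    # The curve is defined to be zero outside [0, 1]
    return np.where((x < 0) | (x > 1), 0.0, values)


def sample_clamped_normal(mu, stddev, size, step=1e-3, rng=None):
    """Draw approximate samples by picking grid points weighted by the curve."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(round(1.0 / step))
    # Candidate values: midpoints of n equal-width bins on [0, 1]
    x = (np.arange(n) + 0.5) * step
    weights = clamped_normal_pdf(x, mu, stddev)
    probabilities = weights / weights.sum()
    # First argument is the grid x, not the curve values
    return rng.choice(x, size=size, p=probabilities)


draws = sample_clamped_normal(mu=0.7, stddev=1.0, size=10_000,
                              rng=np.random.default_rng(0))
```

Passing size to choice draws all the samples in one call, so the per-sample loop is no longer needed. The result can be passed straight to plt.hist(draws, 300) as in the original script.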