
Randint doesn't always follow uniform distribution

I was playing around with the random library in Python to simulate a project I work on, and I found myself in a very strange position.

Let's say that we have the following code in Python:

from random import randint
import seaborn as sns

a = []
for i in range(1000000):
    a.append(randint(1,150))

sns.distplot(a)

The plot follows a “discrete uniform” distribution, as it should.

Plot: range between 1 and 150

However, when I change the range to 1 to 110, the plot has several peaks.

from random import randint
import seaborn as sns

a = []
for i in range(1000000):
    a.append(randint(1,110))

sns.distplot(a)

Plot: range from 1 to 110

My impression is that the peaks are at 0, 10, 20, 30, ... but I am not able to explain it.

Edit: The question is not a duplicate of the proposed one, since the problem in my case was the seaborn library and the way I visualised the data.

Edit 2: Following the suggestions in the answers, I tried to verify this by replacing the seaborn library. Using matplotlib instead, both graphs were the same:

from random import randint
import matplotlib.pyplot as plt

a = []
for i in range(1000000):
    a.append(randint(1,110))

plt.hist(a) 

Plot: from matplotlib

The problem seems to be in your grapher, seaborn, not in randint().

There are 50 bins in your seaborn distribution diagram, according to my count. It seems that seaborn is actually binning your returned randint() values in those bins, and there is no way to get an even spread of 110 values into 50 bins. Therefore you get those peaks where three values get put into a bin rather than the usual two values for the other bins. The heights of your peaks confirm this: they are 50% higher than the other bars, as expected for 3 binned values rather than 2.
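The arithmetic can be checked without any plotting library. The following is a minimal sketch, assuming the plot uses equal-width bins spanning the smallest to the largest value; the helper name `integers_per_bin` is made up for this illustration and is not part of seaborn.

```python
def integers_per_bin(low, high, n_bins):
    """Count the integers in [low, high] that fall into each of n_bins
    equal-width bins spanning low..high (the last bin includes high).

    Hypothetical helper: it only mimics equal-width binning and is an
    assumption about how the plot groups values.
    """
    span = high - low
    counts = [0] * n_bins
    for value in range(low, high + 1):
        # Integer arithmetic avoids floating-point trouble at bin edges.
        index = min((value - low) * n_bins // span, n_bins - 1)
        counts[index] += 1
    return counts


for n_bins in (50, 55, 10):
    counts = integers_per_bin(1, 110, n_bins)
    print(n_bins, "bins -> distinct values per bin:", sorted(set(counts)))
```

Under that assumption, 50 bins give ten bins holding three integers and forty holding two, which would produce bars of two different heights. With 55 or 10 bins every bin holds the same number of integers.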

Another way for you to check this is to force seaborn to use 55 bins for these 110 values (or perhaps 10 bins or some other divisor of 110). If you still get the peaks, then you should worry about randint().
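The same check can be approximated in plain Python before touching the plot. This is a rough sketch, assuming equal-width bins from 1 to 110; `bin_counts` and `peak_ratio` are hypothetical names for this illustration, and the sample size and seed are arbitrary choices.

```python
import random


def bin_counts(samples, low, high, n_bins):
    """Histogram counts for equal-width bins spanning low..high."""
    span = high - low
    counts = [0] * n_bins
    for value in samples:
        # The largest value is placed in the last bin.
        index = min((value - low) * n_bins // span, n_bins - 1)
        counts[index] += 1
    return counts


def peak_ratio(n_bins, n_samples=200_000, seed=0):
    """Ratio of the tallest bar to the shortest bar for randint(1, 110)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    samples = [rng.randint(1, 110) for _ in range(n_samples)]
    counts = bin_counts(samples, 1, 110, n_bins)
    return max(counts) / min(counts)


# Expect roughly 1.5 with 50 bins (peaks) and close to 1 with 55 bins (flat).
print("50 bins:", peak_ratio(50))
print("55 bins:", peak_ratio(55))
```

In seaborn itself, the corresponding change would presumably be to pass `bins=55` to `sns.distplot`, assuming the installed version still provides that function.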

To add to @RoryDaulton's excellent answer, I ran randint(1,110), generating a frequency count and then converting it to an R vector of counts like this:

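from random import randint

# Tally how often each value from 1 to 110 is drawn
hits = {i: 0 for i in range(1, 111)}
for i in range(1000000):
    hits[randint(1, 110)] += 1
hits = [hits[i] for i in range(1, 111)]
# Format the counts as an R vector literal, c(...)
s = 'c(' + ','.join(str(x) for x in hits) + ')'
print(s)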

c(9123,9067,9124,8898,9193,9077,9155,9042,9112,9015,8949,9139,9064,9152,8848,9167,9077,9122,9025,9159,9109,9015,9265,9026,9115,9169,9110,9364,9042,9238,9079,9032,9134,9186,9085,9196,9217,9195,9027,9003,9190,9159,9006,9069,9222,9205,8952,9106,9041,9019,8999,9085,9054,9119,9114,9085,9123,8951,9023,9292,8900,9064,9046,9054,9034,9088,9002,8780,9098,9157,9130,9084,9097,8990,9194,9019,9046,9087,9100,9017,9203,9182,9165,9113,9041,9138,9162,9024,9133,9159,9197,9168,9105,9146,8991,9045,9155,8986,9091,9000,9077,9117,9134,9143,9067,9168,9047,9166,9017,8944)

I then pasted this into an R console, reconstructed the observations and used R's hist() on the result, obtaining this histogram (with superimposed density curve):


As you can see, this confirms that the problem you observed isn't traceable to randint but is an artifact of sns.distplot().
