Is there a bug in binning in matplotlib histograms? Or non-randomness of the rvs method in scipy.stats

The following code consistently produces histograms with empty bins, even when the number of samples is large. The empty bins seem to be regularly spaced, but they are the same width as the other, normal bins. This is obviously wrong; why is this happening? It seems like either the rvs method is non-random, or the hist binning procedure is broken. Also, try changing the number of bins to 50, and another oddity emerges: it looks like every other bin has a spuriously high count.

""" An example of how to plot histograms using matplotlib
This example samples from a Poisson distribution, plots the histogram
and overlays the Gaussian with the same mean and standard deviation

"""

from scipy.stats import poisson
from scipy.stats import norm
from matplotlib import pyplot as plt
#import matplotlib.mlab as mlab

EV = 100   # the expected value of the distribution
bins = 100 # number of bins in our histogram
n = 10000
RV = poisson(EV)  # Define a Poisson-distributed random variable

samples = RV.rvs(n)  # create a list of n random variates drawn from that random variable

events, edges, patches = plt.hist(samples, bins, density=True, histtype='stepfilled')  # make a histogram (density replaces the removed normed keyword)

print(events)  # When I run this, some bins are empty, even when the number of samples is large

# the pyplot.hist method returns a tuple containing three items. These are events, a list containing
# the counts for each bin, edges, a list containing the values of the lower edge of each bin
# the final element of edges is the value of the high edge of the final bin
# patches, I'm not quite sure about, but we don't need at any rate
# note that we really only need the edges list, but we need to unpack all three elements of the tuple
# for things to work properly, so events and patches here are really just dummy variables

mean = RV.mean()  # If we didn't know these values already, the mean and std methods are convenience
sd = RV.std()     # methods that allow us to retrieve the mean and standard deviation for any random variable

print("Mean is:", mean, " SD is: ", sd)

#print edges

Y = norm.pdf(edges, mean, sd)  # this is how to do it with the sciPy version of a normal PDF
# edges is a list, so this will return a list Y with normal pdf values corresponding to each element of edges

binwidth = (len(edges) - 1) / (max(edges) - min(edges))  # number of bins per unit of x, i.e. 1 / (bin width)
Y = Y * binwidth
print("Binwidth is:", 1/binwidth)
# The above is a fix to "de-normalize" the normal distribution to properly reflect the bin widths

#Q = [edges[i+1] - edges[i] for i in range(len(edges)-1)]
#print Q  # This was to confirm that the bins are equally sized, which seems to be the case.

plt.plot(edges, Y)
plt.show()


Empty bins are to be expected when your input data takes only integer values (as is the case for the Poisson RV) and you have more bins than there are integers in the range of the data. In that case some bins can never capture a sample. With other bin counts, some bins capture the samples for more than one integer value while their neighbours capture only one. Change the number of bins and the range so that each bin captures exactly one integer, and the gaps go away.

plt.hist(samples,
         range=(0, samples.max()),
         bins=samples.max() + 1,
         density=True, histtype='stepfilled')  # density replaces the removed normed keyword
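
The explanation above can be checked without any random numbers. The following is a minimal sketch that counts how many distinct integer values land in each equal-width bin, using `np.histogram` as a stand-in for the binning that `plt.hist` performs. The range `lo = 60`, `hi = 140` is a hypothetical sample range picked only for illustration, and the helper `integers_per_bin` is not part of the original code.

```python
import numpy as np

# Hypothetical range of the samples, assumed only for this illustration.
lo, hi = 60, 140
integers = np.arange(lo, hi + 1)  # each possible integer value, once

def integers_per_bin(n_bins):
    """Return how many distinct integer values fall into each equal-width bin."""
    counts, _ = np.histogram(integers, bins=n_bins, range=(lo, hi))
    return counts

counts_100 = integers_per_bin(100)  # more bins than integers
counts_50 = integers_per_bin(50)    # fewer bins than integers

# With 100 bins, some bins contain no integer at all and stay empty
# no matter how many samples are drawn.
print("100 bins, empty bins:", int((counts_100 == 0).sum()))

# With 50 bins, bins hold either one or two integers, so the bins that
# hold two collect roughly twice as many samples as their neighbours.
print("50 bins, integers per bin:", sorted(set(counts_50.tolist())))

# Unit-width bins centred on the integers hold exactly one integer each.
edges = np.arange(lo - 0.5, hi + 1.5)
counts_unit, _ = np.histogram(integers, bins=edges)
print("unit bins, integers per bin:", sorted(set(counts_unit.tolist())))
```

Passing half-integer `edges` like these as `bins=edges` to `plt.hist` is an alternative to adjusting `range` and `bins` separately, and it makes every bin exactly one unit wide.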

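
Once each bin is one unit wide, the heights of a density-normalized histogram are simply relative frequencies, so the Gaussian from the question can be compared with them directly, with no "de-normalizing" factor. The sketch below illustrates this numerically. It is an assumption-laden example rather than the original poster's method: it draws samples with NumPy's `default_rng` instead of `scipy.stats.poisson`, uses an arbitrary seed, and writes out the normal pdf by hand so that it depends on NumPy only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen for repeatability
mu = 100                        # expected value, as in the question
samples = rng.poisson(mu, size=10_000)

# Unit-width bins centred on every integer from 0 to the largest sample.
edges = np.arange(-0.5, samples.max() + 1.5)
density, _ = np.histogram(samples, bins=edges, density=True)
centres = edges[:-1] + 0.5

# Normal pdf with the same mean and standard deviation as the Poisson
# distribution (its standard deviation is sqrt(mu)).
sd = np.sqrt(mu)
normal_pdf = np.exp(-0.5 * ((centres - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# The bar heights sum to 1 because every bin has width 1.
print("sum of bar heights:", density.sum())
print("largest gap to the normal pdf:", np.abs(density - normal_pdf).max())
```

For a plot, the same `edges` can be passed to `plt.hist(samples, bins=edges, density=True)` and the curve drawn with `plt.plot(centres, normal_pdf)`.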
