Normed histogram y-axis larger than 1

Sometimes when I create a histogram, using say seaborn's distplot function with norm_hist = True, the y-axis stays below 1, as expected for a PDF. Other times it takes on values greater than one.

For example, if I run

        import numpy as np
        import seaborn as sns

        sns.set()
        x = np.random.randn(10000)
        ax = sns.distplot(x)

then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.

What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:

        new_var = data/sum(data)

so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30, for example) whether the norm_hist argument is True or not.

What interpretation can I give when the y-axis has such a large range?

I think what is happening is that my data is concentrated closely around zero, so for the histogram to have an area equal to 1 (under the KDE, for example) its height has to be larger than 1. But since probabilities can't be above 1, what does the result mean?

Also, how can I get these functions to show probability on the y-axis?

The rule isn't that the heights of all the bars should sum to one. The rule is that the areas of all the bars should sum to one. When the bars are very narrow, their heights can sum to a large number even though their areas sum to one. The height of a bar times its width is the probability that a value falls in that range. For the height to equal the probability, you need bars of width one.
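The area rule can be checked numerically without plotting anything. The following is a minimal sketch using `np.histogram` with `density=True`; the seed, the sample size, the spread of 0.01 and the choice of 80 bins are illustrative assumptions, not values taken from the question's data.

```python
import numpy as np

# Assumed sample: data concentrated closely around zero (seed only for reproducibility)
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=0.01, size=100_000)

# density=True scales the bar heights so that the total bar area is 1
heights, edges = np.histogram(data, bins=80, density=True)
widths = np.diff(edges)

# Height times width is the probability mass of each bin
probabilities = heights * widths

print("tallest bar height:", heights.max())        # far above 1
print("sum of bar heights:", heights.sum())        # large, not 1
print("sum of bar areas:  ", probabilities.sum())  # 1 up to rounding

# The same probabilities are the raw counts divided by the sample size
counts, _ = np.histogram(data, bins=edges)
print("matches counts / N:", np.allclose(probabilities, counts / data.size))
```

To put probabilities rather than densities on the y-axis, one option is to plot `counts / data.size` directly, or to pass `weights=np.ones_like(data) / data.size` to `plt.hist`. Newer seaborn releases also offer `sns.histplot(data, stat="probability")`, which should give the same per-bin fractions.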

Here is an example to illustrate what's going on.

import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns


fig, axs = plt.subplots(ncols=2, figsize=(14, 3))

a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')

a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[0][40].set_color('r')

plt.show()

[Figure: demo plot]

The plot on the left uses bins 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is 40*0.001 = 0.04.

The plot on the right uses exactly the same data, but measured in millimeters. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is also 0.04, because the bin width is 1.
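The same comparison can be made in numbers rather than by reading bar heights off a plot. This sketch assumes a fresh random sample (seed chosen arbitrarily) and uses `np.linspace` for the bin edges instead of `np.arange`, so the values will not match the figure exactly.

```python
import numpy as np

# Assumed sample, generated in meters and then converted to millimeters
rng = np.random.default_rng(1)
meters = rng.normal(0.0, 0.01, 100_000)
millimeters = meters * 1000

# 80 bins in each unit: 0.001 m wide and 1 mm wide respectively
dens_m, edges_m = np.histogram(meters, bins=np.linspace(-0.04, 0.04, 81), density=True)
dens_mm, edges_mm = np.histogram(millimeters, bins=np.linspace(-40, 40, 81), density=True)

# Density heights depend on the unit of measurement...
print("tallest bar, meters:     ", dens_m.max())
print("tallest bar, millimeters:", dens_mm.max())

# ...but height times width, the probability per bin, does not
prob_m = dens_m * np.diff(edges_m)
prob_mm = dens_mm * np.diff(edges_mm)
print("probability of tallest bin (m): ", prob_m.max())
print("probability of tallest bin (mm):", prob_mm.max())
```

The two density maxima differ by a factor of about 1000, while the per-bin probabilities agree up to the handful of samples that floating-point rounding moves across a bin edge.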

PS: As an example of a distribution whose probability density function has regions larger than 1, see the Pareto distribution with α = 3.
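A short sketch of that Pareto case, using only the standard library. The scale parameter `x_m = 1` is an assumption (the text only fixes α = 3), and the integration range and step size are arbitrary choices for a rough numerical check.

```python
def pareto_pdf(x, alpha=3.0, x_m=1.0):
    """Pareto density alpha * x_m**alpha / x**(alpha + 1) for x >= x_m, else 0."""
    if x < x_m:
        return 0.0
    return alpha * x_m**alpha / x ** (alpha + 1)

# The density at the left edge of the support is alpha / x_m = 3, well above 1
peak = pareto_pdf(1.0)
print("pdf at x = 1:", peak)

# Midpoint-rule integration over [1, 101]; the tail beyond 101 holds about 1e-6 of the mass
step = 0.001
n_steps = 100_000
area = sum(pareto_pdf(1.0 + (i + 0.5) * step) for i in range(n_steps)) * step
print("approximate area under the pdf:", area)
```

With these assumed parameters the density stays above 1 for x between 1 and 3**0.25 (about 1.32), yet the total area is still 1.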
