Matplotlib histogram misplaced and missing bars

I have large data files, so I use NumPy's histogram function (the same one Matplotlib uses) to build histograms manually and update them batch by batch. However, when I plot the result, the bars appear shifted.

This is the code I use to create and update the histograms in batches. Note that all histograms share the same bins.

temp = np.histogram(batch, bins=np.linspace(0, 40, 41))
hist += temp[0]

The code above is repeated as I parse the data files. For example, a small data set yields the following final histogram data:

[8190, 666, 278, 145, 113, 83, 52, 48, 45, 44, 45, 29, 28, 45, 29, 15, 16, 10, 17, 7, 15, 6, 10, 7, 3, 5, 7, 4, 2, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 29]
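For readers who want to reproduce the batch accumulation described above, here is a minimal sketch. The random batches are a hypothetical stand-in for the parsed data files, and the names `bins`, `hist`, and `batches` are assumptions, not taken from the original code.

```python
import numpy as np

# Shared bin edges: 41 edges define 40 bins over [0, 40].
bins = np.linspace(0, 40, 41)

# Running total, one integer counter per bin.
hist = np.zeros(len(bins) - 1, dtype=np.int64)

# Hypothetical stand-in for the batches read from the data files.
rng = np.random.default_rng(0)
batches = [rng.uniform(0, 40, size=1000) for _ in range(5)]

for batch in batches:
    # np.histogram returns (counts, edges); only the counts are accumulated.
    counts, _ = np.histogram(batch, bins=bins)
    hist += counts

print(hist.sum())
```

Because every batch is binned with the same edges, adding the count arrays gives the same result as binning all the data at once.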

Below is the plotting code.

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import numpy as np
plt.xticks(np.linspace(0, 1, 11))
plt.hist([i/40 for i in range(40)], bins=np.linspace(0, 1, 41), weights=scores, rwidth=0.7)
plt.yscale('log', nonpositive='clip')  # this keyword was named nonposy before Matplotlib 3.3

The resulting figure is quite strange. It shows no bar at [0.475, 0.5), and I expect the 0.975 bin, which covers [0.975, 1.0], to hold the final count of 29. Instead, I see that bar at the [0.950, 0.975) position. I thought this might have to do with using bins and linspace, but the dummy x array and the weights have the same size.

I've never seen this kind of behavior. I also thought it might be caused by the ranges being half-open, [x, x+width), but I haven't had issues with that before.

A note on using linspace: it specifies edges, so 40 bins are specified by 41 edges.

In [2]: np.linspace(0,1,41)                                                     
Out[2]: 
array([0.   , 0.025, 0.05 , 0.075, 0.1  , 0.125, 0.15 , 0.175, 0.2  ,
       0.225, 0.25 , 0.275, 0.3  , 0.325, 0.35 , 0.375, 0.4  , 0.425,
       0.45 , 0.475, 0.5  , 0.525, 0.55 , 0.575, 0.6  , 0.625, 0.65 ,
       0.675, 0.7  , 0.725, 0.75 , 0.775, 0.8  , 0.825, 0.85 , 0.875,
       0.9  , 0.925, 0.95 , 0.975, 1.   ])

In [3]: len(np.linspace(0,1,41))                                                
Out[3]: 41

It seems you're using plt.hist with the idea of putting one value into each bin, thereby simulating a bar plot. Because the x-values fall exactly on the bin bounds, floating-point rounding can push them into the neighboring bin. That can be mitigated by shifting the x-values by half a bin width. The simplest approach is to draw the bars directly.

The following code creates a bar plot with the given data, with each bar at the center of the region it represents. As a check, the bars are measured again at the end and their heights displayed.

from  matplotlib.ticker import MultipleLocator
import matplotlib.pyplot as plt
import numpy as np

scores =[8190,666,278,145,113,83,52,48,45,44,45,29,28,45,29,15,16,10,17,7,15,6,10,7,3,5,7,4,2,3,0,1,0,0,0,0,0,0,0,29]
binbounds = np.linspace(0, 1, 41)
rwidth = 0.7
width = binbounds[1] - binbounds[0]
bars = plt.bar(binbounds[:-1] + width / 2, height=scores, width=width * rwidth, align='center')
plt.gca().xaxis.set_major_locator(MultipleLocator(0.1))
plt.gca().xaxis.set_minor_locator(MultipleLocator(0.05))
plt.yscale('log', nonpositive='clip')  # this keyword was named nonposy before Matplotlib 3.3
for rect in bars:
    x, y = rect.get_xy()
    w = rect.get_width()
    h = rect.get_height()
    plt.text(x + w / 2, h, f'{h}\n', ha='center', va='center')
plt.show()

[Image: resulting plot]

PS: To see what's happening with the original histogram, just do a test plot without the weights:

plt.hist([i/40 for i in range(40)], bins=np.linspace(0, 1, 41), rwidth=1, ec='k')
plt.plot([i/40 for i in range(40)], [0.5] * 40, 'ro')
plt.xticks(np.linspace(0, 1, 11))

The red dots show where the x-values are. Some fall into the correct bin; others fall into the neighboring bin, which then holds 2 values.

[Image: histogram without weights]
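The same check can be done numerically, without plotting. The sketch below counts how many of the x-values land in each bin, first with the values on the nominal bin edges and then shifted by half a bin width. The variable names are illustrative, and the exact bins affected may depend on the platform's floating-point behavior.

```python
import numpy as np

edges = np.linspace(0, 1, 41)                  # 41 edges -> 40 bins
x = np.array([i / 40 for i in range(40)])      # one value per intended bin

# Values placed on the nominal left edges of the bins.
on_edge, _ = np.histogram(x, bins=edges)
print("bins with a count other than 1:", np.flatnonzero(on_edge != 1))

# Same values shifted by half a bin width, so they sit at the bin centers.
half = (edges[1] - edges[0]) / 2
centered, _ = np.histogram(x + half, bins=edges)
print("one value per bin after shifting:", bool((centered == 1).all()))
```

Any bin reported by the first print has either lost its value to a neighbor or received an extra one, which is what the missing and doubled bars in the test plot show.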

To create a histogram with the x-values at the center of each bin:

plt.hist([i/40 + 1/80 for i in range(40)], bins=np.linspace(0, 1, 41), rwidth=1, ec='k')
plt.plot([i/40 + 1/80 for i in range(40)], [0.5] * 40, 'ro')
plt.xticks(np.linspace(0, 1, 11))
plt.yticks([0, 1])

[Image: x-values at the bin centers]
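As an alternative to shifting the values, one commonly used idiom for pre-binned counts is to pass the left bin edges themselves as the x-values, so each value compares equal to its own edge. This is a sketch of that idiom rather than something taken from the answer above; `edges` is an assumed name, and the Agg backend is used as in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the question
import matplotlib.pyplot as plt
import numpy as np

scores = [8190, 666, 278, 145, 113, 83, 52, 48, 45, 44, 45, 29, 28, 45, 29,
          15, 16, 10, 17, 7, 15, 6, 10, 7, 3, 5, 7, 4, 2, 3, 0, 1, 0, 0, 0,
          0, 0, 0, 0, 29]
edges = np.linspace(0, 1, 41)

# x-values are taken from the same array as the bin edges, so value i and
# left edge i are bit-for-bit identical and value i lands in bin i.
n, _, _ = plt.hist(edges[:-1], bins=edges, weights=scores, rwidth=0.7)

plt.yscale("log")
plt.xticks(np.linspace(0, 1, 11))

# The heights returned by plt.hist should match the original counts.
print(np.allclose(n, scores))
```

The key point is that the x-values and the edges come from one computation, instead of `i/40` on one side and `np.linspace` on the other.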

The problem is due to rounding error in np.linspace(0, 1, 41).

bins = []
for abin in np.linspace(0, 1, 41):
    bins.append(abin)

The code above yields

bins = [0.0, 0.025, 0.05, 0.07500000000000001, 0.1, 0.125, 0.15000000000000002, ...] 

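, which causes the problem: an edge such as 0.07500000000000001 is slightly larger than the x-value 3/40, so that value falls into the previous bin.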

However, when you use np.round(np.linspace(0, 1, 41), 4) for the bins, the problem is fixed.

Example:

plt.hist([i/40 for i in range(40)], bins=np.round(np.linspace(0, 1, 41), 4), rwidth=1, ec='k')
plt.plot([i/40 for i in range(40)], [0.5] * 40, 'ro')
plt.xticks(np.linspace(0, 1, 11))
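To confirm the effect of rounding without looking at a plot, the following sketch compares each x-value with the edge it is meant to sit on, for both the raw and the rounded edges. The names `raw_edges` and `rounded_edges` are illustrative, and the fix relies on the rounded edges coinciding with `i/40` for this particular grid.

```python
import numpy as np

x = np.array([i / 40 for i in range(40)])
raw_edges = np.linspace(0, 1, 41)
rounded_edges = np.round(raw_edges, 4)

# Indices where the left edge is strictly greater than its x-value,
# i.e. where the value would slip into the previous bin.
print("raw edges too large at:", np.flatnonzero(raw_edges[:-1] > x))
print("rounded edges too large at:", np.flatnonzero(rounded_edges[:-1] > x))

# With the rounded edges, every bin should receive exactly one value.
counts, _ = np.histogram(x, bins=rounded_edges)
print("one value per bin:", bool((counts == 1).all()))
```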
