
Matplotlib histogram misplaced and missing bars

I have large data files, so I use np.histogram (the same function matplotlib uses) to build histograms manually and update them batch by batch. However, when I plot the result, the bars appear shifted.

This is the code I use to manually create and update histograms in batches. Note that all histograms share the same bins.

temp = np.histogram(batch, bins=np.linspace(0, 40, 41))
hist += temp[0]

The code above is repeated as I parse the data files. For example, a small data set would have the following as the final histogram data:

[8190, 666, 278, 145, 113, 83, 52, 48, 45, 44, 45, 29, 28, 45, 29, 15, 16, 10, 17, 7, 15, 6, 10, 7, 3, 5, 7, 4, 2, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 29]
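The batch update described above can be sketched as follows. This is only a minimal sketch, not the actual pipeline from the question: the `batches` generator and its random data are hypothetical stand-ins for parsing the data files. The check at the end confirms that summing per-batch counts gives the same result as a single histogram over all the data, provided every batch uses the same edges.

```python
import numpy as np

# Shared bin edges: 40 bins need 41 edges (same edges as in the question).
bins = np.linspace(0, 40, 41)

def batches(n_batches=5, size=1000, seed=0):
    """Hypothetical stand-in for reading the data files batch by batch."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        yield rng.uniform(0, 40, size)

# Running total, one counter per bin.
hist = np.zeros(len(bins) - 1, dtype=np.int64)
seen = []  # kept only for the consistency check below
for batch in batches():
    counts, _ = np.histogram(batch, bins=bins)
    hist += counts
    seen.append(batch)

# Counts are additive when every batch uses the same edges.
full_counts, _ = np.histogram(np.concatenate(seen), bins=bins)
print(np.array_equal(hist, full_counts))
```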

Below is the plotting code.

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import numpy as np
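# scores holds the accumulated histogram counts shown above (40 values)
plt.xticks(np.linspace(0, 1, 11))
plt.hist([i/40 for i in range(40)], bins=np.linspace(0, 1, 41), weights=scores, rwidth=0.7)
plt.yscale('log', nonpositive='clip')  # this keyword was 'nonposy' before Matplotlib 3.3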

The resulting figure is quite strange. There is no bar at [0.475, 0.5), and I expected the last bin, [0.975, 1.0], to hold the final count of 29. Instead, that bar appears at the [0.950, 0.975) position. I thought this might have to do with how I use bins and linspace, but the decoy array and the weights have the same size.


I've never seen this kind of behavior before. I also wondered whether it comes from the bin ranges being half-open, [x, x+width), but that has not caused problems for me in the past.

A note on using linspace: it specifies the bin edges, so 40 bins require 41 edges.

In [2]: np.linspace(0,1,41)                                                     
Out[2]: 
array([0.   , 0.025, 0.05 , 0.075, 0.1  , 0.125, 0.15 , 0.175, 0.2  ,
       0.225, 0.25 , 0.275, 0.3  , 0.325, 0.35 , 0.375, 0.4  , 0.425,
       0.45 , 0.475, 0.5  , 0.525, 0.55 , 0.575, 0.6  , 0.625, 0.65 ,
       0.675, 0.7  , 0.725, 0.75 , 0.775, 0.8  , 0.825, 0.85 , 0.875,
       0.9  , 0.925, 0.95 , 0.975, 1.   ])

In [3]: len(np.linspace(0,1,41))                                                
Out[3]: 41

It seems you're using plt.hist to put exactly one value into each bin, in effect simulating a bar plot. Because the x-values fall exactly on the bin boundaries, floating-point rounding can push some of them into the neighboring bin. This can be mitigated by shifting the x-values by half a bin width. The simplest approach, though, is to draw the bars directly.
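The rounding effect can be checked numerically without plotting. The sketch below applies np.histogram to the same x-values and edges as in the question, once with the values sitting on the edges and once shifted by half a bin width. The variable names are my own, chosen for illustration.

```python
import numpy as np

edges = np.linspace(0, 1, 41)            # 41 edges -> 40 bins
x_on_edges = np.array([i / 40 for i in range(40)])
x_centered = x_on_edges + 1 / 80         # shifted by half a bin width

counts_on_edges, _ = np.histogram(x_on_edges, bins=edges)
counts_centered, _ = np.histogram(x_centered, bins=edges)

# Bins that received no value, or more than one, when x sits on the edges.
empty_bins = np.flatnonzero(counts_on_edges == 0)
double_bins = np.flatnonzero(counts_on_edges > 1)
print("empty bins: ", empty_bins)
print("double bins:", double_bins)

# After the half-bin shift every bin should hold exactly one value.
print("centered OK:", bool(np.all(counts_centered == 1)))
```

Each empty bin is paired with a neighbor that holds two values, which matches the missing and displaced bars in the question's figure.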

The following code creates a bar plot from the given data, with each bar centered on the interval it represents. As a check, the height of each bar is read back at the end and displayed above it.

from matplotlib.ticker import MultipleLocator
import matplotlib.pyplot as plt
import numpy as np

scores =[8190,666,278,145,113,83,52,48,45,44,45,29,28,45,29,15,16,10,17,7,15,6,10,7,3,5,7,4,2,3,0,1,0,0,0,0,0,0,0,29]
binbounds = np.linspace(0, 1, 41)
rwidth = 0.7
width = binbounds[1] - binbounds[0]
bars = plt.bar(binbounds[:-1] + width / 2, height=scores, width=width * rwidth, align='center')
plt.gca().xaxis.set_major_locator(MultipleLocator(0.1))
plt.gca().xaxis.set_minor_locator(MultipleLocator(0.05))
plt.yscale('log', nonpositive='clip')  # this keyword was 'nonposy' before Matplotlib 3.3
for rect in bars:
    x, y = rect.get_xy()
    w = rect.get_width()
    h = rect.get_height()
    plt.text(x + w / 2, h, f'{h}\n', ha='center', va='center')
plt.show()

[Figure: resulting plot]

PS: To see what's happening with the original histogram, just do a test plot without the weights:

plt.hist([i/40 for i in range(40)], bins=np.linspace(0, 1, 41), rwidth=1, ec='k')
plt.plot([i/40 for i in range(40)], [0.5] * 40, 'ro')
plt.xticks(np.linspace(0, 1, 11))

A red dot marks each x-value. Some fall into the correct bin; others land in the neighboring bin, which then holds 2 values.

[Figure: histogram without weights]

To create a histogram with the x-values at the center of each bin:

plt.hist([i/40 + 1/80 for i in range(40)], bins=np.linspace(0, 1, 41), rwidth=1, ec='k')
plt.plot([i/40 + 1/80 for i in range(40)], [0.5] * 40, 'ro')
plt.xticks(np.linspace(0, 1, 11))
plt.yticks([0, 1])

[Figure: x-values at the bin centers]

The problem is caused by floating-point rounding error in np.linspace(0, 1, 41).

bins = []
for abin in np.linspace(0, 1, 41):
    bins.append(abin)

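The code above gives

bins = [0.0, 0.025, 0.05, 0.07500000000000001, 0.1, 0.125, 0.15000000000000002, ...]

which causes the problem: some edges are slightly larger than the corresponding x-value i/40, so that x-value falls into the previous bin.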

However, rounding the edges with np.round(np.linspace(0, 1, 41), 4) fixes the problem.

Example:

plt.hist([i/40 for i in range(40)], bins=np.round(np.linspace(0, 1, 41), 4), rwidth=1, ec='k')
plt.plot([i/40 for i in range(40)], [0.5] * 40, 'ro')
plt.xticks(np.linspace(0, 1, 11))
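The effect of rounding the edges can also be confirmed numerically. This is a minimal sketch, assuming the same x-values as in the question; it compares the raw and rounded edges against i/40 and counts how many bins end up with exactly one value in each case. The variable names are illustrative.

```python
import numpy as np

x = np.array([i / 40 for i in range(40)])

raw_edges = np.linspace(0, 1, 41)
rounded_edges = np.round(raw_edges, 4)

raw_counts, _ = np.histogram(x, bins=raw_edges)
rounded_counts, _ = np.histogram(x, bins=rounded_edges)

# Left edges that differ from the value i/40 computed directly.
mismatched = np.flatnonzero(raw_edges[:-1] != x)
print("edges differing from i/40:", mismatched)
print("bins with one value (raw edges):    ", int(np.sum(raw_counts == 1)))
print("bins with one value (rounded edges):", int(np.sum(rounded_counts == 1)))
```

Rounding works here because the edges are short decimals (multiples of 0.025), so the rounded edges coincide with the x-values. It should not be read as a general remedy for data that merely lies close to a bin edge.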

