
Multiple side-by-side histograms with matplotlib?

I have a piece of software that has to process lots of different data sets, and the time it takes to process each one varies. As the software gets revised, the processing time changes, so I want to create a graph that shows the spread in processing time as well as outliers, because ideally this program should take about the same amount of time for each piece of data (it sounds strange and unrealistic, I know, but just roll with me here).

At first I thought about using box plots, but they seemed inadequate: it is entirely possible for half of a data set to cluster around one value and the other half around another, and I didn't feel a box plot would illustrate that well. So I decided to try a histogram, but I can't figure out how to get matplotlib to draw it the way I want. I want a single figure, with the X-axis labeled with software versions, the Y-axis showing the time taken to process a data set, and one histogram per version, like this mockup I made:

[image: mockup of the desired figure]

This graph would show that in version 0.1, most data sets were processed in 2-4 seconds, with a bunch of sets for some reason taking 12 seconds. v0.1a got rid of those long outliers, but everything took longer. 0.1b is just slightly faster than 0.1a. Finally, 0.2 shows a big speed improvement, but introduces outliers again.

How can I get matplotlib to create a plot like that?

Here is a (very) basic mockup of how this can be achieved:

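import matplotlib.pyplot as plt
import numpy as np

number_of_bins = 20
number_of_data_points = 1000

ax = plt.subplot(111)

# Three synthetic samples standing in for the timings of three versions
data_set = [np.random.normal(0, 1, number_of_data_points),
            np.random.normal(6, 1, number_of_data_points),
            np.random.normal(-3, 1, number_of_data_points)]

# x position each histogram is centered on (chosen by hand), and its tick label
MID_VALUES = [0, 200, 400]
labels = ["v1", "v2", "v3"]


for MID_VAL, y in zip(MID_VALUES, data_set):

    hist, bin_edges = np.histogram(y, bins=number_of_bins)

    # One horizontal bar per bin: lower edge and thickness come from the bin edges
    bottom = bin_edges[:-1]
    heights = np.diff(bin_edges)
    # Start each bar half its length to the left of MID_VAL so it is centered there
    lefts = MID_VAL - .5 * hist

    # align="edge" makes each bar span its bin; the current default ("center")
    # would shift every bar down by half a bin
    ax.barh(bottom, hist, height=heights, left=lefts, align="edge")

ax.set_xticks(MID_VALUES)
ax.set_xticklabels(labels)

plt.show()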

[image: output of the code above]

This lacks a lot of refinement, I admit. For example, the MID_VALUES are chosen by hand; the right spacing depends on the data set and could be automated. Nevertheless, you may be able to get it into a more usable format.
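One possible way to automate the spacing is sketched below. It is only an illustration, not part of the original answer: the function names `side_by_side_layout` and `plot_side_by_side` and the `gap` parameter are made up for this example. The sketch also bins every data set over one shared range (the code above bins each set separately), so that bar lengths are comparable and the tallest bin can set the distance between histogram centers.

```python
import numpy as np


def side_by_side_layout(data_sets, number_of_bins=20, gap=0.1):
    """Return shared bin edges, per-set counts, and an x center for each set.

    Hypothetical helper: `gap` is extra spacing as a fraction of the longest bar.
    """
    # One common range so the bins line up across all data sets
    lo = min(np.min(d) for d in data_sets)
    hi = max(np.max(d) for d in data_sets)
    bin_edges = np.linspace(lo, hi, number_of_bins + 1)
    counts = [np.histogram(d, bins=bin_edges)[0] for d in data_sets]
    # The longest bar over all sets, plus the gap, is the center-to-center spacing
    spacing = max(c.max() for c in counts) * (1 + gap)
    centers = np.arange(len(data_sets)) * spacing
    return bin_edges, counts, centers


def plot_side_by_side(ax, data_sets, labels, number_of_bins=20, gap=0.1):
    """Draw one centered horizontal histogram per data set on the Axes `ax`."""
    bin_edges, counts, centers = side_by_side_layout(data_sets, number_of_bins, gap)
    heights = np.diff(bin_edges)
    for center, hist in zip(centers, counts):
        # Start each bar half its length to the left of the center
        ax.barh(bin_edges[:-1], hist, height=heights,
                left=center - 0.5 * hist, align="edge")
    ax.set_xticks(centers)
    ax.set_xticklabels(labels)
    return centers
```

For example, after `import matplotlib.pyplot as plt` and `fig, ax = plt.subplots()`, calling `plot_side_by_side(ax, data_set, labels)` followed by `plt.show()` should give a figure like the one above without a hand-picked `MID_VALUES`. Because each bar extends at most half the longest bar to either side of its center, neighboring histograms cannot overlap for any non-negative `gap`.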
