Efficiently create a 2D histogram from a large dataset

Question

I want to create 2D histograms in Python from large datasets (100,000+ samples) stored in an HDF5 file. I came up with the following code:

import sys
import h5py
import numpy as np
import matplotlib as mpl
import matplotlib.pylab

f = h5py.File(sys.argv[1], 'r')

A = f['A']
T = f['T']

at_hist, xedges, yedges = np.histogram2d(T, A, bins=500)
extent = [yedges[0], yedges[-1], xedges[0], xedges[-1]]

fig = mpl.pylab.figure()
at_plot = fig.add_subplot(111)

at_plot.imshow(at_hist, extent=extent, origin='lower', aspect='auto')

mpl.pylab.show()

f.close()

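It takes about 15 seconds to execute (for 100,000 data points). CERN's ROOT (which uses its own tree data structure instead of HDF5), however, can do this in less than 1 second. Do you have any idea how I could speed up the code? I could also change the structure of the HDF5 data if that would help.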

Answer 1

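I would try a few different things.

1. Load your data from the HDF file into memory instead of passing in what are effectively memory-mapped arrays.
2. If that doesn't fix the problem, you can exploit a scipy.sparse.coo_matrix to make the 2D histogram. With older versions of numpy, digitize (which all of the various histogram* functions use internally) could use excessive memory under some circumstances. That is no longer the case with recent (> 1.5??) versions of numpy, though.

As an example of the first suggestion, you could do something like this: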

f = h5py.File(sys.argv[1], 'r')
A = np.empty(f['A'].shape, f['A'].dtype)
T = np.empty(f['T'].shape, f['T'].dtype)
f['A'].read_direct(A)
f['T'].read_direct(T)

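The difference here is that the arrays are read into memory in their entirety, instead of remaining h5py's array-like objects, which are essentially efficient memory-mapped arrays stored on disk.

As for the second suggestion, don't try it unless the first suggestion doesn't solve your problem.

It probably won't be significantly faster (and is likely slower for small arrays), and with recent versions of numpy it is only slightly more memory-efficient. I do have a piece of code where I deliberately do this, but I wouldn't recommend it in general. It's a very hackish solution. In very select circumstances (many points and many bins), however, it can perform better than histogram2d.

All those caveats aside, here it is: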

import numpy as np
import scipy.sparse
import timeit

def generate_data(num):
    x = np.random.random(num)
    y = np.random.random(num)
    return x, y

def crazy_histogram2d(x, y, bins=10):
    # bins may be a single int or an (nx, ny) pair
    try:
        nx, ny = bins
    except TypeError:
        nx = ny = bins
    xmin, xmax = x.min(), x.max()
    ymin, ymax = y.min(), y.max()
    dx = (xmax - xmin) / (nx - 1.0)
    dy = (ymax - ymin) / (ny - 1.0)

    weights = np.ones(x.size)

    # Basically, this is just doing what np.digitize does with one less copy
    xyi = np.vstack((x, y)).T
    xyi -= [xmin, ymin]
    xyi /= [dx, dy]
    # Floor in place, then cast to int so the values are valid (row, col) indices
    xyi = np.floor(xyi, xyi).T.astype(int)

    # Now, we'll exploit a sparse coo_matrix to build the 2D histogram...
    # (entries with the same (row, col) index are summed by toarray())
    grid = scipy.sparse.coo_matrix((weights, xyi), shape=(nx, ny)).toarray()

    return grid, np.linspace(xmin, xmax, nx), np.linspace(ymin, ymax, ny)

if __name__ == '__main__':
    num = 10**6  # must be an int: np.random.random rejects a float size
    numruns = 10
    x, y = generate_data(num)
    t1 = timeit.timeit('crazy_histogram2d(x, y, bins=500)',
            setup='from __main__ import crazy_histogram2d, x, y',
            number=numruns)
    t2 = timeit.timeit('np.histogram2d(x, y, bins=500)',
            setup='from __main__ import np, x, y',
            number=numruns)
    print('Average of %i runs, using %.1e points' % (numruns, num))
    print('Crazy histogram', t1 / numruns, 'sec')
    print('numpy.histogram2d', t2 / numruns, 'sec')

On my system, this yields:

Average of 10 runs, using 1.0e+06 points
Crazy histogram 0.104092288017 sec
numpy.histogram2d 0.686891794205 sec
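
The trick relies on how scipy.sparse.coo_matrix treats repeated coordinates: when the matrix is converted to a dense array, entries that share the same (row, column) index are summed, which amounts to a bin count when every weight is 1. A minimal sketch on a toy 2 x 3 grid (the index values are made up for illustration and are not taken from the benchmark above):

```python
import numpy as np
import scipy.sparse

# Toy bin indices for five points on a 2 x 3 grid (illustrative values only).
rows = np.array([0, 0, 1, 1, 1])
cols = np.array([2, 2, 0, 1, 1])
weights = np.ones(rows.size)

# Duplicate (row, col) pairs are summed when the COO matrix is densified,
# so each cell ends up holding the number of points that fell into it.
counts = scipy.sparse.coo_matrix((weights, (rows, cols)), shape=(2, 3)).toarray()
print(counts)
```

Here the cell (0, 2) receives two points, (1, 0) one and (1, 1) two; all other cells stay at zero.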

Answer 2

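You need to figure out whether the bottleneck is in the data loading or in histogram2d. Try inserting some timing measurements into your code.

Are A and T arrays, or are they generator objects? If the latter, more care is needed to tell where the bottleneck is; you might have to unpack them into numpy arrays first in order to test.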
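
One way to take such measurements is sketched below with time.perf_counter from the standard library. The helper name `timed` and the random stand-in arrays are assumptions for illustration; in the real script the "load" step would be the read from the HDF5 file. np.asarray forces array-like inputs into in-memory numpy arrays, so the second timing covers only the histogram.

```python
import time
import numpy as np

def timed(label, func):
    """Run func(), print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result

# Stand-in data (assumption): replace with the actual datasets from the file.
rng = np.random.default_rng(0)
raw_T = rng.random(100_000)
raw_A = rng.random(100_000)

# Step 1: materialize the inputs as in-memory numpy arrays.
T, A = timed("load", lambda: (np.asarray(raw_T), np.asarray(raw_A)))

# Step 2: time the histogram on its own.
hist, xedges, yedges = timed("histogram2d",
                             lambda: np.histogram2d(T, A, bins=500))
```

If the "load" step dominates, the problem lies in how the data is read; if "histogram2d" dominates, the binning itself is the place to optimize.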

Answer 3

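Is the whole thing running in 15 s, or just the call to histogram2d? Importing the pylab family can take plenty of time. The numpy histogram2d function should be implemented in C, if I recall correctly, so I doubt there are performance issues there. You can speed up Python by calling your script with the optimization flag -OO: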

python -OO script.py

Also consider using Psyco to improve performance.

3 answers

Answer 1
14, accepted, 2012-01-10 15:42:05

Answer 2
3, 2012-01-10 15:33:06

Answer 3
2, 2012-01-10 15:36:38
