Calculating the mean of a large numpy array mapped from an hdf5 file

Question

I have a problem calculating the mean of a numpy array that is too large for RAM (~100G).

I've looked into using np.memmap, but unfortunately my array is stored as a dataset in an hdf5 file. And based on what I've tried, np.memmap doesn't accept hdf5 datasets as input:
TypeError: coercing to Unicode: need string or buffer, Dataset found
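
For context, np.memmap expects a filename or file object, which is why passing a Dataset raises this error. One possible route, sketched here under the assumption that the dataset is stored contiguously and without compression (the layout h5py uses when neither chunks nor compression is requested), is to ask h5py for the byte offset of the raw data and memory-map the file at that offset. The helper name memmap_hdf5_dataset and the small demo file are hypothetical, and chunked or compressed datasets would not work this way.

```python
import os
import tempfile

import h5py
import numpy as np


def memmap_hdf5_dataset(filepath, name):
    """Return a read-only np.memmap over a contiguous, uncompressed dataset."""
    with h5py.File(filepath, "r") as f:
        dset = f[name]
        # Chunked or compressed data is not one flat byte range on disk.
        if dset.chunks is not None or dset.compression is not None:
            raise ValueError("dataset must be contiguous and uncompressed")
        offset = dset.id.get_offset()  # byte offset of the raw data, or None
        if offset is None:
            raise ValueError("dataset has no allocated storage")
        dtype, shape = dset.dtype, dset.shape
    return np.memmap(filepath, mode="r", dtype=dtype, shape=shape, offset=offset)


# Small demo file standing in for the real ~100G dataset (assumption).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
demo = np.arange(4 * 3 * 2, dtype=np.uint8).reshape(4, 3, 2)
with h5py.File(demo_path, "w") as f:
    f.create_dataset("x", data=demo)  # no chunks/compression requested

mapped = memmap_hdf5_dataset(demo_path, "x")
mean_image = np.asarray(mapped.mean(axis=0, dtype=np.float64))
```

With a memmap in hand, np.mean can be called directly and the operating system pages data in as needed; whether that is faster than reading slices through h5py would have to be measured.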

So how can I efficiently call np.mean on a memory-mapped array from disk? Of course I could loop over parts of the dataset, where each part fits into memory.
However, this feels too much like a workaround, and I'm also not sure it would achieve the best performance.

Here's some sample code:

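import h5py
import numpy as np

data = np.random.randint(0, 255, 100000*10*10*10, dtype=np.uint8)
data = data.reshape((100000,10,10,10)) # typically a lot larger, ~100G

hdf5_file = h5py.File('data.h5', 'w')
hdf5_file.create_dataset('x', data=data, dtype='uint8')
hdf5_file.close() # flush to disk before the file is reopened for reading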

def get_mean_image(filepath):
    """
    Returns the mean_array of a dataset.
    """
    f = h5py.File(filepath, "r")
    xs_mean = np.mean(f['x'], axis=0) # memory error with large enough array

    return xs_mean

xs_mean = get_mean_image('./data.h5')

Answer 1

As hpaulj suggested in the comments, I've just split the mean calculation into multiple steps.
Here's my (simplified) code in case it is useful to someone:

import os

import h5py
import numpy as np
import psutil

def get_mean_image(filepath):
    """
    Returns the mean_image of a xs dataset.
    :param str filepath: Filepath of the data upon which the mean_image should be calculated.
    :return: ndarray xs_mean: mean_image of the x dataset.
    """
    with h5py.File(filepath, "r") as f:
        # check available memory and divide the mean calculation in steps
        total_memory = 0.5 * psutil.virtual_memory().available # In bytes. Take 1/2 of what is available, just to make sure.
        filesize = os.path.getsize(filepath)
        steps = int(np.ceil(filesize / total_memory))
        n_rows = f['x'].shape[0]
        stepsize = n_rows // steps

        # xs_mean_arr stores the intermediate mean of each step, rows_per_step its row count
        xs_mean_arr = np.zeros((steps, ) + f['x'].shape[1:], dtype=np.float64)
        rows_per_step = np.zeros(steps, dtype=np.float64)
        for i in range(steps):
            start = i * stepsize
            stop = n_rows if i == steps - 1 else (i + 1) * stepsize # the last step runs till the end of the file
            xs_mean_arr[i] = np.mean(f['x'][start:stop], axis=0, dtype=np.float64)
            rows_per_step[i] = stop - start

    # weight each step by its row count, since the last step can be longer than the others
    xs_mean = np.average(xs_mean_arr, axis=0, weights=rows_per_step).astype(np.float32)

    return xs_mean
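
A variant of the same idea, offered as a minimal sketch and not as the answer's own method: keep one running sum and divide by the total row count at the end. That avoids storing per-step means and stays exact when the last chunk has a different length. The function name chunked_mean and the chunk_rows parameter are hypothetical; the function only assumes an object with .shape and slice indexing, so an h5py dataset such as f['x'] should work in place of the stand-in numpy array used here.

```python
import numpy as np


def chunked_mean(dataset, chunk_rows):
    """Mean over axis 0 of a sliceable array-like, reading chunk_rows rows at a time."""
    n_rows = dataset.shape[0]
    total = np.zeros(dataset.shape[1:], dtype=np.float64)
    for start in range(0, n_rows, chunk_rows):
        block = dataset[start:start + chunk_rows]  # only this slice is loaded
        total += np.sum(block, axis=0, dtype=np.float64)
    return total / n_rows


# Stand-in data; 1003 rows so that the last chunk is shorter than the others.
rng = np.random.default_rng(0)
demo = rng.integers(0, 255, size=(1003, 4, 4), dtype=np.uint8)
result = chunked_mean(demo, chunk_rows=100)
```

Choosing chunk_rows is a trade-off: larger chunks mean fewer reads but more memory per step, and for a chunked HDF5 dataset a multiple of the dataset's own chunk length along axis 0 is a reasonable starting point.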

Answer 1: 2 votes, accepted, 2017-10-16 09:51:23