
Dask: Read hdf5 and write to other hdf5 file

I am working with an hdf5 file that is larger than memory. Therefore, I'm trying to use dask to modify it. My goal is to load the file, do some modifications (not necessarily preserving the shape), and save it to some other file. I create my file with:

import h5py as h5
import numpy as np

source_file = "source.hdf5"
x = np.zeros((3, 3))  # In practice, x will be larger than memory
with h5.File(source_file, "w") as f:
    f.create_dataset("/x", data=x, compression="gzip")

Then, I use the following code to load, modify, and save it.

from dask import array as da
import h5py as h5
from dask.distributed import Client


if __name__ == "__main__":
    dask_client = Client(n_workers=1)  # No need to parallelize; just interested in dask for memory purposes

    source_file = "source.hdf5"
    temp_filename = "target.hdf5"

    # Load the dataset as a dask array
    f = h5.File(source_file, "r")
    x_da = da.from_array(f["/x"])

    # Do some modifications
    x_da = x_da * 2

    # Save to target
    x_da.to_hdf5(temp_filename, "/x", compression="gzip")

    # Close original file
    f.close()

However, this gives the following error:

TypeError: ('Could not serialize object of type Dataset.', '<HDF5 dataset "x": shape (3, 3), type "<f8">')
distributed.comm.utils - ERROR - ('Could not serialize object of type Dataset.', '<HDF5 dataset "x": shape (3, 3), type "<f8">')

Am I doing something wrong, or is this simply not possible? And if so, is there some workaround?

Thanks in advance!

For anyone interested, I created a workaround which simply calls compute() on each block and writes the result into the target dataset. As far as I can tell, the original error occurs because the distributed scheduler has to pickle the open h5py Dataset to ship tasks to its workers, and h5py objects cannot be serialized that way. Just sharing it, although I'm still interested in a better solution.

from itertools import product

import h5py as h5


def to_hdf5(x, filename, datapath):
    """
    Writes a dask array to an hdf5 file, one block at a time.
    """
    with h5.File(filename, "a") as f:
        dset = f.require_dataset(datapath, shape=x.shape, dtype=x.dtype)

        # Iterate over all block indices, e.g. (0, 0), (0, 1), ...
        for block_ids in product(*[range(num) for num in x.numblocks]):
            # Offset of this block along each dimension
            pos = [sum(x.chunks[dim][0 : block_ids[dim]]) for dim in range(len(block_ids))]
            block = x.blocks[block_ids]
            slices = tuple(slice(pos[i], pos[i] + block.shape[i]) for i in range(len(block_ids)))
            # Compute only this block, so at most one block is held in memory
            dset[slices] = block.compute()
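
For completeness, here is a minimal usage sketch of the helper (my own illustration, not part of the original code), mirroring the script from the question. Run it with dask's default threaded scheduler, i.e. without creating a distributed Client, so the h5py objects never have to be pickled:

import h5py as h5
from dask import array as da

with h5.File("source.hdf5", "r") as f:
    x_da = da.from_array(f["/x"])
    x_da = x_da * 2  # same modification as above
    to_hdf5(x_da, "target.hdf5", "/x")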
