Dask: Read hdf5 and write to other hdf5 file
I am working with an hdf5 file that is larger than memory, so I am trying to use dask to modify it. My goal is to load the file, make some modifications (not necessarily preserving the shape), and save the result to another file. I create my file like this:
import h5py as h5
import numpy as np
source_file = "source.hdf5"
x = np.zeros((3, 3)) # In practice, x will be larger than memory
with h5.File(source_file, "w") as f:
    f.create_dataset("/x", data=x, compression="gzip")
Then I use the code below to load, modify and save it:
from dask import array as da
import h5py as h5
from dask.distributed import Client

if __name__ == "__main__":
    dask_client = Client(n_workers=1)  # No need to parallelize, just interested in dask for memory-purposes

    source_file = "source.hdf5"
    temp_filename = "target.hdf5"

    # Load dataframe
    f = h5.File(source_file, "r")
    x_da = da.from_array(f["/x"])

    # Do some modifications
    x_da = x_da * 2

    # Save to target
    x_da.to_hdf5(temp_filename, "/x", compression="gzip")

    # Close original file
    f.close()
However, this produces the following error:
TypeError: ('Could not serialize object of type Dataset.', '<HDF5 dataset "x": shape (3, 3), type "<f8">')
distributed.comm.utils - ERROR - ('Could not serialize object of type Dataset.', '<HDF5 dataset "x": shape (3, 3), type "<f8">')
What am I doing wrong, or is this simply not possible? And if so, is there some workaround?
Thanks in advance!
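The error comes from the distributed `Client`: tasks are shipped to worker processes, and an open h5py `Dataset` handle cannot be pickled. One way around it, sketched below under the assumption that memory (not parallelism) is the concern, is to skip the `Client` entirely and stream the result with `da.store` under dask's default threaded scheduler, which shares memory within one process and never serializes the h5py objects:

```python
import h5py as h5
import numpy as np
from dask import array as da

# Create a small source file (stands in for the larger-than-memory one)
with h5.File("source.hdf5", "w") as f:
    f.create_dataset("/x", data=np.ones((3, 3)), compression="gzip")

# Keep both files open in a single process; the threaded scheduler
# shares memory, so the h5py Dataset never needs to be serialized.
with h5.File("source.hdf5", "r") as src, h5.File("target.hdf5", "w") as tgt:
    x = da.from_array(src["/x"], chunks=(1, 3))
    y = x * 2
    dset = tgt.create_dataset("/x", shape=y.shape, dtype=y.dtype,
                              compression="gzip")
    da.store(y, dset)  # writes chunk by chunk, never materializing y fully

with h5.File("target.hdf5", "r") as f:
    print(f["/x"][...])
```

Because `da.store` writes each chunk into the target dataset as soon as it is computed, peak memory stays on the order of one chunk rather than the whole array.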
For anyone interested, I created a workaround that simply calls compute() on each block. Just sharing it, although I am still interested in a better solution.
import h5py as h5
from itertools import product

def to_hdf5(x, filename, datapath):
    """
    Appends a dask array to an hdf5 file, computing one block at a time.
    """
    with h5.File(filename, "a") as f:
        dset = f.require_dataset(datapath, shape=x.shape, dtype=x.dtype)
        for block_ids in product(*[range(num) for num in x.numblocks]):
            # Offset of this block along each dimension
            pos = [sum(x.chunks[dim][0:block_ids[dim]]) for dim in range(len(block_ids))]
            block = x.blocks[block_ids]
            slices = tuple(slice(pos[i], pos[i] + block.shape[i]) for i in range(len(block_ids)))
            dset[slices] = block.compute()
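A usage sketch of the helper, self-contained so it runs on its own (the helper is repeated, and the target filename `target_blocks.hdf5` is just an illustrative choice). Only one block is ever in memory, since compute() is called per block:

```python
import h5py as h5
import numpy as np
from dask import array as da
from itertools import product

def to_hdf5(x, filename, datapath):
    """Helper from above, repeated so this sketch is self-contained."""
    with h5.File(filename, "a") as f:
        dset = f.require_dataset(datapath, shape=x.shape, dtype=x.dtype)
        for block_ids in product(*[range(num) for num in x.numblocks]):
            pos = [sum(x.chunks[dim][0:block_ids[dim]]) for dim in range(len(block_ids))]
            block = x.blocks[block_ids]
            slices = tuple(slice(pos[i], pos[i] + block.shape[i]) for i in range(len(block_ids)))
            dset[slices] = block.compute()

# Build a small source file, then write the doubled array block by block
with h5.File("source.hdf5", "w") as f:
    f.create_dataset("/x", data=np.arange(16.0).reshape(4, 4), compression="gzip")

with h5.File("source.hdf5", "r") as f:
    x_da = da.from_array(f["/x"], chunks=(2, 2))
    to_hdf5(x_da * 2, "target_blocks.hdf5", "/x")

with h5.File("target_blocks.hdf5", "r") as f:
    print(f["/x"][...])
```

Note that this loops over blocks sequentially, so it trades away dask's parallelism for a small, predictable memory footprint.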