How to manage large h5 files

I have been encountering this issue for a while. I have an h5 file with a dataset "ds", which consists of a 170000x70000 matrix of float16 values. Evidently, I cannot load all of it at once, but I do not need to. I only need to work with vectors of shape 170000x1 and filter out the elements that fulfill a condition. However, I am finding this to be very inefficient. I have tried to load the data in chunks, as in the example below, but this operation takes several minutes.

import h5py
import numpy as np
from tqdm import tqdm

f = h5py.File(myfile, "r")
ds = f["ds"]
j = 1000  # index of the column to filter
for i in tqdm(range(0, 170000, 100)):
    # read a 100-element slice of column j
    chunk = ds[i:i+100, j]
    filtered_chunk = np.where(chunk > 0)

Have any of you encountered this issue before? How can I tackle this?

This is how you read vectors of 170000x1. I think this addresses your question.

import h5py

f = h5py.File(myfile, "r")
ds = f["ds"]
for cnt in range(ds.shape[1]):
    # read column cnt as a numpy array of shape (170000,)
    arr = ds[:, cnt]
    # ...do something with arr
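Whether this loop is fast depends on how "ds" is laid out on disk. HDF5 performs I/O one chunk at a time, so if the dataset is stored contiguously or chunked along rows, every ds[:, cnt] read has to touch the full width of the matrix, which would explain the several-minute timings. Below is a minimal sketch of checking the layout and, if it is row-oriented, doing the filtering in a single pass over row blocks instead; the >0 condition and the column list cols are hypothetical placeholders.

import h5py
import numpy as np

with h5py.File(myfile, "r") as f:
    ds = f["ds"]
    # None means contiguous storage; a tuple such as (512, 70000)
    # means row-oriented chunks -- both make single-column reads expensive
    print(ds.chunks)

    cols = [10, 20, 30]          # hypothetical columns of interest
    hits = {j: [] for j in cols}
    rstep = 1000                 # rows per block (~140 MB of float16 here)
    for i in range(0, ds.shape[0], rstep):
        block = ds[i:i+rstep, :]             # one cheap, contiguous read
        for j in cols:
            # global row indices in column j that satisfy the condition
            hits[j].append(np.where(block[:, j] > 0)[0] + i)

    result = {j: np.concatenate(hits[j]) for j in cols}

If you instead need fast repeated access to individual columns, another option is to rewrite the file once with column-shaped chunks, for example chunks=(170000, 1) in create_dataset, or with the h5repack command-line tool; after that, each ds[:, cnt] read in the loop above touches a single chunk.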
