How to save variable-sized data to H5PY file?

The data set that I am using is too large to fit into memory for computation. To get around this, I do the computations in batches and save each batch's results back to a file.

The problem is that my last batch is not saved to my H5py file, almost certainly because the final batch size differs from all the previous ones. Is there any way I can make the chunks more flexible?

Consider the following MWE:

import h5py
import numpy as np
import pandas as pd
from more_itertools import chunked  # chunked() is provided by the more_itertools package

df = pd.DataFrame({'data': np.random.random(size=113)})
chunk_size = 10
index_chunks = chunked(df.index, chunk_size)

with h5py.File('SO.h5', 'w') as f:
    dset = f.create_dataset('test', shape=(len(df), ), maxshape=(None, ), chunks=True, dtype=np.float32)

    for step, i in enumerate(index_chunks):
        temp_df = df.iloc[i]
        dset = f['test']
        start = step*len(i)
        dset[start:start+len(i)] = temp_df['data']
        dset.attrs['last_index'] = (step+1)*len(i)
# check data
with h5py.File('SO.h5', 'r') as f:
    print('last entry:', f['test'][-10::])  # yields 3 empty values because it did not match the usual batch size

Your indexing is wrong. step, i goes like this:

 0,   0 ...   9
 1,  10 ...  19
 2,  20 ...  29
...
 9,  90 ...  99
10, 100 ... 109
11, 110 ... 112

For step == 11, len(i) == 3. That makes start = step * len(i) evaluate to 11 * 3 == 33, while you're expecting 11 * 10 == 110. You're simply writing to the wrong location. If you inspect the fourth chunk (indices 30 to 39), you will likely find that its fourth, fifth and sixth elements (indices 33 to 35) have been overwritten by the data from the last batch.
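To make the arithmetic concrete, here is a small sketch in plain Python that reproduces the start offsets for 113 rows in batches of 10. The helper name start_offsets is hypothetical and not part of the question's code. It only mirrors the step * len(i) computation and compares it with the intended step * chunk_size.

```python
def start_offsets(total, chunk_size):
    """Return (step, computed_start, intended_start) for every batch.

    computed_start mirrors the question's `step * len(i)`;
    intended_start is `step * chunk_size`.
    """
    rows = []
    for step, begin in enumerate(range(0, total, chunk_size)):
        batch_len = min(chunk_size, total - begin)  # the last batch may be shorter
        rows.append((step, step * batch_len, step * chunk_size))
    return rows

# Show the last two batches: the two offsets only disagree for the short final batch.
for step, computed, intended in start_offsets(113, 10)[-2:]:
    print(step, computed, intended)
# 10 100 100
# 11 33 110
```

The two columns agree for every full batch, which is why the problem only shows up at the very end.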

Here is a possible workaround:

last = 0
for step, i in enumerate(index_chunks):
    temp_df = df.iloc[i]
    dset = f['test']
    first = last
    last = first + len(i)
    dset[first:last] = temp_df['data']
    dset.attrs['last_index'] = last
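For completeness, here is a self-contained sketch of the same running-offset idea that can be run and checked end to end. It makes a few assumptions that are not in the original post: the helper write_in_batches is hypothetical, and the batches are produced with plain slicing instead of more_itertools and pandas. The file is also opened with h5py's in-memory core driver (backing_store=False), so nothing is written to disk.

```python
import h5py
import numpy as np

def write_in_batches(data, batch_size):
    """Write `data` into an HDF5 dataset batch by batch; return the stored array and last index."""
    # Assumption: an in-memory file (core driver, no backing store) is enough for this demo.
    with h5py.File('sketch.h5', 'w', driver='core', backing_store=False) as f:
        dset = f.create_dataset('test', shape=(len(data),), maxshape=(None,),
                                chunks=True, dtype=np.float32)
        last = 0
        for begin in range(0, len(data), batch_size):
            batch = data[begin:begin + batch_size]  # the final batch may be shorter
            first = last
            last = first + len(batch)               # running offset, independent of batch size
            dset[first:last] = batch
            dset.attrs['last_index'] = last
        # Read everything back before the file is closed.
        return dset[:], int(dset.attrs['last_index'])

data = np.random.random(size=113).astype(np.float32)
stored, last_index = write_in_batches(data, batch_size=10)
print(last_index)                    # 113
print(np.array_equal(stored, data))  # True
```

Because the offset is accumulated from the actual batch lengths, the same loop works whether or not the total length is a multiple of the batch size.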
