How do I save variable-sized data to an h5py file?

Question

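The dataset I am working with is too large to fit into memory for computation. To get around this, I run the computation in batches and save the results back to a file.

My problem is that the last batch is not saved to my h5py file, almost certainly because the final batch has a different size from all the previous ones. Is there a way to make the chunks more flexible?

Consider the following MWE: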

import h5py
import numpy as np
import pandas as pd
from more_itertools import chunked

df = pd.DataFrame({'data': np.random.random(size=113)})
chunk_size = 10
index_chunks = chunked(df.index, chunk_size)

with h5py.File('SO.h5', 'w') as f:
    dset = f.create_dataset('test', shape=(len(df), ), maxshape=(None, ), chunks=True, dtype=np.float32)

    for step, i in enumerate(index_chunks):
        temp_df = df.iloc[i]
        dset = f['test']
        start = step*len(i)
        dset[start:start+len(i)] = temp_df['data']
        dset.attrs['last_index'] = (step+1)*len(i)

# check data
with h5py.File('SO.h5', 'r') as f:
    print('last entry:', f['test'][-10::])  # yields 3 empty values because it did not match the usual batch size

Answer 1

Your indexing is wrong. The values of step and i look like this:

 0,   0 ...   9
 1,  10 ...  19
 2,  20 ...  29
...
 9,  90 ...  99
10, 100 ... 109
11, 110 ... 112

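For step == 11, len(i) == 3. That makes start = step * len(i) equal to 11 * 3 == 33, whereas you expect 11 * 10 == 110. You are simply writing to the wrong location. If you inspect the data in the fourth chunk, you will probably find that its fourth, fifth and sixth elements have been overwritten by the missing data.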
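To make the arithmetic concrete, here is a minimal sketch in plain Python that needs neither h5py nor pandas. The helper chunk_lengths and the variable names are hypothetical and only mirror the sizes in the question (113 values, batches of 10). It compares the offsets produced by step * len(i) with offsets tracked as a running position.

```python
def chunk_lengths(total, chunk_size):
    """Lengths of consecutive chunks covering `total` items (hypothetical helper)."""
    full, rest = divmod(total, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])

lengths = chunk_lengths(113, 10)  # eleven chunks of 10, then one chunk of 3

# Offsets as computed in the question: step * len(chunk).
buggy_starts = [step * n for step, n in enumerate(lengths)]

# Running offsets: each chunk starts where the previous one ended.
running_starts = []
position = 0
for n in lengths:
    running_starts.append(position)
    position += n

print(buggy_starts[-1])    # 33
print(running_starts[-1])  # 110
```

The two lists agree for every full-sized chunk and differ only for the short final one, which is why the problem shows up only at the end of the file.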

Here is one possible fix (the loop goes inside the with h5py.File(...) block from the question):

last = 0
for step, i in enumerate(index_chunks):
    temp_df = df.iloc[i]
    dset = f['test']
    first = last
    last = first + len(i)
    dset[first:last] = temp_df['data']
    dset.attrs['last_index'] = last
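For reference, the same idea can be put together as a self-contained sketch. It assumes a plain NumPy array in place of the DataFrame and slices it directly instead of using chunked. The file name sketch.h5 and the in-memory core driver are choices made for this illustration only, not part of the original question.

```python
import h5py
import numpy as np

data = np.random.random(size=113).astype(np.float32)
chunk_size = 10

# driver='core' with backing_store=False keeps the file in memory only,
# so nothing is written to disk (an assumption made for this sketch).
with h5py.File('sketch.h5', 'w', driver='core', backing_store=False) as f:
    dset = f.create_dataset('test', shape=(len(data),), maxshape=(None,),
                            chunks=True, dtype=np.float32)

    for first in range(0, len(data), chunk_size):
        batch = data[first:first + chunk_size]  # the final batch holds only 3 values
        last = first + len(batch)               # end offset follows the real batch length
        dset[first:last] = batch
        dset.attrs['last_index'] = last

    stored = dset[:]
    last_index = int(dset.attrs['last_index'])

print(np.array_equal(stored, data))  # True
print(last_index)                    # 113
```

Because the write position comes from the actual length of each batch, the short final batch lands at offsets 110 to 112 and no change to the HDF5 chunk layout is needed.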
