Merging large h5 datasets

Question

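I have 8 large h5 files (about 100 GB each), and each file contains a number of different datasets (say 'x', 'y', 'z', 'h'). I'd like to merge all 8 'x' and 'y' datasets into a test.h5 and a train.h5 file. Is there a fast way to do this? In total I have 800080 rows, so I first create my train file:

train_file = h5py.File(os.path.join(base_path,'data/train.h5'),'w',libver='latest')

and, after calculating a random split, I create the datasets: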

train_file.create_dataset('x', (num_train, 256, 256, 1))
train_file.create_dataset('y',(num_train,1))

[similarly for test_file]

train_indeces = np.asarray([1]*num_train + [0]*num_test)
np.random.shuffle(train_indeces)

Then I loop over each of my 8 files and save each row to either the train or the test file.

    indeces_index = 0
    last_train_index = 0
    last_test_index = 0
    for e in files:
        print(f'FILE:  {e}')
        rnd_file = h5py.File(f'{base_path}data/{e}', 'r', libver='latest')

        for j in tqdm(range(rnd_file['x'].shape[0] )):
            if train_indeces[indeces_index]==1:
                train_file['x'][last_train_index] = rnd_file['x'][j]
                train_file['y'][last_train_index] = rnd_file['y'][j]
                last_train_index+=1
            else:
                test_file['x'][last_test_index] = rnd_file['x'][j]
                test_file['y'][last_test_index] = rnd_file['y'][j]
                last_test_index +=1

            indeces_index +=1
        rnd_file.close()

But by my calculations this would take about 12 days to run. Is there a (much) faster way to do this? Thanks in advance.

Answer 1

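If I understand your approach, it performs a separate read and write for each of the 800,080 rows. The large number of small writes is what is killing you. To improve performance, you have to reorder the I/O operations so that each read and write moves a large block of data.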
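To make the point about small writes concrete, here is a minimal sketch that writes the same array to an HDF5 file once row by row and once as a single block. The array shape, the file name demo.h5, and the two helper functions are made up for illustration; actual timings depend on the machine, the disk, and the dataset layout.

```python
import os
import tempfile
import time

import h5py
import numpy as np


def write_rows_one_by_one(dset, data):
    """One HDF5 write call per row (the pattern used in the question)."""
    for i in range(data.shape[0]):
        dset[i] = data[i]


def write_rows_as_block(dset, data):
    """A single HDF5 write call for the whole block."""
    dset[0:data.shape[0]] = data


# Small made-up array standing in for the real (N, 256, 256, 1) data.
rng = np.random.default_rng(0)
data = rng.random((2000, 16, 16, 1), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    with h5py.File(os.path.join(tmp, "demo.h5"), "w") as f:
        slow = f.create_dataset("slow", data.shape, dtype="f4")
        fast = f.create_dataset("fast", data.shape, dtype="f4")

        t0 = time.perf_counter()
        write_rows_one_by_one(slow, data)
        t1 = time.perf_counter()
        write_rows_as_block(fast, data)
        t2 = time.perf_counter()

        # Both approaches must store exactly the same values.
        same = np.array_equal(slow[:], fast[:])

print(f"row by row: {t1 - t0:.4f} s, one block: {t2 - t1:.4f} s, identical: {same}")
```

Both datasets end up with identical contents; only the number of calls into the HDF5 library differs.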

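Normally I would read an entire dataset into an array and then write it to the new file. I read through your code and see that you use train_indeces to randomly select which rows are written to train_file and which to test_file. That complicates things "a little". :-)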

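To replicate the randomness, I use np.where() to find the train and test rows. I then use NumPy "fancy indexing" to read those rows as an array (after converting the indices to a list), and write that array to the next open slots in the corresponding dataset. (I reuse your 3 counters, indeces_index, last_train_index and last_test_index, to keep track of things.)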
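The index bookkeeping can be checked without any HDF5 files. The sketch below uses only NumPy and three made-up file sizes (the file_sizes list, the split sizes, and the plan list are assumptions for illustration). It shows that slicing train_indeces per file and calling np.where() gives ascending, file-local row numbers, and that the counters end up at num_train and num_test.

```python
import numpy as np

# Hypothetical sizes standing in for the real files.
file_sizes = [5, 3, 4]            # rows per source file
num_total = sum(file_sizes)
num_train = 8
num_test = num_total - num_train

# Same construction as in the question: a shuffled 0/1 mask, 1 = train row.
rng = np.random.default_rng(42)
train_indeces = np.asarray([1] * num_train + [0] * num_test)
rng.shuffle(train_indeces)

indeces_index = 0
last_train_index = 0
last_test_index = 0
plan = []  # per file: (train rows, train offset, test rows, test offset)

for size in file_sizes:
    # The part of the global mask that belongs to this file.
    ind_arr = train_indeces[indeces_index:indeces_index + size]

    # np.where() returns a tuple; element 0 holds ascending local row numbers.
    train_idx = np.where(ind_arr == 1)[0]
    test_idx = np.where(ind_arr == 0)[0]
    plan.append((train_idx, last_train_index, test_idx, last_test_index))

    indeces_index += size
    last_train_index += len(train_idx)
    last_test_index += len(test_idx)

for k, (train_idx, train_off, test_idx, test_off) in enumerate(plan):
    print(f"file {k}: train rows {train_idx.tolist()} -> offset {train_off}, "
          f"test rows {test_idx.tolist()} -> offset {test_off}")
```

The ascending order matters because h5py expects index lists used for selection to be in increasing order.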

I think this will do what you want:
[Warning: I'm 99% sure this will work, but it has not been tested with real data.]

for e in files:
    print(f'FILE:  {e}')
    rnd_file = h5py.File(f'{base_path}data/{e}', 'r', libver='latest')
    
    rnd_size = rnd_file['x'].shape[0]
    # get an array with the next "rnd_size" indices
    ind_arr = train_indeces[indeces_index:indeces_index+rnd_size]

    # Get training data indices where index==1
    train_idx = np.where(ind_arr==1)[0]  # np.where() returns a tuple
    train_size = len(train_idx)
    
    x_train_arr = rnd_file['x'][train_idx.tolist()]
    train_file['x'][last_train_index:last_train_index+train_size] = x_train_arr
    
    y_train_arr = rnd_file['y'][train_idx.tolist()]
    train_file['y'][last_train_index:last_train_index+train_size] = y_train_arr
    
    # Get test data indices where index==0
    test_idx  = np.where(ind_arr==0)[0]   # np.where() returns a tuple
    test_size = len(test_idx)

    x_test_arr = rnd_file['x'][test_idx.tolist()]
    test_file['x'][last_test_index:last_test_index+test_size] = x_test_arr

    y_test_arr = rnd_file['y'][test_idx.tolist()]
    test_file['y'][last_test_index:last_test_index+test_size] = y_test_arr
    
    indeces_index   += rnd_size 
    last_train_index+= train_size
    last_test_index += test_size
  
    rnd_file.close()
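One caveat with the code above: it reads all selected rows of a file in one go, and with files of about 100 GB that array may not fit in memory. A possible variation, sketched below under assumptions, reads each source file in contiguous blocks and splits every block with a boolean mask in NumPy. The function copy_split_in_blocks, its block_rows parameter, and the tiny temporary files are hypothetical and exist only for this demonstration; it has not been run against the real data.

```python
import os
import tempfile

import h5py
import numpy as np


def copy_split_in_blocks(src, train_file, test_file, mask,
                         train_start, test_start, block_rows=1024):
    """Copy src['x'] and src['y'] into the train/test files block by block.

    mask holds one 0/1 flag per row of src (1 = train, 0 = test).
    Returns the next free row in the train and in the test datasets.
    """
    n_rows = src["x"].shape[0]
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        is_train = mask[start:stop] == 1
        n_train = int(is_train.sum())
        n_test = (stop - start) - n_train
        for name in ("x", "y"):
            block = src[name][start:stop]  # one contiguous read
            if n_train:
                train_file[name][train_start:train_start + n_train] = block[is_train]
            if n_test:
                test_file[name][test_start:test_start + n_test] = block[~is_train]
        train_start += n_train
        test_start += n_test
    return train_start, test_start


# Tiny made-up source files standing in for the 8 real ones.
rng = np.random.default_rng(0)
sizes = [10, 7]
num_train = 12
num_test = sum(sizes) - num_train
mask = np.asarray([1] * num_train + [0] * num_test)
rng.shuffle(mask)

with tempfile.TemporaryDirectory() as tmp:
    paths, all_x = [], []
    for k, n in enumerate(sizes):
        x = rng.random((n, 4, 4, 1), dtype=np.float32)
        y = rng.random((n, 1), dtype=np.float32)
        paths.append(os.path.join(tmp, f"part{k}.h5"))
        all_x.append(x)
        with h5py.File(paths[-1], "w") as f:
            f.create_dataset("x", data=x)
            f.create_dataset("y", data=y)

    with h5py.File(os.path.join(tmp, "train.h5"), "w") as train_file, \
         h5py.File(os.path.join(tmp, "test.h5"), "w") as test_file:
        train_file.create_dataset("x", (num_train, 4, 4, 1))
        train_file.create_dataset("y", (num_train, 1))
        test_file.create_dataset("x", (num_test, 4, 4, 1))
        test_file.create_dataset("y", (num_test, 1))

        offset = train_pos = test_pos = 0
        for path, n in zip(paths, sizes):
            with h5py.File(path, "r") as src:
                train_pos, test_pos = copy_split_in_blocks(
                    src, train_file, test_file, mask[offset:offset + n],
                    train_pos, test_pos, block_rows=4)
            offset += n

        train_x = train_file["x"][:]
        test_x = test_file["x"][:]

# The merged files should match a plain NumPy split of the concatenated data.
expected = np.concatenate(all_x)
print(np.array_equal(train_x, expected[mask == 1]),
      np.array_equal(test_x, expected[mask == 0]))
```

With this layout each source row is read once, and peak memory is bounded by block_rows rather than by the file size. A suitable block_rows for the real 256 x 256 rows would have to be found by experiment.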

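You should also consider opening the files with Python's with/as context manager. Use this:

with h5py.File(f'{base_path}data/{e}', 'r', libver='latest') as rnd_file:

instead of this:

rnd_file = h5py.File(f'{base_path}data/{e}', 'r', libver='latest')

With the context manager you don't need to call rnd_file.close(); the file is closed automatically when the with block exits.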


Solution 1
0 accepted 2021-04-08 20:17:49
