
Fastest way to Append Large Pytables HDF5 Files

I use multiprocessing to generate numerous really large PyTables (H5) files -- large enough to cause memory issues if read in a single sweep. Each of these files is created with tb.create_table and has 3 columns of mixed datatypes: the first two columns are integers, the third holds floats (such as here). The total number of rows can differ from file to file.
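For reference, a minimal sketch of how one of these per-process files might be written; the column names, types, and file name here are assumptions for illustration, not taken from my actual code:

import tables as tb

# hypothetical row description: two integer columns and one float column
class Row(tb.IsDescription):
    col_a = tb.Int64Col()
    col_b = tb.Int64Col()
    col_c = tb.Float64Col()

with tb.open_file('output/part_0.h5', 'w') as h5f:
    tbl = h5f.create_table(h5f.root, 'dataset_1', Row)
    # rows can be appended as a list of tuples (or a structured array)
    tbl.append([(i, 2 * i, float(i)) for i in range(1000)])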

I want to combine these H5 files into a single H5 file; each of the separate H5 files has a dataset_1 that needs to be appended to a single dataset in the new H5 file.

I modified the answer given here. In my case, I read/append each file/dataset in chunks to the combined H5 file. Is there a computationally faster (or cleaner) way to do this job?

The minimal working code and sample output are below, where I fetch the H5 files from the output/ directory:

import os
import numpy as np
import tables as tb

# no. of rows to read per chunk
factor = 10**7

# gather files to combine
file_lst = []
for fl in os.listdir('output/'):
    if not fl.startswith('combined'):
        file_lst.append(fl)

# combined file name
file_cmb = tb.open_file('output/combined.h5', 'w')
# copy file-1 dataset to new file
file1 = tb.open_file(f'output/{file_lst[0]}', 'r')
z = file1.copy_node('/', name='dataset_1', newparent=file_cmb.root, newname='dataset_1')
print(f'File-0 shape: {file1.root.dataset_1.shape[0]}')

for file_idx in range(len(file_lst)):
    if file_idx>0:
        file2 = tb.open_file(f'output/{file_lst[file_idx]}', 'r')
        file2_dset = file2.root.dataset_1
        shape = file2_dset.shape[0]
        print(f'File-{file_idx} shape: {shape}')

        # determine number of chunks_loops to read entire file2
        if shape<factor:
            chunk_loop = 1
        else:
            chunk_loop = shape//factor

        size_int = shape//chunk_loop
        size_arr = np.repeat(size_int,chunk_loop)

        if shape%chunk_loop:
            last_size = shape % chunk_loop
            size_arr = np.append(size_arr, last_size)
            chunk_loop += 1

        chunk_start = 0
        chunk_end = 0

        for alpha in range(size_arr.shape[0]):
            chunk_end = chunk_end + size_arr[alpha]
            z.append(file2_dset[chunk_start:chunk_end])
            chunk_start = chunk_start + size_arr[alpha]
        file2.close()

print(f'Combined file shape: {z.shape}')
file1.close()
file_cmb.close()

Sample output:

File-0 shape: 787552
File-1 shape: 56743654
File-2 shape: 56743654
File-3 shape: 56743654
Combined file shape: (171018514,)

You have the right idea. I prefer context managers for file handling, and the logic to loop and make incremental copies was hard to follow (also, you don't need arrays - you can do the calculations on the fly). I took a stab at refactoring. However, without the input files I couldn't debug, so there may be minor errors.

import os
import tables as tb

# no. of rows to read per chunk
factor = 10**7

# gather files to combine
file_lst = []
for fl in os.listdir('output/'):
    if not fl.startswith('combined'):
        file_lst.append(fl)

# combined file name
with tb.File('output/combined.h5', 'w') as file_cmb:
    for file_idx, filename in enumerate(file_lst):
        if file_idx == 0:
            # copy the first file's dataset to the new file
            with tb.File(f'output/{filename}', 'r') as file1:
                z = file1.copy_node('/', name='dataset_1', newparent=file_cmb.root, newname='dataset_1')
                print(f'File1-{filename} shape: {file1.root.dataset_1.shape[0]}')
        
        else:
            with tb.File(f'output/{filename}', 'r') as file2:
                file2_dset = file2.root.dataset_1
                shape = file2_dset.shape[0]
                print(f'File2-{filename} shape: {shape}')
        
                chunk_loops = shape//factor
                if shape > chunk_loops*factor:
                    chunk_loops += 1
                
                chunk_start, chunk_end = 0, 0
                for alpha in range(chunk_loops):                   
                    if chunk_start + factor > shape:
                        chunk_end = shape
                    else:
                        chunk_end = chunk_start + factor
                        
                    z.append(file2_dset[chunk_start:chunk_end])
                    chunk_start = chunk_end
                       
    print(f'Combined file shape: {z.shape}')
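If the goal is mainly a cleaner loop, the chunk bookkeeping can also be folded into range(0, shape, factor), so no loop count or leftover size has to be tracked at all. A small helper along those lines (append_in_chunks is just an illustrative name, and like the code above it is untested without the input files):

def append_in_chunks(dest_table, src_table, factor=10**7):
    """Append src_table's rows to dest_table in blocks of at most `factor` rows."""
    shape = src_table.shape[0]
    for chunk_start in range(0, shape, factor):
        chunk_end = min(chunk_start + factor, shape)
        dest_table.append(src_table.read(chunk_start, chunk_end))

The inner loop above could then be replaced with a single call such as append_in_chunks(z, file2_dset, factor).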
