How to write large multiple arrays to a h5 file in layers?

Suppose I have 10000 systems. For each system I have 2 datasets, and each dataset has x, y and y_err arrays. How can I put the data for all the systems into an h5 file, using either h5py or pandas? A detailed description is given below.

import numpy as np

Systems=np.arange(10000)

for sys in Systems:
    x1,y1,y1_err=np.random.rand(100),np.random.rand(100),np.random.rand(100)
    x2,y2,y2_err=np.random.rand(200),np.random.rand(200),np.random.rand(200)

I want to put x1, y1, y1_err, x2, y2, y2_err for all the systems into an h5 file in a structured manner.

Sorry, this might be a very elementary task, but I am really struggling.

I think this should work:

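import numpy as np
import pandas as pd

Systems=np.arange(10000)
frames = []

for i, sys in enumerate(Systems):
    x1,y1,y1_err=np.random.rand(100),np.random.rand(100),np.random.rand(100)
    x2,y2,y2_err=np.random.rand(200),np.random.rand(200),np.random.rand(200)
    # the x1 columns hold 100 values and the x2 columns 200, so pandas pads the shorter ones with NaN
    temp = (pd.DataFrame([x1,y1,y1_err,x2,y2,y2_err], index=['x1','y1','y1_err','x2','y2','y2_err'])).transpose()
    temp["system"] = i
    frames.append(temp)

# concatenate once at the end; pd.concat inside the loop copies the growing frame on every pass
df = pd.concat(frames, ignore_index=True)

# to_hdf needs the PyTables package to be installed
df.to_hdf('data.h5', key='key')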

Two other ways to create HDF5 files are the h5py and PyTables packages. They are similar, but each has unique strengths. The thing I like about both: when you open the HDF5 file with HDFView, you can view the data in a simple table layout (like a spreadsheet).

I wrote an example for each. Only two calls differ: 1) creating groups with create_group(), and 2) creating datasets with h5py create_dataset() vs. PyTables create_table(). Both use a NumPy structured array to name the data columns (i.e., x1, y1, y1_err). The process is slightly simpler if you don't want to name the columns and all the data is the same type (e.g., all floats or all ints).

Here is the process for h5py:

import h5py
import numpy as np

table1_dt = np.dtype([('x1',float), ('y1',float), ('y1_err',float),])
table2_dt = np.dtype([('x2',float), ('y2',float), ('y2_err',float),])

Systems=np.arange(10_000)

with h5py.File('SO_71335363.h5','w') as h5f:
    
    for sys in Systems:
        grp = h5f.create_group(f'System_{sys:05}')
        x1,y1,y1_err=np.random.rand(100),np.random.rand(100),np.random.rand(100)
        t1_arr = np.empty(dtype=table1_dt,shape=(x1.shape[0],))
        t1_arr['x1'] = x1
        t1_arr['y1'] = y1
        t1_arr['y1_err'] = y1_err       
        grp.create_dataset('table1',data=t1_arr)
        
        x2,y2,y2_err=np.random.rand(200),np.random.rand(200),np.random.rand(200)
        t2_arr = np.empty(dtype=table2_dt,shape=(x2.shape[0],))
        t2_arr['x2'] = x2
        t2_arr['y2'] = y2
        t2_arr['y2_err'] = y2_err       
        grp.create_dataset('table2',data=t2_arr)
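
For the simpler case mentioned above, where the columns are unnamed and every value has the same type, the structured dtype can be dropped and each dataset written as a plain 2D float array. This is only a sketch: the dataset names data1/data2, the "columns" attribute, the helper write_plain_arrays, and the reduced system count are my own assumptions, chosen so the example runs quickly.

```python
import os
import tempfile

import h5py
import numpy as np


def write_plain_arrays(path, n_systems=5):
    """Write one group per system, each holding two plain 2D float arrays."""
    with h5py.File(path, "w") as h5f:
        for sys_id in range(n_systems):
            grp = h5f.create_group(f"System_{sys_id:05}")
            # Stack x, y, y_err as columns: shapes (100, 3) and (200, 3).
            data1 = np.column_stack([np.random.rand(100) for _ in range(3)])
            data2 = np.column_stack([np.random.rand(200) for _ in range(3)])
            ds1 = grp.create_dataset("data1", data=data1)
            ds2 = grp.create_dataset("data2", data=data2)
            # The arrays carry no column names, so note them in an attribute.
            ds1.attrs["columns"] = "x1,y1,y1_err"
            ds2.attrs["columns"] = "x2,y2,y2_err"


# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "plain_arrays.h5")
write_plain_arrays(path)
```

The trade-off is that a reader has to know (or look up in the attribute) which column is which, whereas the structured version lets you index by field name.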

Here is the same procedure with PyTables (the package is imported as tables):

import tables as tb # (this is PyTables)
import numpy as np

table1_dt = np.dtype([('x1',float), ('y1',float), ('y1_err',float),])
table2_dt = np.dtype([('x2',float), ('y2',float), ('y2_err',float),])

Systems=np.arange(10_000)

with tb.open_file('SO_71335363_tb.h5','w') as h5f:
    
    for sys in Systems:
        grp = h5f.create_group('/',f'System_{sys:05}')
        x1,y1,y1_err=np.random.rand(100),np.random.rand(100),np.random.rand(100)
        t1_arr = np.empty(dtype=table1_dt,shape=(x1.shape[0],))
        t1_arr['x1'] = x1
        t1_arr['y1'] = y1
        t1_arr['y1_err'] = y1_err       
        h5f.create_table(grp,'table1',obj=t1_arr)
        
        x2,y2,y2_err=np.random.rand(200),np.random.rand(200),np.random.rand(200)
        t2_arr = np.empty(dtype=table2_dt,shape=(x2.shape[0],))
        t2_arr['x2'] = x2
        t2_arr['y2'] = y2
        t2_arr['y2_err'] = y2_err       
        h5f.create_table(grp,'table2',obj=t2_arr)
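
To check the result, the data can be read back by group and dataset name. The sketch below uses h5py and, to stay self-contained, first writes a small stand-in file with the same System_NNNNN/table1 layout; the file name readback_demo.h5 and the count of 3 systems are assumptions for illustration only.

```python
import os
import tempfile

import h5py
import numpy as np

table1_dt = np.dtype([("x1", float), ("y1", float), ("y1_err", float)])

# Write a small stand-in file (3 systems) with the layout used above.
path = os.path.join(tempfile.mkdtemp(), "readback_demo.h5")
with h5py.File(path, "w") as h5f:
    for sys_id in range(3):
        t1_arr = np.empty(100, dtype=table1_dt)
        for name in table1_dt.names:
            t1_arr[name] = np.random.rand(100)
        # h5py creates the intermediate group automatically.
        h5f.create_dataset(f"System_{sys_id:05}/table1", data=t1_arr)

with h5py.File(path, "r") as h5f:
    system_names = sorted(h5f.keys())
    # Read the whole table as a NumPy structured array.
    table1 = h5f["System_00001/table1"][:]
    # Indexing the dataset with a field name reads just that column.
    y1 = h5f["System_00001/table1"]["y1"]

print(system_names)
print(table1.dtype.names, y1.shape)
```

Because the zero-padded group names sort in system order, iterating over sorted keys visits the systems from first to last.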
