Python/PyTables: Can different columns of an array have different data types?

Question

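I created an extendable EArray with N rows and 4 columns. Some columns need the float64 data type; the others can be handled with int32. Is it possible to vary the data type between columns? Right now I use a single type for everything (float64, as below), but that takes a huge amount of disk space (files larger than 10 GB).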

For example, how can I make sure that the elements of columns 1-2 are int32 and the elements of columns 3-4 are float64?

import tables
import numpy as np
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

Here is a simplified version of how I append to the EArray:

Matrix = np.ones(shape=(10**6, 4))

if counter <= 10**6: # keep appending to Matrix until 10**6 rows
    Matrix[s:s+length, 0:4] = chunk2[left:right] # chunk2 is input np.ndarray
    s += length

# save to disk when rows = 10**6
if counter > 10**6:
    a.append(Matrix[:s])  
    del Matrix
    Matrix = np.ones(shape=(10**6, 4))

What are the drawbacks of the following approach?

import tables as tb
import numpy as np

filename = 'foo.h5'
f = tb.open_file(filename, mode='w')
int_app = f.create_earray(f.root, "col1", atom=tb.Int32Atom(), shape=(0,2), chunkshape=(3,2))
float_app = f.create_earray(f.root, "col2", atom=tb.Float64Atom(), shape=(0,2), chunkshape=(3,2))

# array containing ints..in reality it will be 10**6x2
arr1 = np.array([[1, 1],
                [2, 2],
                [3, 3]], dtype=np.int32)

# array containing floats..in reality it will be 10**6x2
arr2 = np.array([[1.1,1.2],
                 [1.1,1.2],
                 [1.1,1.2]], dtype=np.float64)

for i in range(3):
    int_app.append(arr1)
    float_app.append(arr2)

f.close()

print('\n*********************************************************')
print("\t\t Reading Now=> ")
print('*********************************************************')
c = tb.open_file('foo.h5', mode='r')
chunks1 = c.root.col1
chunks2 = c.root.col2
chunk1 = chunks1.read()
chunk2 = chunks2.read()
print(chunk1)
print(chunk2)

Answer 1

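No and yes. All PyTables array types (Array, CArray, EArray, VLArray) hold homogeneous data types (similar to a NumPy ndarray). If you want to mix data types, you need to use a Table. Tables are extendable; they have an .append() method to add rows of data.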
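To put a number on the space argument, here is a minimal NumPy-only sketch comparing the bytes per row of four float64 columns with a mixed record of two int32 and two float64 fields. The field names are illustrative, and the real file size also includes HDF5 overhead, so treat the figure as a rough estimate.

```python
import numpy as np

# Homogeneous layout: every one of the 4 columns is float64.
bytes_homogeneous = 4 * np.dtype(np.float64).itemsize

# Mixed layout: two int32 and two float64 fields (field names are illustrative).
mixed_dt = np.dtype([('int1', np.int32), ('int2', np.int32),
                     ('float1', np.float64), ('float2', np.float64)])
bytes_mixed = mixed_dt.itemsize

# Fraction of raw row storage saved by the mixed layout.
saving = 1 - bytes_mixed / bytes_homogeneous
print(bytes_homogeneous, bytes_mixed, saving)
```

Under these assumptions the mixed record needs 24 bytes per row instead of 32, so the raw data shrinks by about a quarter before any compression.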

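The creation process is similar to the one in this answer (only the dtype is different): PyTables create_array fails to save numpy array. You only define the data types for one row. You don't define the shape or the number of rows; that is implied as you add data to the table. If you already have your data in a NumPy recarray, you can pass it as the description= argument, and the table will take its dtype from the recarray and be populated with the data. More information here: PyTables Table class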
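As a small sketch of the description= shortcut described above: the block below passes an existing recarray directly to create_table() and then reads the table back. The file location, field names, and sample values are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Five rows of sample data; field names are illustrative.
rec = np.rec.fromarrays(
    [np.arange(5), np.arange(5) * 10, np.linspace(0.0, 1.0, 5), np.ones(5)],
    dtype=[('int1', np.int32), ('int2', np.int32),
           ('float1', np.float64), ('float2', np.float64)])

path = os.path.join(tempfile.mkdtemp(), 'description_demo.h5')
with tb.open_file(path, 'w') as h5f:
    # Passing the recarray as description= supplies both the dtype and the data.
    table = h5f.create_table('/', 'dataset_1', description=rec)
    nrows = table.nrows
    field_names = table.dtype.names
    stored = table.read()  # structured NumPy array with the table contents

print(nrows, field_names)
```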

Your code would look something like this:

import tables as tb
import numpy as np
table_dt = np.dtype(
           {'names': ['int1', 'int2', 'float1', 'float2'],
            # explicit sizes: plain int/float map to platform defaults (often 64-bit)
            'formats': [np.int32, np.int32, np.float64, np.float64]})
# Create some random data:
i1 = np.random.randint(0,1000, (10**6,) )
i2 = np.random.randint(0,1000, (10**6,) )
f1 = np.random.rand(10**6)
f2 = np.random.rand(10**6)

with tb.open_file('table.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)

    # Method 1: create an empty recarray 'Matrix', then add the data
    Matrix = np.recarray((10**6,), dtype=table_dt)
    Matrix['int1'] = i1
    Matrix['int2'] = i2
    Matrix['float1'] = f1
    Matrix['float2'] = f2
    # Append Matrix to the table
    a.append(Matrix)

    # Method 2: create the recarray 'Matrix' with the data in one step
    Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
    # Append Matrix to the table
    a.append(Matrix)
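The question appends blocks taken from a plain 2-D float array, so here is a hedged sketch of how that loop might look with a Table instead of an EArray. The helper to_records() is hypothetical (it is not part of PyTables), the block sizes are tiny for illustration, and the file path is a temporary location.

```python
import os
import tempfile

import numpy as np
import tables as tb

table_dt = np.dtype([('int1', np.int32), ('int2', np.int32),
                     ('float1', np.float64), ('float2', np.float64)])

def to_records(chunk):
    """Hypothetical helper: convert an (n, 4) float array to a structured array."""
    rec = np.empty(chunk.shape[0], dtype=table_dt)
    # Assigning floats to the int32 fields truncates them to integers.
    rec['int1'] = chunk[:, 0]
    rec['int2'] = chunk[:, 1]
    rec['float1'] = chunk[:, 2]
    rec['float2'] = chunk[:, 3]
    return rec

path = os.path.join(tempfile.mkdtemp(), 'append_demo.h5')
with tb.open_file(path, 'w') as h5f:
    table = h5f.create_table('/', 'dataset_1', description=table_dt)
    for i in range(3):
        # Stand-in for one input block (10 rows here, 10**6 in the question).
        chunk = np.column_stack([np.full(10, i), np.arange(10),
                                 np.random.rand(10), np.random.rand(10)])
        table.append(to_records(chunk))
    table.flush()

with tb.open_file(path, 'r') as h5f:
    table = h5f.root.dataset_1
    total_rows = table.nrows
    int1_values = table.col('int1')  # read a single column as a NumPy array

print(total_rows, int1_values.dtype)
```

Reading one column with col() avoids loading the other fields, which is one practical difference from keeping the integers and floats in two separate EArrays.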

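You mentioned creating a very large file, but did not say how many rows it has (obviously far more than 10**6). Here are some additional thoughts based on comments in another thread.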

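The .create_table() method has an optional parameter: expectedrows=. This parameter is used "to optimize the HDF5 B-Tree and amount of memory used". The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it is only 10000 in my installation). If you are creating 10**6 (or more) rows, I strongly suggest setting it to a larger value.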
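A minimal sketch of passing expectedrows= when the table is created. The value 10**7, the file path, and the field names are assumptions for illustration; the parameter is a sizing hint, not a limit, so the table can still grow beyond it.

```python
import os
import tempfile

import numpy as np
import tables as tb
from tables.parameters import EXPECTED_ROWS_TABLE

table_dt = np.dtype([('int1', np.int32), ('int2', np.int32),
                     ('float1', np.float64), ('float2', np.float64)])

print('library default for expectedrows:', EXPECTED_ROWS_TABLE)

path = os.path.join(tempfile.mkdtemp(), 'expectedrows_demo.h5')
with tb.open_file(path, 'w') as h5f:
    # Tell PyTables roughly how many rows to expect so it can size chunks.
    table = h5f.create_table('/', 'dataset_1', description=table_dt,
                             expectedrows=10**7)
    table.append(np.zeros(1000, dtype=table_dt))
    table.flush()
    nrows = table.nrows
    chunkshape = table.chunkshape  # chunk size PyTables chose for this table

print(nrows, chunkshape)
```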

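You should also consider file compression. There is a trade-off: compression reduces the file size, but it reduces I/O performance (increases access time). There are several options: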

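Enable compression when you create the file (add the filters= parameter when creating the file). Start with tb.Filters(complevel=1).
Use the HDF Group utility h5repack: run it against an HDF5 file to create a new file (useful for going from uncompressed to compressed, or vice versa).
Use the PyTables utility ptrepack: it works like h5repack and is delivered with PyTables.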
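The first option can be sketched as follows: the same table is written once without filters and once with tb.Filters(complevel=1), and the resulting file sizes are compared. The sample data is deliberately repetitive so that it compresses well; real data may compress much less. The helper write_table() and the file names are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables as tb

table_dt = np.dtype([('int1', np.int32), ('int2', np.int32),
                     ('float1', np.float64), ('float2', np.float64)])

n_rows = 200_000
data = np.zeros(n_rows, dtype=table_dt)
data['int1'] = np.arange(n_rows) % 1000  # repetitive values compress well
data['float1'] = 0.5

def write_table(path, filters=None):
    """Hypothetical helper: write `data` to `path` and return the file size in bytes."""
    with tb.open_file(path, 'w', filters=filters) as h5f:
        table = h5f.create_table('/', 'dataset_1', description=table_dt,
                                 expectedrows=n_rows)
        table.append(data)
    return os.path.getsize(path)

tmpdir = tempfile.mkdtemp()
plain_size = write_table(os.path.join(tmpdir, 'plain.h5'))
packed_size = write_table(os.path.join(tmpdir, 'packed.h5'),
                          filters=tb.Filters(complevel=1))

print(plain_size, packed_size)
```

Filters set on the file act as the default for the nodes created in it, so the table itself needs no extra argument here.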

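I tend to keep the files I work with often uncompressed for the best I/O performance. Then, when I am done, I convert them to a compressed format for long-term archiving.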

Solution 1
1 Accepted 2020-08-20 01:32:11
