
Write heterogeneous numpy arrays to binary files

I have a large number n of 3x3 matrices, vectors of length 3, and ints which I need to write to a file in a given binary format. I could easily enough use a for loop to fh.write() the items one after another, but this is slow. An alternative is to copy the data into an array with a special dtype. This is much faster, but creates a prohibitively large copy in memory:

import numpy as np

n = 100  # a large number
A = np.random.rand(n, 3, 3)
b = np.random.rand(n, 3)
c = np.ones(n, dtype=int)

# slow
with open("out.dat", "wb") as fh:
    for a_, b_, c_ in zip(A, b, c):
        fh.write(a_)
        fh.write(b_)
        fh.write(c_)

# memory-consuming
dtype = np.dtype([
  ('A', ('<f', (3, 3))),
  ('b', ('<f', 3)),
  ('c', '<H'),
])
data = np.empty(n, dtype=dtype)
data["A"] = A
data["b"] = b
data["c"] = c
with open("out.dat", "wb") as fh:
    data.tofile(fh)

Is there a fast, memory-efficient alternative here?

Writing files block-wise

Please note that your first and second versions lead to different results; I will focus on the second version here. Compared to block-wise writing, this version not only has a significant memory overhead, but is also slower than splitting the process into multiple chunks.
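
To see why the results differ: the loop in the first version writes the arrays with their native dtypes (typically float64 and int64 on 64-bit platforms), while the structured dtype truncates them to float32 and uint16. A minimal sketch of the per-record size difference, under that platform assumption:

import numpy as np

# per-record size of version 1: native dtypes
# (float64/int64 on most 64-bit platforms): 9*8 + 3*8 + 8 bytes
native_size = 9 * 8 + 3 * 8 + 8
# per-record size of version 2: the packed structured dtype
dtype = np.dtype([
    ('A', ('<f', (3, 3))),
    ('b', ('<f', 3)),
    ('c', '<H'),
])
print(native_size, dtype.itemsize)  # 104 vs. 50 bytes per record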

Example

def write_method_2(file_name, A, b, c):
    n = A.shape[0]

    dtype = np.dtype([
        ('A', ('<f', (3, 3))),
        ('b', ('<f', 3)),
        ('c', '<H'),
    ])

    data = np.empty(n, dtype=dtype)
    data["A"] = A
    data["b"] = b
    data["c"] = c
    with open(file_name, "wb") as fh:
        data.tofile(fh)

The only drawback is longer code... With a generator function it should also be possible to generalize this for multiple IO operations.

def write_method_3(file_name, A, b, c):
    n = A.shape[0]
    blk_size = 10_000

    dtype = np.dtype([
        ('A', ('<f', (3, 3))),
        ('b', ('<f', 3)),
        ('c', '<H'),
    ])

    # reusable buffer of blk_size records
    data = np.empty(blk_size, dtype=dtype)
    with open(file_name, "wb") as fh:
        # write full blocks
        n_full_blocks = n // blk_size
        for i in range(n_full_blocks):
            data["A"] = A[i*blk_size:(i+1)*blk_size]
            data["b"] = b[i*blk_size:(i+1)*blk_size]
            data["c"] = c[i*blk_size:(i+1)*blk_size]
            data.tofile(fh)
        # write remainder
        data = data[:n - n_full_blocks*blk_size]
        data["A"] = A[n_full_blocks*blk_size:]
        data["b"] = b[n_full_blocks*blk_size:]
        data["c"] = c[n_full_blocks*blk_size:]
        data.tofile(fh)

Edit

This is a more general way to write data from several nd-arrays to a file using a non-simple datatype.

def write_method_3_gen(fh, dtype, tuple_of_arr, blk_size=500_000):
    """
    fh             file handle
    dtype          some non-simple dtype
    tuple_of_arr   tuple of arrays, one per field of dtype, in field order
    blk_size       size of a block in bytes, default 0.5MB
    """
    n = tuple_of_arr[0].shape[0]
    keys = dtype.names  # field names, in declaration order
    blk_size = blk_size // dtype.itemsize  # convert bytes to records
    data = np.empty(blk_size, dtype=dtype)

    # write full blocks
    n_full_blocks = n // blk_size
    for i in range(n_full_blocks):
        for j in range(len(tuple_of_arr)):
            data[keys[j]] = tuple_of_arr[j][i*blk_size:(i+1)*blk_size]
        data.tofile(fh)

    # write remainder
    data = data[:n - n_full_blocks*blk_size]
    for j in range(len(tuple_of_arr)):
        data[keys[j]] = tuple_of_arr[j][n_full_blocks*blk_size:]
    data.tofile(fh)
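
A minimal usage sketch, assuming the dtype and the arrays A, b, c from above (out_gen.dat is just an example file name; the tuple must list the arrays in the same order as the dtype fields):

dtype = np.dtype([
    ('A', ('<f', (3, 3))),
    ('b', ('<f', 3)),
    ('c', '<H'),
])
with open("out_gen.dat", "wb") as fh:
    write_method_3_gen(fh, dtype, (A, b, c))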

Timings

import numpy as np
import time

n = 10_000_000  # a large number
A = np.random.rand(n, 3, 3)
b = np.random.rand(n, 3)
c = np.ones(n, dtype=int)

t1 = time.time()
write_method_2("out_2.dat", A, b, c)
print(time.time() - t1)
# 3.7440097332000732

# with blk_size=10_000 this has only a 0.5MB memory overhead,
# which stays constant, even on much larger examples
t1 = time.time()
write_method_3("out_3.dat", A, b, c)
print(time.time() - t1)
# 0.8538124561309814
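
As a sanity check (not part of the original timings), the block-wise output can be read back with np.fromfile and compared field by field; a sketch, assuming the arrays from the timing run are still in memory:

dtype = np.dtype([
    ('A', ('<f', (3, 3))),
    ('b', ('<f', 3)),
    ('c', '<H'),
])
check = np.fromfile("out_3.dat", dtype=dtype)
assert check.shape[0] == n
assert np.array_equal(check["A"], A.astype('<f'))
assert np.array_equal(check["b"], b.astype('<f'))
assert np.array_equal(check["c"], c.astype('<H'))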
