Write heterogeneous numpy arrays to binary files
I have a large number n of 3x3 matrices, vectors of length 3, and ints which I need to write to a file in a given binary format. I could easily enough use a for loop to fh.write() the items one after another, but this is slow. An alternative is to copy the data into an array with a special dtype. This is much faster, but creates a prohibitively large copy in memory:
import numpy as np
n = 100 # a large number
A = np.random.rand(n, 3, 3)
b = np.random.rand(n, 3)
c = np.ones(n, dtype=int)
# slow
with open("out.dat", "wb") as fh:
    for a_, b_, c_ in zip(A, b, c):
        fh.write(a_)
        fh.write(b_)
        fh.write(c_)
# memory-consuming
dtype = np.dtype([
    ('A', ('<f', (3, 3))),
    ('b', ('<f', 3)),
    ('c', '<H'),
])
data = np.empty(n, dtype=dtype)
data["A"] = A
data["b"] = B
data["c"] = c
with open("out.dat", "wb") as fh:
    data.tofile(fh)
Is there a fast, memory-efficient alternative here?
Please consider that your first and second versions lead to different results: the input arrays hold float64 values (and c holds platform-sized ints), while the '<f' and '<H' fields of the dtype are float32 and uint16, so the second version writes narrower types. I will focus on the second version here. Compared to block-wise writing, this version not only has a significant memory overhead, but is also slower than splitting the work into multiple chunks.
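The type mismatch can be checked directly on the dtype itself; a minimal standalone check (not part of the original code):

```python
import numpy as np

# The record dtype from the question: '<f' is little-endian float32 and
# '<H' is uint16, while np.random.rand produces float64.
dtype = np.dtype([
    ('A', ('<f', (3, 3))),
    ('b', ('<f', 3)),
    ('c', '<H'),
])
print(dtype['A'].base)   # float32
print(dtype.itemsize)    # 50 bytes per record (9*4 + 3*4 + 2)

x = np.float64(0.1234567890123456)
print(np.float32(x) == x)  # False: assigning into a float32 field loses precision
```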
Example
def write_method_2(file_name, A, b, c):
    n = A.shape[0]
    dtype = np.dtype([
        ('A', ('<f', (3, 3))),
        ('b', ('<f', 3)),
        ('c', '<H'),
    ])
    data = np.empty(n, dtype=dtype)
    data["A"] = A
    data["b"] = b
    data["c"] = c
    with open(file_name, "wb") as fh:
        data.tofile(fh)
The only drawback is longer code... With a generator function it should also be possible to generalize this for multiple IO operations.
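As a sketch of the chunking idea, a generator yielding block slices could look like this (the name chunk_slices is illustrative, not from the original code):

```python
def chunk_slices(n, blk_size):
    """Yield slice objects covering range 0..n in blocks of at most blk_size."""
    for start in range(0, n, blk_size):
        yield slice(start, min(start + blk_size, n))

# The last slice is shorter when n is not a multiple of blk_size:
print(list(chunk_slices(7, 3)))  # [slice(0, 3, None), slice(3, 6, None), slice(6, 7, None)]
```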
def write_method_3(file_name, A, b, c):
    n = A.shape[0]
    blk_size = 10_000
    dtype = np.dtype([
        ('A', ('<f', (3, 3))),
        ('b', ('<f', 3)),
        ('c', '<H'),
    ])
    data = np.empty(blk_size, dtype=dtype)
    with open(file_name, "wb") as fh:
        # write full blocks
        n_full_blocks = n // blk_size
        for i in range(n_full_blocks):
            data["A"] = A[i*blk_size:(i+1)*blk_size]
            data["b"] = b[i*blk_size:(i+1)*blk_size]
            data["c"] = c[i*blk_size:(i+1)*blk_size]
            data.tofile(fh)
        # write remainder
        data = data[:n - n_full_blocks*blk_size]
        data["A"] = A[n_full_blocks*blk_size:]
        data["b"] = b[n_full_blocks*blk_size:]
        data["c"] = c[n_full_blocks*blk_size:]
        data.tofile(fh)
Edit
This is a more general way to write data from several nd-arrays to a file using a non-simple datatype.
def write_method_3_gen(fh, dtype, tuple_of_arr, blk_size=500_000):
    """
    fh            file handle
    dtype         some non-simple dtype
    tuple_of_arr  tuple of arrays, one per dtype field
    blk_size      size of a block in bytes, default 0.5MB
    """
    keys = dtype.names
    n = tuple_of_arr[0].shape[0]
    blk_size = blk_size // dtype.itemsize  # bytes -> records per block
    data = np.empty(blk_size, dtype=dtype)
    # write full blocks
    n_full_blocks = n // blk_size
    for i in range(n_full_blocks):
        for j in range(len(tuple_of_arr)):
            data[keys[j]] = tuple_of_arr[j][i*blk_size:(i+1)*blk_size]
        data.tofile(fh)
    # write remainder
    data = data[:n - n_full_blocks*blk_size]
    for j in range(len(tuple_of_arr)):
        data[keys[j]] = tuple_of_arr[j][n_full_blocks*blk_size:]
    data.tofile(fh)
Timings
import numpy as np
import time

n = 10_000_000  # a large number
A = np.random.rand(n, 3, 3)
b = np.random.rand(n, 3)
c = np.ones(n, dtype=int)

t1 = time.time()
write_method_2("out_2.dat", A, b, c)
print(time.time() - t1)
# 3.7440097332000732

# with blk_size=10_000 this has only ~0.5MB memory overhead,
# which stays constant even on much larger examples
t1 = time.time()
write_method_3("out_3.dat", A, b, c)
print(time.time() - t1)
# 0.8538124561309814
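As a closing sanity check (not part of the original timings), the structured-dtype layout round-trips through np.fromfile; the file name check.dat and small n are illustrative:

```python
import numpy as np

# Write a few records via the structured dtype and read them back.
n = 7
dtype = np.dtype([
    ('A', ('<f', (3, 3))),
    ('b', ('<f', 3)),
    ('c', '<H'),
])
data = np.empty(n, dtype=dtype)
data["A"] = np.random.rand(n, 3, 3)
data["b"] = np.random.rand(n, 3)
data["c"] = np.arange(n)
with open("check.dat", "wb") as fh:
    data.tofile(fh)

back = np.fromfile("check.dat", dtype=dtype)
print(back.shape[0])                            # 7
print(np.array_equal(back["c"], np.arange(n)))  # True
```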