python - saving numpy array to a file (smallest size possible)

Right now I have a Python program that builds a fairly large 2D numpy array and saves it as a tab-delimited text file using numpy.savetxt. The numpy array contains only floats. I then read the file one row at a time in a separate C++ program.

What I would like to do is find a way to accomplish the same task, changing my code as little as possible, so that I can decrease the size of the file I am passing between the two programs.

I found that I can use numpy.savetxt to save to a compressed .gz file instead of a plain text file. This lowers the file size from ~2MB to ~100kB.
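
For reference, here is a minimal sketch of the .gz approach described above. numpy.savetxt compresses the output when the file name ends in .gz. The array contents, the "%.4f" format and the file names are assumptions made up for the example, not taken from the original program.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the real array: floats, roughly 70% zeros.
rng = np.random.default_rng(0)
a = rng.random((200, 50))
a[rng.random(a.shape) < 0.7] = 0.0

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "data.txt")
gz_path = os.path.join(tmpdir, "data.txt.gz")

# Same call in both cases; the ".gz" suffix makes savetxt gzip the output.
np.savetxt(plain_path, a, delimiter="\t", fmt="%.4f")
np.savetxt(gz_path, a, delimiter="\t", fmt="%.4f")

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)

# loadtxt also understands ".gz", which is handy for a round-trip check.
restored = np.loadtxt(gz_path, delimiter="\t")
```

The C++ reader then has to decompress the gzip stream before parsing the rows, so this keeps the Python side unchanged but not the C++ side.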

Is there a better way to do this? Could I, perhaps, write the numpy array to the file in binary to save space? If so, how would I do this so that I can still read it into the C++ program?

Thank you for the help. I appreciate any guidance I can get.

EDIT:

There are a lot of zeros (probably 70% of the values in the numpy array are 0.0000). I am not sure how I can exploit this, though, to generate a tiny file that my C++ program can read in.

Unless you are sure you don't need to worry about endianness and such, best use numpy.savez, as explained in @unutbu's answer and @jorgeca's comment here: numpy's tostring/fromstring --- what do I need to specify to restore the array.
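
A minimal sketch of the savez route, using the compressed variant numpy.savez_compressed since file size is the concern here. The key name "data" and the file name are arbitrary choices for the example. Note that a .npz file is a zip archive of .npy files, so the C++ side would need code that understands that container.

```python
import os
import tempfile

import numpy as np

a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

path = os.path.join(tempfile.mkdtemp(), "array.npz")

# "data" is an arbitrary key chosen for this example.
np.savez_compressed(path, data=a)

# The dtype and shape are stored in the file, so nothing has to be
# specified again when loading.
with np.load(path) as npz:
    restored = npz["data"]
```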

If the resulting size is not small enough, there's always zlib (on Python's side: import zlib; on the C++ side, I'm sure an implementation exists).
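
As a rough sketch of the zlib idea on the Python side: compress the raw bytes of the array and write those out. The test array (about 70% zeros) and the choice of little-endian float64 are assumptions for the example. The C++ reader would have to know the shape and dtype separately, since raw bytes carry no metadata.

```python
import zlib

import numpy as np

# Hypothetical stand-in for the real array: floats, roughly 70% zeros.
rng = np.random.default_rng(0)
a = rng.random((200, 50))
a[rng.random(a.shape) < 0.7] = 0.0

# Fix the on-disk layout explicitly: row-major, little-endian float64.
raw = a.astype("<f8").tobytes()
packed = zlib.compress(raw, 9)  # 9 = best compression, slowest

# Round trip, to check that nothing is lost.
restored = np.frombuffer(zlib.decompress(packed), dtype="<f8").reshape(a.shape)
```

On the C++ side the same stream can be inflated with the zlib C library (its uncompress function, for instance) before interpreting the bytes as doubles.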

An alternative would be to use the hdf5 format: while it does not necessarily reduce the on-disk file size, it does make saving/loading faster (large data arrays are what the format was designed for). There are both Python and C++ readers/writers for hdf5.

Since you have a lot of zeroes, you could write out only the non-zero elements in the form (index, number).

Suppose you have an array with a small number of nonzero values:

In [5]: a = np.zeros((10, 10))

In [6]: a
Out[6]: 
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [7]: a[3,1] = 2.0

In [8]: a[7,4] = 17.0

In [9]: a[9,0] = 1.5

First, isolate the interesting numbers and their indices:

In [11]: x, y = a.nonzero()

In [12]: nonzero = list(zip(x.tolist(), y.tolist()))

In [13]: nonzero
Out[13]: [(3, 1), (7, 4), (9, 0)]

Now you only have a small number of data elements left. The easiest thing is to write them to a text file:

In [17]: with open('numbers.txt', 'w+') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write('{:d} {:d} {:g}\n'.format(r, k, a[r,k]))
   ....:         

In [18]: cat numbers.txt
3 1 2
7 4 17
9 0 1.5

This also gives you an opportunity to eyeball the data. In your C++ program you can read this data with fscanf.

But you can reduce the size even more by writing binary data using struct:

In [19]: import struct

In [20]: c = struct.Struct('=IId')

In [21]: with open('numbers.bin', 'wb') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write(c.pack(r, k, a[r,k]))

The argument to the Struct constructor means: use native byte order with standard sizes and no alignment padding ('='). The first and second data elements are unsigned integers ('I'), and the third element is a double ('d'). Each record therefore takes 4 + 4 + 8 = 16 bytes.

In your C++ program this data is probably best read as binary data into a packed struct.
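
To check the record layout before writing the C++ reader, the file can be read back in Python. This is only a sketch: the entries are the three values from the example above, and the file name is an arbitrary choice. It uses the same '=IId' format for both directions.

```python
import os
import struct
import tempfile

# uint32 row, uint32 column, float64 value; native byte order, no padding.
record = struct.Struct("=IId")

entries = [(3, 1, 2.0), (7, 4, 17.0), (9, 0, 1.5)]

path = os.path.join(tempfile.mkdtemp(), "numbers.bin")

# Binary mode matters: pack() returns bytes, not str.
with open(path, "wb") as outf:
    for row, col, value in entries:
        outf.write(record.pack(row, col, value))

with open(path, "rb") as inf:
    payload = inf.read()

# iter_unpack walks the buffer one fixed-size record at a time.
decoded = list(record.iter_unpack(payload))
```

Here record.size is 16, so the file is exactly 16 bytes per nonzero element. A C++ struct holding two 32-bit unsigned integers followed by a double should match this layout on a machine with the same byte order.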

EDIT: Answer updated for a 2D array.

numpy.ndarray.tofile and numpy.fromfile are useful for direct binary output/input from Python. std::ostream::write and std::istream::read are useful for binary output/input in C++.

You should be careful about endianness if the data are transferred from one machine to another.
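
A minimal sketch of the tofile/fromfile route with the byte order pinned down explicitly. The small header (two little-endian uint32 values holding the shape) is a made-up convention for this example, since tofile itself stores no shape or dtype information.

```python
import os
import tempfile

import numpy as np

a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

path = os.path.join(tempfile.mkdtemp(), "array.bin")

with open(path, "wb") as outf:
    # Hypothetical header: rows and columns as little-endian uint32.
    np.array(a.shape, dtype="<u4").tofile(outf)
    # Payload: row-major little-endian float64, whatever the host order is.
    a.astype("<f8").tofile(outf)

with open(path, "rb") as inf:
    rows, cols = np.fromfile(inf, dtype="<u4", count=2)
    restored = np.fromfile(inf, dtype="<f8").reshape(int(rows), int(cols))
```

In C++ the same file can be consumed with two std::istream::read calls: one for the 8-byte header, one for rows * cols doubles.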

Use an hdf5 file. They are really simple to use through h5py, and you can enable compression with a flag. Note that hdf5 also has a C++ interface.

If you don't mind installing additional packages (for both Python and C++), you can use BSON (Binary JSON).
