python - Saving a numpy array to a file (as small as possible)

Question

Right now I have a python program building a fairly large 2D numpy array and saving it as a tab delimited text file using numpy.savetxt. The numpy array contains only floats. I then read the file in one row at a time in a separate C++ program.

What I would like to do is find a way to accomplish this same task, changing my code as little as possible, so that I can decrease the size of the file I am passing between the two programs.

I found that I can use numpy.savetxt to save to a compressed .gz file instead of a text file. This lowers the file size from ~2MB to ~100kB.
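For reference, a minimal sketch of the .gz approach described above. The array contents, its shape, and the file names are made up for illustration; only the roughly 70% share of zeros is taken from the question.

```python
import os
import tempfile

import numpy as np

# Illustrative data (assumption): a 200 x 50 float array with roughly 70% zeros.
rng = np.random.default_rng(0)
a = rng.random((200, 50))
a[rng.random(a.shape) < 0.7] = 0.0

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "data.txt")
gz_path = os.path.join(tmpdir, "data.txt.gz")

# numpy.savetxt gzip-compresses the output when the file name ends in .gz.
np.savetxt(plain_path, a, delimiter="\t")
np.savetxt(gz_path, a, delimiter="\t")

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)
print("plain:", plain_size, "bytes; gzipped:", gz_size, "bytes")

# numpy.loadtxt decompresses .gz files transparently when reading back.
b = np.loadtxt(gz_path, delimiter="\t")
```

The exact ratio depends on the data, so the numbers printed here will not match the ~2MB to ~100kB figures above.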

Is there a better way to do this? Could I, perhaps, write the numpy array in binary to the file to save space? If so, how would I do this so that I can still read it into the C++ program?

Thank you for the help. I appreciate any guidance I can get.

EDIT:

There are a lot of zeros (probably 70% of the values in the numpy array are 0.0000). I am not sure how I can exploit this to generate a tiny file that my C++ program can read in.

Answer 1

Unless you are sure you don't need to worry about endianness and such, best use numpy.savez, as explained in @unutbu's answer and @jorgeca's comment here: numpy's tostring/fromstring --- what do I need to specify to restore the array.
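A minimal sketch of the numpy.savez route. It uses numpy.savez_compressed, a variant of numpy.savez that also compresses the archive; the answer only names numpy.savez, so the compressed variant is my substitution. The key name "a" and the file name are arbitrary choices for this example.

```python
import os
import tempfile

import numpy as np

# Small example array (assumption): mostly zeros, three nonzero entries.
a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

path = os.path.join(tempfile.mkdtemp(), "array.npz")

# The .npz file is a zip archive of .npy files; each .npy header records
# the dtype (including byte order) and the shape of the stored array.
np.savez_compressed(path, a=a)

# Load it back by the key name used when saving.
with np.load(path) as data:
    restored = data["a"]
```

Because the header carries dtype and shape, the array is restored without extra bookkeeping on the python side. The C++ program would need code that can read a zip archive and parse the .npy header.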

If the resulting size is not small enough, there's always zlib (on python's side: import zlib; on the C++ side, I'm sure an implementation exists).
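As a rough sketch of the zlib idea on the python side, the raw bytes of the array can be compressed directly. The fixed little-endian float64 layout ("<f8") and the compression level are assumptions made for this example; the shape is not stored and has to be passed to the reader some other way.

```python
import zlib

import numpy as np

# Small example array (assumption): mostly zeros, three nonzero entries.
a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

# Fix the byte order explicitly so the reader does not have to guess:
# little-endian float64, row-major (C) order.
raw = a.astype("<f8").tobytes()

# Compress with the highest zlib level; long runs of zeros shrink well.
packed = zlib.compress(raw, 9)
print(len(raw), "bytes raw,", len(packed), "bytes compressed")

# Round trip: decompress and reinterpret the bytes with the same dtype.
restored = np.frombuffer(zlib.decompress(packed), dtype="<f8").reshape(a.shape)
```

The C++ program would decompress the same stream with the zlib C library and interpret the result as consecutive doubles.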

An alternative would be to use the hdf5 format: while it does not necessarily reduce the on-disk file size, it does make saving/loading faster (this is what the format was designed for: large data arrays). There are both python and C++ readers/writers for hdf5.

Answer 2

Since you have a lot of zeroes, you could write out only the non-zero elements, in the form (index, number).

Suppose you have an array with a small number of nonzero values:

In [5]: a = np.zeros((10, 10))

In [6]: a
Out[6]: 
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [7]: a[3,1] = 2.0

In [8]: a[7,4] = 17.0

In [9]: a[9,0] = 1.5

First, isolate the interesting numbers and their indices:

In [11]: x, y = a.nonzero()

In [12]: nonzero = list(zip(x.tolist(), y.tolist()))

In [13]: nonzero
Out[13]: [(3, 1), (7, 4), (9, 0)]

Now you only have a small number of data elements left. The easiest thing is to write them to a text file:

In [17]: with open('numbers.txt', 'w+') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write('{:d} {:d} {:g}\n'.format(r, k, a[r,k]))
   ....:         

In [18]: cat numbers.txt
3 1 2
7 4 17
9 0 1.5

This also gives you an opportunity to eyeball the data. In your C++ program you can read this data with fscanf.

But you can reduce the size even more by writing binary data using struct:

In [17]: import struct

In [19]: c = struct.Struct('=IId')

In [20]: with open('numbers.bin', 'wb') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write(c.pack(r, k, a[r,k]))

The argument to the Struct constructor means: use native byte order with standard sizes and no alignment padding ('='). The first and second data elements are unsigned integers ('I'), the third element is a double ('d').

In your C++ program this data is probably best read as binary data into a packed struct.
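To check the record layout before writing the C++ reader, the file can be read back in python. This is a sketch under the same '=IId' format as above; the file location and the restored array are names made up for the example, and the array shape is assumed to be known to the reader because the file does not store it.

```python
import os
import struct
import tempfile

import numpy as np

# Same example array as above: three nonzero entries in a 10 x 10 array.
a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

# '=IId': two 4-byte unsigned ints and one 8-byte double, no padding,
# so every record occupies exactly 16 bytes.
record = struct.Struct("=IId")

path = os.path.join(tempfile.mkdtemp(), "numbers.bin")
rows, cols = a.nonzero()
with open(path, "wb") as outf:  # binary mode is required for packed bytes
    for r, k in zip(rows.tolist(), cols.tolist()):
        outf.write(record.pack(r, k, a[r, k]))

# Read everything back and rebuild the dense array from the records.
with open(path, "rb") as inf:
    payload = inf.read()

restored = np.zeros_like(a)
for r, k, value in record.iter_unpack(payload):
    restored[r, k] = value
```

A matching C++ struct would hold two 32-bit unsigned integers followed by a double, read 16 bytes at a time.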

EDIT: Answer updated for a 2D array.

Answer 3

numpy.ndarray.tofile and numpy.fromfile are useful for direct binary output/input from python. std::ostream::write and std::istream::read are useful for binary output/input in C++.

You should be careful about endianness if the data are transferred from one machine to another.
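A minimal sketch of the tofile/fromfile round trip. Forcing little-endian float64 ("<f8") is one way to address the endianness concern and is my assumption, as are the file name and the example array. The file holds only the raw values, so the shape has to be known to the reader.

```python
import os
import tempfile

import numpy as np

# Small example array (assumption): mostly zeros, three nonzero entries.
a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

path = os.path.join(tempfile.mkdtemp(), "array.bin")

# Write raw little-endian float64 values in row-major order, no header.
a.astype("<f8").tofile(path)

# fromfile returns a flat array; the dtype and shape must be supplied.
flat = np.fromfile(path, dtype="<f8")
restored = flat.reshape(a.shape)
```

On the C++ side, each row is then a contiguous run of doubles (10 per row in this example), which fits reading one row at a time with std::istream::read.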

Answer 4

Use an hdf5 file. They are really simple to use through h5py, and you can set a compression flag. Note that hdf5 also has a C++ interface.
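A short sketch of what this could look like with h5py, assuming the package is installed. The dataset name "a", the file name, and the gzip level are arbitrary choices for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Small example array (assumption): mostly zeros, three nonzero entries.
a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

path = os.path.join(tempfile.mkdtemp(), "data.h5")

# compression="gzip" enables the built-in gzip filter for this dataset;
# compression_opts is the gzip level (0 to 9).
with h5py.File(path, "w") as f:
    f.create_dataset("a", data=a, compression="gzip", compression_opts=9)

# Read the whole dataset back into a numpy array.
with h5py.File(path, "r") as f:
    restored = f["a"][...]
```

The hdf5 container adds some fixed overhead, so for an array as small as this one the file may well be larger than the raw data; the compression pays off on larger arrays.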

Answer 5

If you don't mind installing additional packages (for both python and C++), you can use [BSON][1] (binary JSON).

Answer 1
3 2013-03-12 19:44:19

Answer 2
3 accepted 2013-03-12 20:46:13

Answer 3
1 2013-03-12 19:18:34

Answer 4
1 2013-10-07 14:02:46

Answer 5
0 2013-03-12 19:16:27