
python - saving numpy array to a file (smallest size possible)

Right now I have a python program building a fairly large 2D numpy array and saving it as a tab delimited text file using numpy.savetxt. The numpy array contains only floats. I then read the file in one row at a time in a separate C++ program.

What I would like to do is find a way to accomplish this same task, changing my code as little as possible such that I can decrease the size of the file I am passing between the two programs.

I found that I can use numpy.savetxt to save to a compressed .gz file instead of a text file. This lowers the file size from ~2MB to ~100kB.

Is there a better way to do this? Could I, perhaps, write the numpy array in binary to the file to save space? If so, how would I do this so that I can still read it into the C++ program?

Thank you for the help. I appreciate any guidance I can get.

EDIT:

There are a lot of zeros (probably 70% of the values in the numpy array are 0.0000). I am not sure how I can exploit this, though, to generate a tiny file that my C++ program can read in.

Unless you are sure you don't need to worry about endianness and such, best use numpy.savez , as explained in @unutbu's answer and @jorgeca's comment here: numpy's tostring/fromstring --- what do I need to specify to restore the array .

If the resulting size is not small enough, there's always zlib (on python's side: import zlib , on the C++ side, I'm sure an implementation exists).
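As a rough sketch of the zlib route: the array below is made-up test data with about 70% zeros, and the choice of little-endian 64-bit floats in row-major order is an assumption that the C++ reader would have to match. On the C++ side the bytes could be inflated with zlib's C API (for example `uncompress`), and the array shape would have to be passed along separately.

```python
import zlib
import numpy as np

# Illustrative array: about 70% zeros, like the one described in the question.
rng = np.random.default_rng(0)
a = rng.random((200, 100))
a[rng.random(a.shape) < 0.7] = 0.0

# Fix the byte order and element size explicitly so the reader knows the layout.
raw = a.astype('<f8').tobytes()      # little-endian 64-bit floats, row-major
packed = zlib.compress(raw, 9)       # 9 = strongest (slowest) compression level

# Round trip to check that nothing was lost.
restored = np.frombuffer(zlib.decompress(packed), dtype='<f8').reshape(a.shape)

print(len(raw), len(packed))
```

The long runs of zero bytes compress well, while the non-zero doubles compress much less, so the gain depends on how sparse the real data is.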

An alternative would be to use hdf5 format: while it does not necessarily reduce the on-disk file size, it does make saving/loading faster (this is what the format was designed for, large data arrays). There are both python and C++ readers/writers for hdf5 .

Since you have a lot of zeroes, you could only write out the non-zero elements in the form (index, number).

Suppose you have an array with a small amount of nonzero numbers:

In [5]: a = np.zeros((10, 10))

In [6]: a
Out[6]: 
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [7]: a[3,1] = 2.0

In [8]: a[7,4] = 17.0

In [9]: a[9,0] = 1.5

First, isolate the interesting numbers and their indices:

In [11]: x, y = a.nonzero()

In [12]: nonzero = list(zip(x.tolist(), y.tolist()))

In [13]: nonzero
Out[13]: [(3, 1), (7, 4), (9, 0)]

Now you only have a small number of data elements left. The easiest thing is to write them to a text file:

In [17]: with open('numbers.txt', 'w+') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write('{:d} {:d} {:g}\n'.format(r, k, a[r,k]))
   ....:         

In [18]: cat numbers.txt
3 1 2
7 4 17
9 0 1.5

This also gives you an opportunity to eyeball the data. In your C++ program you can read this data with fscanf .

But you can reduce the size even more by writing binary data using struct :

In [19]: import struct

In [20]: c = struct.Struct('=IId')

In [21]: with open('numbers.bin', 'wb') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write(c.pack(r, k, a[r,k]))

The argument to the Struct constructor means: use native byte order with standard sizes and no padding ('='). The first and second data elements are unsigned integers ('I', 4 bytes each), and the third element is a double ('d', 8 bytes), so each record takes 16 bytes.

In your C++ program this data is probably best read as binary data into a packed struct .
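Before writing the C++ reader, it may help to check the record layout by reading the file back in Python. The snippet below is only a sanity check under the same `'=IId'` format; the sample entries and the temporary file location are placeholders. A C++ struct holding two `uint32_t` fields and a `double` would typically have the same 16-byte layout, but that is worth verifying with `sizeof` on your platform.

```python
import os
import struct
import tempfile

record = struct.Struct('=IId')   # uint32 row, uint32 column, float64 value

# Placeholder records matching the small example array.
entries = [(3, 1, 2.0), (7, 4, 17.0), (9, 0, 1.5)]

# Temporary location, used here only so the sketch does not clutter the working directory.
path = os.path.join(tempfile.mkdtemp(), 'numbers.bin')

with open(path, 'wb') as outf:
    for r, k, v in entries:
        outf.write(record.pack(r, k, v))

with open(path, 'rb') as inf:
    data = inf.read()

# Each record occupies record.size bytes: 4 + 4 + 8, with no padding.
decoded = list(record.iter_unpack(data))
print(record.size, decoded)
```

Note that '=' still uses the byte order of the machine running Python. If the file may move between machines, '<' (little-endian) in place of '=' pins the byte order down.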

EDIT : Answer updated for a 2D array.

numpy.ndarray.tofile and numpy.fromfile are useful for direct binary output/input from python. std::ostream::write and std::istream::read are useful for binary output/input in c++.

You should be careful about endianness if the data are transferred from one machine to another.
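One way to sidestep the endianness question is to fix the byte order in the dtype when writing. The sketch below assumes a made-up file layout (two little-endian 32-bit unsigned integers for the shape, followed by little-endian 64-bit floats in row-major order); the file name and the header are illustrative choices, not something tofile produces on its own.

```python
import os
import tempfile
import numpy as np

a = np.zeros((10, 10))
a[3, 1], a[7, 4], a[9, 0] = 2.0, 17.0, 1.5

path = os.path.join(tempfile.mkdtemp(), 'array.bin')  # placeholder file name

with open(path, 'wb') as f:
    # Header: number of rows and columns as little-endian 32-bit unsigned ints.
    np.array(a.shape, dtype='<u4').tofile(f)
    # Body: the values as little-endian 64-bit floats in row-major order.
    a.astype('<f8').tofile(f)

# Read it back the same way a C++ reader would: header first, then the values.
with open(path, 'rb') as f:
    rows, cols = np.fromfile(f, dtype='<u4', count=2)
    b = np.fromfile(f, dtype='<f8').reshape(rows, cols)
```

tofile writes raw values only, with no shape or dtype information, which is why the small header is added here. A C++ program could read the two integers with std::istream::read and then read rows * cols doubles into a buffer.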

Use an hdf5 file. They are really simple to use through h5py, and you can enable compression with a flag. Note that hdf5 also has a c++ interface.

If you don't mind installing additional packages (for both python and c++), you can use [BSON][1] (binary JSON).
