
Fastest way to read in and slice binary data files in Python

I have a processing script that is designed to pull in binary data files of type "uint16" and do various processing in chunks of 6400 values at a time. The code was originally written in Matlab, but because the analysis code is written in Python we wanted to streamline the process by having everything done in Python. The problem is that I've noticed my Python code is significantly slower than Matlab's fread function.

Simply put, the Matlab code is:

fid = fopen(filename); 
frame = reshape(fread(fid,80*80,'uint16'),80,80);  

While my Python code is simply:

from struct import unpack
import numpy as np
with open(filename, 'rb') as f:
    frame = np.array(unpack("H"*6400, f.read(12800))).reshape(80, 80).astype('float64')

The file size varies widely, from 500 MB to 400 GB, so I believe finding a faster way of parsing the data in Python could pay dividends on the larger files. A 500 MB file typically has ~50,000 chunks, and this number increases linearly with file size. The speed difference I am seeing is roughly:

Python = 4 x 10^-4 seconds / chunk

Matlab = 6.5 x 10^-5 seconds / chunk

Over the course of the processing this means Matlab is ~5x faster than the method I've implemented in Python. I have explored methods such as numpy.fromfile and numpy.memmap, but as I understand it these require reading the entire file into memory at some point, which limits their use since my binary files are quite large. Is there some Pythonic method for doing this that I am missing? I would have thought Python would be exceptionally fast at opening and reading binary files. Any advice is greatly appreciated.
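
For concreteness, the per-chunk loop I have in mind looks roughly like this; num_chunks and process_frame are just placeholders for the real bookkeeping and analysis code:

from struct import unpack
import numpy as np

def run(filename, num_chunks):
    with open(filename, 'rb') as f:
        for _ in range(num_chunks):
            raw = f.read(12800)  # one 80x80 uint16 frame = 12800 bytes
            frame = np.array(unpack("H"*6400, raw)).reshape(80, 80).astype('float64')
            process_frame(frame)  # placeholder for the per-chunk processing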

Write a chunk to a file:

In [117]: dat = np.random.randint(0,1028,80*80).astype(np.uint16)
In [118]: dat.tofile('test.dat')
In [119]: dat
Out[119]: array([266, 776, 458, ..., 519,  38, 840], dtype=uint16)

Import it your way:

In [120]: import struct
In [121]: with open('test.dat','rb') as f:
     ...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
     ...:     
In [122]: frame
Out[122]: array([266, 776, 458, ..., 519,  38, 840])

Import with fromfile:

In [124]: np.fromfile('test.dat',count=6400,dtype=np.uint16)
Out[124]: array([266, 776, 458, ..., 519,  38, 840], dtype=uint16)

Compare times:

In [125]: %%timeit
     ...: with open('test.dat','rb') as f:
     ...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
     ...: 
1000 loops, best of 3: 898 µs per loop

In [126]: timeit np.fromfile('test.dat',count=6400,dtype=np.uint16)
The slowest run took 5.41 times longe....
10000 loops, best of 3: 36.6 µs per loop

fromfile is much faster.

The time for the struct.unpack alone, without np.array, is 266 µs; for just the f.read, 23 µs. So it's the unpack plus the more general and robust np.array that take so much longer. The file read itself is not a problem. (np.array can handle many kinds of input: lists of lists, lists of objects, etc., so it has to spend more time parsing and evaluating the inputs.)
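
Those component numbers come from timing each piece on its own, along these lines (a sketch, run in IPython against the same test.dat, not the exact cells from above):

import struct
import numpy as np

def read_chunk():
    # just the open + read, no unpacking
    with open('test.dat', 'rb') as f:
        return f.read(12800)

raw = read_chunk()

%timeit read_chunk()                              # the read alone: ~23 µs, so I/O is not the bottleneck
%timeit struct.unpack("H"*6400, raw)              # unpack alone: ~266 µs
%timeit np.array(struct.unpack("H"*6400, raw))    # unpack wrapped in np.array: the bulk of the ~900 µs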

A slightly faster variant on fromfile is your read plus frombuffer:

In [133]: with open('test.dat','rb') as f:
     ...:      frame3 = np.frombuffer(f.read(12800),dtype=np.uint16)

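For your use case, where the files are far too large to read in at once, the same pattern works one chunk at a time: keep the file handle open and do f.read(12800) plus np.frombuffer (or np.fromfile(f, count=6400) on the open handle), which reads from the current position rather than pulling in the whole file. A rough sketch of such a loop, where process_frame stands in for your per-chunk analysis:

import numpy as np

def iter_frames(filename):
    # Yield 80x80 uint16 frames one at a time; only 12800 bytes are read per iteration.
    with open(filename, 'rb') as f:
        while True:
            raw = f.read(12800)
            if len(raw) < 12800:   # end of file (or a truncated final chunk)
                break
            # frombuffer returns a read-only view of raw; astype copies it to float64
            yield np.frombuffer(raw, dtype=np.uint16).reshape(80, 80).astype('float64')

for frame in iter_frames('test.dat'):
    process_frame(frame)           # placeholder for the real analysis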