
What is the fastest way to read a specific chunk of data from a large binary file in Python?

I have a sensor unit which generates data in large binary files. File sizes can run into several tens of gigabytes. I need to:

  1. Read the data.
  2. Process it to extract the necessary information that I want.
  3. Display / Visualize the data.

Data in the binary file is formatted as single-precision floats, i.e. numpy.float32.

I have written code which works well, and I am now looking to optimize it for time. I observe that reading the binary data takes a very long time. The following is what I have right now:

def get_data(n):
    '''
    Function to get relevant trace data from the data file.
    Usage :
        get_data(n)
        where n is an integer containing the relevant trace number to be read
    Return :
        data_array : Python list containing single wavelength data.
    '''
    with open(data_file, 'rb') as fid:
        data_array = list(np.fromfile(fid, np.float32)[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))])
    return data_array

This allows me to iterate over values of n and obtain different traces, i.e. chunks of data. The variable no_of_points_per_trace contains, as the name suggests, the number of points in every trace. I am obtaining it from a separate .info file, as sketched below.
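For reference, here is a minimal sketch of how that parameter might be loaded, assuming (hypothetically, since the format of the .info file is not shown here) that it contains simple key=value lines:

# Hypothetical sketch: the actual .info format is not shown in the question.
# This assumes simple "key=value" lines such as "no_of_points_per_trace=4096".
def read_info(info_file):
    params = {}
    with open(info_file) as f:
        for line in f:
            if '=' in line:
                key, value = line.split('=', 1)
                params[key.strip()] = value.strip()
    return params

no_of_points_per_trace = int(read_info('sensor.info')['no_of_points_per_trace'])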

Is there an optimal way to do this?

Right now you are reading the whole file into memory when you do np.fromfile(fid, np.float32). If that fits in memory and you want to access a significant number of traces (if you're calling your function with lots of different values for n), your only big speedup is to avoid reading it multiple times. So you might want to read the whole file once and then have your function just index into it:

# just once:
with open(data_file, 'rb') as fid:
    alldata = list(np.fromfile(fid, np.float32))

# then use this function
def get_data(alldata, n):
    return alldata[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))]
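A possible refinement not in the answer above, assuming the file holds a whole number of traces: keep the data as a numpy array instead of a list and reshape it so each trace is a row, which makes the lookup a plain row index:

import numpy as np

# Read once, keeping the data as a numpy array rather than a list,
# so slicing stays cheap (views, no copying into Python objects).
alldata = np.fromfile(data_file, np.float32)

# Assumes the total sample count is an exact multiple of
# no_of_points_per_trace; reshape raises an error otherwise.
traces = alldata.reshape(-1, no_of_points_per_trace)

trace_7 = traces[7]  # same samples as get_data(alldata, 7)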

Now, if you find yourself needing only one or two traces from the big file, you can seek to the right offset and read just the part you want:

def get_data(n):
    dtype = np.float32
    with open(data_file, 'rb') as fid:
        # jump straight to the start of trace n (itemsize bytes per sample)
        fid.seek(dtype().itemsize*no_of_points_per_trace*n)
        # read exactly one trace worth of samples
        data_array = np.fromfile(fid, dtype, count=no_of_points_per_trace)
    return data_array

You will notice I have skipped converting to a list here. That is a slow step and probably not required for your workflow.
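As a side note not covered in the answer above, numpy's memory mapping gives a similar access pattern without manual seeking; a minimal sketch, assuming the file contains nothing but contiguous float32 traces:

import numpy as np

# Sketch of an alternative: a memory map exposes the file as a
# lazily-paged array, so only the slices you touch are read from disk.
mm = np.memmap(data_file, dtype=np.float32, mode='r')

def get_data(n):
    # Slicing the memmap reads just that region of the file.
    return mm[n*no_of_points_per_trace:(n+1)*no_of_points_per_trace]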
