What is the fastest way to read a specific chunk of data from a large Binary file in Python
I have a sensor unit which generates data in large binary files. File sizes can run into several tens of gigabytes. I need to read specific traces (chunks of data) from these files.
Data in the binary file is formatted as single-precision floats, i.e. numpy.float32.
I have written code which works well, and I am now looking to optimize it for speed. I observe that reading the binary data takes a very long time. The following is what I have right now:
def get_data(n):
    '''
    Function to get relevant trace data from the data file.

    Usage:
        get_data(n)
    where n is an integer giving the trace number to be read.

    Return:
        data_array : Python list containing single-wavelength data.
    '''
    with open(data_file, 'rb') as fid:
        data_array = list(np.fromfile(fid, np.float32)[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))])
    return data_array
This allows me to iterate over the value of n and obtain different traces, i.e. chunks of data. As the name suggests, the variable no_of_points_per_trace contains the number of points in every trace; I am obtaining this from a separate .info file.
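(The question doesn't show the .info format. As a purely hypothetical sketch, assuming the file stores key = value lines with a made-up points_per_trace key, the lookup might read:)

    # hypothetical sketch: the actual .info format is not shown in the question
    def read_points_per_trace(info_file):
        with open(info_file) as f:
            for line in f:
                key, _, value = line.partition('=')
                if key.strip() == 'points_per_trace':
                    return int(value.strip())
        raise ValueError('points_per_trace not found in ' + info_file)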
Is there an optimal way to do this?
Right now you are reading the whole file into memory when you do np.fromfile(fid, np.float32). If that fits in memory and you want to access a significant number of traces (i.e. you're calling your function with lots of different values of n), your only big speedup is to avoid reading the file multiple times. So perhaps you might want to read the whole file once and then have your function just index into it:
# just once:
with open(data_file, 'rb') as fid:
    alldata = np.fromfile(fid, np.float32)

# then use this function
def get_data(alldata, n):
    return alldata[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))]
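For illustration (this usage isn't in the original answer), pulling several traces then costs only array slicing rather than repeated file reads:

    # hypothetical usage: the file is read once, each trace is a cheap slice
    traces = [get_data(alldata, n) for n in range(10)]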
Now, if you find yourself needing only one or two traces from the big file, you can seek into it and read just the part you want:
def get_data(n):
    dtype = np.float32
    with open(data_file, 'rb') as fid:
        # skip n traces' worth of bytes, then read exactly one trace
        fid.seek(dtype().itemsize * no_of_points_per_trace * n)
        data_array = np.fromfile(fid, dtype, count=no_of_points_per_trace)
    return data_array
You will notice I have skipped converting the result to a list. That is a slow step and probably not required for your workflow.
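As a further sketch (not part of the original answer), np.memmap gives the same seek-and-read behaviour with less bookkeeping: the OS pages in only the portion of the file that a slice actually touches. This assumes the same data_file and no_of_points_per_trace globals as above:

    import numpy as np

    # map the file read-only; nothing is loaded until a slice is accessed
    mm = np.memmap(data_file, dtype=np.float32, mode='r')

    def get_data(n):
        # np.array(...) copies the trace out of the map into ordinary memory
        return np.array(mm[n*no_of_points_per_trace:(n+1)*no_of_points_per_trace])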