What is the fastest way to read a specific chunk of data from a large Binary file in Python
I have a sensor unit which generates data in large binary files. File sizes can run into several tens of gigabytes. I need to read specific traces (chunks of data) from these files.
Data in the binary file is formatted as single-precision floats, i.e. numpy.float32.
I have written code which works well, and I am now looking to optimize it for speed. I observe that reading the binary data takes a very long time. The following is what I have right now:
def get_data(n):
    '''
    Function to get relevant trace data from the data file.

    Usage:
        get_data(n)
    where n is an integer giving the trace number to be read.

    Return:
        data_array : Python list containing single-wavelength data.
    '''
    with open(data_file, 'rb') as fid:
        data_array = list(np.fromfile(fid, np.float32)[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))])
    return data_array
This allows me to iterate over the value of n and obtain different traces, i.e. chunks of data. As the name suggests, the variable no_of_points_per_trace contains the number of points in every trace; I am obtaining this from a separate .info file.
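(The question doesn't show the .info format. As a purely hypothetical sketch, assuming the file stores key = value lines with a made-up points_per_trace key, the lookup might read:)

    # hypothetical sketch: the actual .info format is not shown in the question
    def read_points_per_trace(info_file):
        with open(info_file) as f:
            for line in f:
                key, _, value = line.partition('=')
                if key.strip() == 'points_per_trace':
                    return int(value.strip())
        raise ValueError('points_per_trace not found in ' + info_file)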
Is there an optimal way to do this?
Right now you are reading the whole file into memory when you do np.fromfile(fid, np.float32). If that fits in memory and you want to access a significant number of traces (i.e. you're calling your function with lots of different values of n), your only big speedup is to avoid reading the file multiple times. So perhaps you might want to read the whole file once and then have your function just index into it:
# just once:
with open(data_file, 'rb') as fid:
    alldata = np.fromfile(fid, np.float32)

# then use this function
def get_data(alldata, n):
    return alldata[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))]
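For illustration (this usage isn't in the original answer), pulling several traces then costs only array slicing rather than repeated file reads:

    # hypothetical usage: the file is read once, each trace is a cheap slice
    traces = [get_data(alldata, n) for n in range(10)]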
Now, if you find yourself needing only one or two traces from the big file, you can seek into it and read just the part you want:
def get_data(n):
    dtype = np.float32
    with open(data_file, 'rb') as fid:
        # skip n traces' worth of bytes, then read exactly one trace
        fid.seek(dtype().itemsize * no_of_points_per_trace * n)
        data_array = np.fromfile(fid, dtype, count=no_of_points_per_trace)
    return data_array
You will notice I have skipped converting the result to a list. That is a slow step and probably not required for your workflow.
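As a further sketch (not part of the original answer), np.memmap gives the same seek-and-read behaviour with less bookkeeping: the OS pages in only the portion of the file that a slice actually touches. This assumes the same data_file and no_of_points_per_trace globals as above:

    import numpy as np

    # map the file read-only; nothing is loaded until a slice is accessed
    mm = np.memmap(data_file, dtype=np.float32, mode='r')

    def get_data(n):
        # np.array(...) copies the trace out of the map into ordinary memory
        return np.array(mm[n*no_of_points_per_trace:(n+1)*no_of_points_per_trace])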