
What is the fastest way to read a specific chunk of data from a large Binary file in Python

I have a sensor unit which generates data in large binary files. File sizes can run into several tens of gigabytes. I need to:

  1. Read the data.
  2. Process it to extract the necessary information that I want.
  3. Display / Visualize the data.

Data in the binary file is formatted as single-precision float, i.e. numpy.float32.

I have written code which works well, and I am now looking to optimize it for time. I observe that reading the binary data takes a very long time. The following is what I have right now:

import numpy as np

def get_data(n):
    '''
    Function to get relevant trace data from the data file.
    Usage :
        get_data(n)
        where n is an integer giving the relevant trace number to be read
    Return :
        data_array : Python list containing single wavelength data.
    '''
    with open(data_file, 'rb') as fid:
        data_array = list(np.fromfile(fid, np.float32)[n*no_of_points_per_trace:(no_of_points_per_trace*(n+1))])
    return data_array

This allows me to iterate over the value of n and obtain different traces, i.e. chunks of data. As the name suggests, the variable no_of_points_per_trace contains the number of points in every trace; I am obtaining this from a separate .info file.
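For example, a typical processing loop would call the function once per trace, which re-reads the entire multi-gigabyte file on every call (num_traces and process are hypothetical stand-ins here, assuming the trace count also comes from the .info file):

for n in range(num_traces):
    trace = get_data(n)    # re-opens and re-reads the whole file each time
    process(trace)         # placeholder for the extraction / visualization steps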

Is there an optimal way to do this?

Right now you are reading the whole file into memory when you do np.fromfile(fid, np.float32). If that fits and you want to access a significant number of traces (if you're calling your function with lots of different values for n), your only big speedup is to avoid reading the file multiple times. So perhaps you might want to read the whole file once and then have your function just index into that:

# just once:
with open(data_file, 'rb') as fid:
    alldata = np.fromfile(fid, np.float32)

# then use this function
def get_data(alldata, n):
    return alldata[n*no_of_points_per_trace:no_of_points_per_trace*(n+1)]
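If every trace has the same length and the whole file fits in memory, a slightly tidier variant of the same idea is to reshape the flat array once so that row i is trace i (a sketch, assuming the total sample count is an exact multiple of no_of_points_per_trace):

# reshape returns a view of alldata, so no data is copied
all_traces = alldata.reshape(-1, no_of_points_per_trace)
trace_n = all_traces[n]    # same values as get_data(alldata, n)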

Now, if you find yourself needing only one or two traces from the big file, you can seek into it and just read the part you want:

def get_data(n):
    dtype = np.float32
    with open(data_file, 'rb') as fid:
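        # jump straight to the start of trace n; float32 is 4 bytes per point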
        fid.seek(dtype().itemsize*no_of_points_per_trace*n)
        data_array = np.fromfile(fid, dtype, count=no_of_points_per_trace)
    return data_array

You will notice I have skipped converting to list. This is a slow step and probably not required for your workflow.
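If you routinely need many scattered traces but cannot afford to load everything, numpy.memmap gives you the same seek-and-read behaviour implicitly, reading only the slices you touch. This is a minimal sketch, assuming the file contains nothing but contiguous float32 samples:

import numpy as np

# map the file without reading it; the OS pages in only the regions accessed
mm = np.memmap(data_file, dtype=np.float32, mode='r')

def get_data(n):
    # slicing the memmap pulls just that region from disk;
    # np.array() copies it into ordinary memory
    return np.array(mm[n*no_of_points_per_trace:(n+1)*no_of_points_per_trace])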
