
Fastest way to read in and slice binary data files in Python

I have a processing script that is designed to pull in binary data files of type "uint16" and do various processing in chunks of 6400 at a time. The code was originally written in Matlab, but because the analysis codes are written in Python we wanted to streamline the process by having everything done in Python. The problem is that I've noticed my Python code is significantly slower than Matlab's fread function.

Simply put, the Matlab code is this:

fid = fopen(filename); 
frame = reshape(fread(fid,80*80,'uint16'),80,80);  

While my Python code is simply:

import numpy as np
from struct import unpack
with open(filename, 'rb') as f:
    frame = np.array(unpack("H"*6400, f.read(12800))).reshape(80, 80).astype('float64')

The file size varies heavily, from 500 MB to 400 GB, so I believe finding a faster way of parsing the data in Python could pay dividends on the larger files. A 500 MB file typically has ~50,000 chunks, and this number increases linearly with file size. The speed difference I am seeing is roughly:

Python = 4 x 10^-4 seconds / chunk

Matlab = 6.5 x 10^-5 seconds / chunk

The processing shows that over time Matlab is ~5x faster than the Python method I've implemented. I have explored methods such as numpy.fromfile and numpy.memmap, but because these methods require opening the entire file into memory at some point, they limit the use case, as my binary files are quite large. Is there some pythonic method for doing this that I am missing? I would have thought Python would be exceptionally fast at opening and reading binary files. Any advice is greatly appreciated.
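For context, the per-chunk loop I am timing looks roughly like this (a simplified sketch; the actual per-frame analysis is omitted):

import numpy as np
from struct import unpack

with open(filename, 'rb') as f:
    while True:
        raw = f.read(12800)       # one 80x80 frame = 6400 uint16 values = 12800 bytes
        if len(raw) < 12800:      # end of file (or a trailing partial chunk)
            break
        frame = np.array(unpack("H" * 6400, raw)).reshape(80, 80).astype('float64')
        # ... per-chunk processing happens here ...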

Write a chunk to a file:

In [117]: dat = np.random.randint(0,1028,80*80).astype(np.uint16)
In [118]: dat.tofile('test.dat')
In [119]: dat
Out[119]: array([266, 776, 458, ..., 519,  38, 840], dtype=uint16)

Import it your way:

In [120]: import struct
In [121]: with open('test.dat','rb') as f:
     ...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
     ...:     
In [122]: frame
Out[122]: array([266, 776, 458, ..., 519,  38, 840])

Import with fromfile:

In [124]: np.fromfile('test.dat',count=6400,dtype=np.uint16)
Out[124]: array([266, 776, 458, ..., 519,  38, 840], dtype=uint16)
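Note that fromfile also accepts an already-open file object and advances the file position as it reads, so successive calls pull successive chunks without ever loading the whole file, which addresses the memory concern raised in the question. A sketch, with filename standing in for one of the large multi-chunk files described there:

with open(filename, 'rb') as f:
    first = np.fromfile(f, count=6400, dtype=np.uint16)   # reads only 12800 bytes
    second = np.fromfile(f, count=6400, dtype=np.uint16)  # continues at byte 12800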

Compare times:

In [125]: %%timeit
     ...: with open('test.dat','rb') as f:
     ...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
     ...: 
1000 loops, best of 3: 898 µs per loop

In [126]: timeit np.fromfile('test.dat',count=6400,dtype=np.uint16)
The slowest run took 5.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 36.6 µs per loop

fromfile is much faster.

Time for the struct.unpack alone, without np.array, is 266 µs; for just the f.read, 23 µs. So it's the unpack plus the more general and robust np.array that takes so much longer. The file read itself is not a problem. (np.array can handle many kinds of input: lists of lists, lists of objects, etc., so it has to spend more time parsing and evaluating the inputs.)
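Those component timings can be measured with something like the following (a sketch in the same session style; exact numbers will vary by machine):

In [127]: with open('test.dat','rb') as f: raw = f.read(12800)
In [128]: timeit struct.unpack("H"*6400, raw)        # the unpack alone, ~266 µs per loop
In [129]: timeit open('test.dat','rb').read(12800)   # the raw read alone, ~23 µs per loop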

A slightly faster variant on fromfile is your read plus frombuffer:

In [133]: with open('test.dat','rb') as f:
     ...:      frame3 = np.frombuffer(f.read(12800),dtype=np.uint16)
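Applied to the original problem, the same read-plus-frombuffer pattern can be looped so that only one 12800-byte chunk is ever in memory at a time. A sketch, using a hypothetical iter_frames helper (frombuffer returns a read-only view, and astype makes the writable float64 copy that the question's code produced):

import numpy as np

def iter_frames(filename, chunk_bytes=12800):
    # yield successive 80x80 float64 frames from a raw uint16 file
    with open(filename, 'rb') as f:
        while True:
            raw = f.read(chunk_bytes)
            if len(raw) < chunk_bytes:   # stop at end of file
                return
            yield np.frombuffer(raw, dtype=np.uint16).reshape(80, 80).astype('float64')

Each yielded frame is an independent copy, so downstream code can modify it freely.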
