
Fastest way to read a binary file with a defined format?

I have large binary data files with a predefined format, originally written by a Fortran program in little-endian byte order. I would like to read these files in the fastest, most efficient manner, so using the array package seemed right up my alley, as suggested in Improve speed of reading and converting from binary file?

The problem is that the predefined format is non-homogeneous. It looks something like this: ['<2i','<5d','<2i','<d','<i','<3d','<2i','<3d','<i','<d','<i','<3d']

with each integer i taking up 4 bytes, and each double d taking 8 bytes.

Is there a way I can still use the super-efficient array package (or another suggestion), but with the right format?

Use struct. In particular, struct.unpack.

result = struct.unpack("<2i5d...", buffer)

Here buffer holds the given binary data.
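
For instance, assuming the record layout given in the question, the full format string and the per-record size can be derived from the list of sub-formats (a minimal sketch; the input file name is assumed):

import struct

# Concatenate the sub-formats from the question into one little-endian format string.
sub_formats = ['2i', '5d', '2i', 'd', 'i', '3d', '2i', '3d', 'i', 'd', 'i', '3d']
record_fmt = '<' + ''.join(sub_formats)     # '<2i5d2idi3d2i3didi3d'
record_size = struct.calcsize(record_fmt)   # 9 ints * 4 + 16 doubles * 8 = 164 bytes

with open('input', 'rb') as f:              # assumed file name
    buffer = f.read(record_size)            # one record's worth of bytes
values = struct.unpack(record_fmt, buffer)  # a tuple of 9 ints and 16 doubles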

It's not clear from your question whether you're concerned about the actual file-reading speed (and building a data structure in memory), or about later data-processing speed.

If you are reading only once and doing heavy processing later, you can read the file record by record (if your binary data is a recordset of repeated records with identical format), parse it with struct.unpack, and append it to a [double] array:

import array
import struct
from functools import partial

data = array.array('d')
record_size_in_bytes = 9*4 + 16*8   # 9 ints + 16 doubles

with open('input', 'rb') as fin:
    for record in iter(partial(fin.read, record_size_in_bytes), b''):
        values = struct.unpack("<2i5d2idi3d2i3didi3d", record)  # full format from the question
        data.extend(values)

This assumes you are allowed to cast all your ints to doubles and are willing to accept the increase in allocated memory: for the record from your question, 9 ints × 4 bytes + 16 doubles × 8 bytes = 164 bytes on disk become 25 doubles × 8 bytes = 200 bytes in memory, about a 22% increase.

If you are reading the data from the file many times, it could be worthwhile to convert everything to one large array of doubles (like above) and write it back to another file, from which you can later read it with array.fromfile():

import array
import os

data = array.array('d')
with open('preprocessed', 'rb') as fin:
    n = os.fstat(fin.fileno()).st_size // 8
    data.fromfile(fin, n)
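
For completeness, the preprocessing step that writes the 'preprocessed' file is just array.tofile() on the data array built by the first listing (a minimal sketch):

with open('preprocessed', 'wb') as fout:
    data.tofile(fout)  # dump the raw doubles back out (native byte order)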

Update. Thanks to a nice benchmark by @martineau, we now know for a fact that preprocessing the data and turning it into a homogeneous array of doubles makes loading such data from file (with array.fromfile()) ~20x to ~40x faster than reading it record by record, unpacking, and appending to an array (as shown in the first code listing above).

A faster (and more standard) variation of record-by-record reading from @martineau's answer, which appends tuples to a list and doesn't upcast to double (see the "Read and construct piecemeal with struct" listing below), is only ~6x to ~10x slower than the array.fromfile() method and seems like a better reference benchmark.

Major Update: Modified to use proper code for reading a preprocessed array file (function using_preprocessed_file() below), which dramatically changed the results.

To determine which method is faster in Python (using only built-ins and the standard libraries), I created a script to benchmark (via timeit) the different techniques that could be used to do this. It's a bit on the long side, so to avoid distraction, I'm only posting the code tested and the related results. (If there's sufficient interest in the methodology, I'll post the whole script.)

Here are the snippets of code that were compared:

@TESTCASE('Read and construct piecemeal with struct')
def read_file_piecemeal():
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        size = fmt1.size
        while True:
            buffer = inp.read(size)
            if len(buffer) != size:  # EOF?
                break
            structures.append(fmt1.unpack(buffer))
    return structures

@TESTCASE('Read all-at-once, then slice and struct')
def read_entire_file():
    offset, unpack, size = 0, fmt1.unpack, fmt1.size
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        buffer = inp.read()  # read entire file
        while True:
            chunk = buffer[offset: offset+size]
            if len(chunk) != size:  # EOF?
                break
            structures.append(unpack(chunk))
            offset += size

    return structures

@TESTCASE('Convert to array (@randomir part 1)')
def convert_to_array():
    data = array.array('d')
    record_size_in_bytes = 9*4 + 16*8   # 9 ints + 16 doubles (standard sizes)

    with open(test_filenames[0], 'rb') as fin:
        for record in iter(partial(fin.read, record_size_in_bytes), b''):
            values = struct.unpack("<2i5d2idi3d2i3didi3d", record)
            data.extend(values)

    return data

@TESTCASE('Read array file (@randomir part 2)', setup='create_preprocessed_file')
def using_preprocessed_file():
    data = array.array('d')
    with open(test_filenames[1], 'rb') as fin:
        n = os.fstat(fin.fileno()).st_size // 8
        data.fromfile(fin, n)
    return data

def create_preprocessed_file():
    """ Save array created by convert_to_array() into a separate test file. """
    test_filename = test_filenames[1]
    if not os.path.isfile(test_filename):  # doesn't already exist?
        data = convert_to_array()
        with open(test_filename, 'wb') as file:
            data.tofile(file)

And here were the results of running them on my system:

Fastest to slowest execution speeds using Python 3.6.1
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes

     Read array file (@randomir part 2): 0.06430 secs, relative  1.00x (   0.00% slower)
Read all-at-once, then slice and struct: 0.39634 secs, relative  6.16x ( 516.36% slower)
Read and construct piecemeal with struct: 0.43283 secs, relative  6.73x ( 573.09% slower)
    Convert to array (@randomir part 1): 1.38310 secs, relative 21.51x (2050.87% slower)

Interestingly, most of the snippets are actually faster in Python 2...

Fastest to slowest execution speeds using Python 2.7.13
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes

     Read array file (@randomir part 2): 0.03586 secs, relative  1.00x (   0.00% slower)
Read all-at-once, then slice and struct: 0.27871 secs, relative  7.77x ( 677.17% slower)
Read and construct piecemeal with struct: 0.40804 secs, relative 11.38x (1037.81% slower)
    Convert to array (@randomir part 1): 1.45830 secs, relative 40.66x (3966.41% slower)

Take a look at the documentation for numpy's fromfile function: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html and https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes-constructing

Simplest example:

import numpy as np
data = np.fromfile('binary_file', dtype=np.dtype('<i8, ...'))

Read more about "Structured Arrays" in numpy and how to specify their data type(s) here: https://docs.scipy.org/doc/numpy/user/basics.rec.html
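
For the record layout in the question, a structured dtype could look like this (a sketch; the field names are made up for illustration):

import numpy as np

# One named field per group in the question's format list ('<2i', '<5d', ...).
record_dtype = np.dtype([
    ('ints1',    '<i4', (2,)),   # '<2i'
    ('doubles1', '<f8', (5,)),   # '<5d'
    ('ints2',    '<i4', (2,)),   # '<2i'
    ('double1',  '<f8'),         # '<d'
    ('int1',     '<i4'),         # '<i'
    ('doubles2', '<f8', (3,)),   # '<3d'
    ('ints3',    '<i4', (2,)),   # '<2i'
    ('doubles3', '<f8', (3,)),   # '<3d'
    ('int2',     '<i4'),         # '<i'
    ('double2',  '<f8'),         # '<d'
    ('int3',     '<i4'),         # '<i'
    ('doubles4', '<f8', (3,)),   # '<3d'
])

data = np.fromfile('binary_file', dtype=record_dtype)  # one element per record
# e.g. data['doubles1'] is then an (N, 5) array holding that block from every record.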

There are a lot of good and helpful answers here, but I think the best solution needs more explaining. I implemented a method that reads the entire data file in one pass using the built-in read() and constructs a numpy ndarray at the same time. This is more efficient than reading the data and constructing the array separately, but it's also a bit more finicky.

import numpy as np

line_cols = 20                   # for example
line_rows = 40000                # for example
data_fmt = 15*'f8,' + 5*'f4,'    # for example (15 8-byte doubles + 5 4-byte floats)
data_bsize = 15*8 + 5*4          # bytes per record, matching data_fmt
with open(filename, 'rb') as f:  # filename: path to your binary file
    # Map the raw bytes onto a structured ndarray, take the single row,
    # upcast every field to f8, view as a flat array of doubles,
    # reshape to (rows, cols), and slice off the last column.
    data = (np.ndarray(shape=(1, line_rows),
                       dtype=np.dtype(data_fmt),
                       buffer=f.read(line_rows*data_bsize))[0]
            .astype(line_cols*'f8,')
            .view(dtype='f8')
            .reshape(line_rows, line_cols)[:, :-1])

Here, we open the file as a binary file using the 'rb' option of open(). Then we construct our ndarray with the proper shape and dtype to fit our read buffer. We then reduce the ndarray to a 1D array by taking its zeroth index, where all our data is hiding. Then we reshape the array using the astype, view, and reshape methods. This is because reshape doesn't like data with mixed dtypes, and I'm okay with having my integers expressed as doubles.

This method is ~100x faster than looping through the data line by line, and could potentially be compressed into a single line of code.

In the future, I may try to read the data in even faster using a Fortran script that essentially converts the binary file into a text file. I don't know if this will be faster, but it may be worth a try.
