How to read consecutive arrays from a binary file using `np.fromfile`?

Question

I want to read a binary file in Python whose exact layout is stored in the file itself.

The file contains a sequence of two-dimensional arrays, with the row and column dimensions of each array stored as a pair of integers preceding its contents. I want to successively read all of the arrays contained within the file.

I know this can be done with f = open("myfile", "rb") and f.read(numberofbytes), but this is quite clumsy because I would then need to convert the output into meaningful data structures. I would like to use numpy's np.fromfile with a custom dtype, but have not found a way to read part of the file, leave it open, and then continue reading with a modified dtype.

I know I can call f.seek(numberofbytes, os.SEEK_SET) and np.fromfile multiple times, but this would mean a lot of unnecessary jumping around in the file.

In short, I want MATLAB's fread (or at least something like C++ ifstream read).

What is the best way to do this?

Answer 1

You can pass an open file object to np.fromfile, read the dimensions of the first array, then read the array contents (again using np.fromfile), and repeat the process for additional arrays within the same file. Each call reads from the current file position and leaves the position just past the data it consumed, so no explicit seeking is needed.

For example:

import numpy as np
import os

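# Note: the np.int alias was removed in NumPy 1.24. np.int_ is NumPy's default
# integer type, i.e. what np.array(x.shape).tofile(f) writes on the same platform.
def iter_arrays(fname, array_ndim=2, dim_dtype=np.int_, array_dtype=np.double):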

    with open(fname, 'rb') as f:
        fsize = os.fstat(f.fileno()).st_size

        # while we haven't yet reached the end of the file...
        while f.tell() < fsize:

            # get the dimensions for this array
            dims = np.fromfile(f, dim_dtype, array_ndim)

            # get the array contents
            yield np.fromfile(f, array_dtype, np.prod(dims)).reshape(dims)

Example usage:

# write some random arrays to an example binary file
x = np.random.randn(100, 200)
y = np.random.randn(300, 400)

with open('/tmp/testbin', 'wb') as f:
    np.array(x.shape).tofile(f)
    x.tofile(f)
    np.array(y.shape).tofile(f)
    y.tofile(f)

# read the contents back
x1, y1 = iter_arrays('/tmp/testbin')

# check that they match the input arrays
assert np.allclose(x, x1) and np.allclose(y, y1)
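
The example above relies on the platform's default integer type for the stored dimensions, so a file written on one machine may not read back correctly on another. A minimal sketch of the same read loop with fixed-width dtypes follows. It assumes the file stores dimensions as little-endian 64-bit integers ('<i8') and data as little-endian 64-bit floats ('<f8'); the names iter_arrays_fixed and write_arrays_fixed are hypothetical and not part of the original answer.

```python
import os

import numpy as np


def write_arrays_fixed(fname, arrays, dim_dtype="<i8", array_dtype="<f8"):
    """Write each array as (dims, data) using explicit, fixed-width dtypes."""
    with open(fname, "wb") as f:
        for a in arrays:
            a = np.asarray(a, dtype=array_dtype)
            np.asarray(a.shape, dtype=dim_dtype).tofile(f)
            a.tofile(f)  # tofile always writes in C (row-major) order


def iter_arrays_fixed(fname, array_ndim=2, dim_dtype="<i8", array_dtype="<f8"):
    """Yield arrays from a file of (dims, data) records (assumed layout)."""
    with open(fname, "rb") as f:
        fsize = os.fstat(f.fileno()).st_size
        while f.tell() < fsize:
            # read the dimensions, then convert them to plain Python ints
            dims = np.fromfile(f, dtype=dim_dtype, count=array_ndim)
            shape = tuple(int(d) for d in dims)
            # read exactly as many elements as the shape requires
            count = int(np.prod(shape))
            yield np.fromfile(f, dtype=array_dtype, count=count).reshape(shape)
```

Passing the dtypes as strings with an explicit byte order keeps the on-disk layout independent of the machine that wrote the file, at the cost of having to agree on that layout up front.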

If the arrays are large, you might consider using np.memmap with the offset= parameter in place of np.fromfile, so that the contents of the arrays are memory-mapped rather than loaded into RAM.
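
One way the np.memmap variant could look is sketched below. It assumes the dimensions are stored as 64-bit integers and the data as 64-bit floats in native byte order, and that no array is empty; iter_memmaps is a hypothetical name. The byte offset of each array is tracked by hand, since np.memmap takes a filename and an offset rather than continuing from an open file's position.

```python
import os

import numpy as np


def iter_memmaps(fname, array_ndim=2, dim_dtype=np.int64, array_dtype=np.float64):
    """Yield read-only memory-mapped views of each array in the file."""
    dim_dtype = np.dtype(dim_dtype)
    array_dtype = np.dtype(array_dtype)
    fsize = os.path.getsize(fname)
    offset = 0  # byte offset of the next (dims, data) record

    with open(fname, "rb") as f:
        while offset < fsize:
            # the dimensions are small, so read them into memory directly
            f.seek(offset)
            dims = np.fromfile(f, dtype=dim_dtype, count=array_ndim)
            shape = tuple(int(d) for d in dims)
            offset += array_ndim * dim_dtype.itemsize

            # map the array contents instead of loading them into RAM
            yield np.memmap(fname, dtype=array_dtype, mode="r",
                            offset=offset, shape=shape)
            offset += int(np.prod(shape)) * array_dtype.itemsize
```

Each yielded object behaves like an ndarray, but its data is only paged in from disk when it is accessed.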

Answer 1
4 · accepted · 2015-07-04 00:04:04
