
Reading a binary file with python

I find reading binary files with Python particularly difficult. Can you give me a hand? I need to read this file, which in Fortran 90 is easily read by

int*4 n_particles, n_groups
real*4 group_id(n_particles)
read (*) n_particles, n_groups
read (*) (group_id(j),j=1,n_particles)

In detail, the file format is:

Bytes 1-4 -- The integer 8.
Bytes 5-8 -- The number of particles, N.
Bytes 9-12 -- The number of groups.
Bytes 13-16 -- The integer 8.
Bytes 17-20 -- The integer 4*N.
Next many bytes -- The group ID numbers for all the particles.
Last 4 bytes -- The integer 4*N. 

How can I read this with Python? I tried everything but it never worked. Is there any chance I could use an f90 program in Python, reading this binary file and then saving the data that I need to use?

Read the binary file content like this:

with open(fileName, mode='rb') as file: # b is important -> binary
    fileContent = file.read()

then "unpack" binary data using struct.unpack :然后使用struct.unpack “解压”二进制数据:

The start bytes: struct.unpack("iiiii", fileContent[:20])

The body: ignore the heading bytes and the trailing bytes (24 in total); the remaining part forms the body. To get the number of integers in the body, do an integer division of its length by 4; the obtained quotient is then multiplied by the string 'i' to create the correct format for the unpack method:

struct.unpack("i" * ((len(fileContent) -24) // 4), fileContent[20:-4])

The end bytes: struct.unpack("i", fileContent[-4:])
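
Putting those three pieces together, a minimal end-to-end sketch for the layout described in the question might look like this (assuming native byte order and, as in the format strings above, that the group IDs are stored as 4-byte integers):

import struct

with open(fileName, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()

# Header: the marker 8, N, the number of groups, the marker 8, the marker 4*N
eight, n_particles, n_groups, eight_again, body_bytes = struct.unpack("iiiii", fileContent[:20])

# Body: one 4-byte value per particle
group_ids = struct.unpack("i" * ((len(fileContent) - 24) // 4), fileContent[20:-4])

# Trailing marker: 4*N again
(trailing,) = struct.unpack("i", fileContent[-4:])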

In general, I would recommend that you look into using Python's struct module for this. It's standard with Python, and it should be easy to translate your question's specification into a formatting string suitable for struct.unpack().

Do note that if there's "invisible" padding between/around the fields, you will need to figure that out and include it in the unpack() call, or you will read the wrong bits.
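
For illustration, padding can be expressed directly in the format string with 'x' pad bytes, which consume input without producing a value. This is a hypothetical layout, not the one from the question:

import struct

# Hypothetical layout: a 4-byte int, 4 bytes of padding, then an 8-byte double.
packed = struct.pack('<i4xd', 7, 2.5)
number, value = struct.unpack('<i4xd', packed)  # the pad bytes are skipped, not returned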

Reading the contents of the file in order to have something to unpack is pretty trivial:

import struct

data = open("from_fortran.bin", "rb").read()

(eight, N) = struct.unpack("@II", data[:8])

This unpacks the first two fields, assuming they start at the very beginning of the file (no padding or extraneous data), and also assuming native byte order (the @ symbol). The I characters in the formatting string mean "unsigned integer, 32 bits".

To read a binary file to a bytes object:

from pathlib import Path
data = Path('/path/to/file').read_bytes()  # Python 3.5+

To create an int from bytes 0-3 of the data:

i = int.from_bytes(data[:4], byteorder='little', signed=False)

To unpack multiple ints from the data:

import struct
ints = struct.unpack('iiii', data[:16])

You could use numpy.fromfile, which can read data from both text and binary files. You would first construct a data type that represents your file format using numpy.dtype, and then read this type from the file using numpy.fromfile.
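
A minimal sketch of that approach for the layout in the question (the file name is a placeholder; it assumes 4-byte record markers, native byte order, and group IDs stored as float32 to match the Fortran real*4 declaration):

import numpy as np

# Structured dtype for the 20-byte header: marker, N, number of groups, marker, marker.
header_type = np.dtype([('marker1', np.int32),
                        ('n_particles', np.int32),
                        ('n_groups', np.int32),
                        ('marker2', np.int32),
                        ('marker3', np.int32)])

with open('from_fortran.bin', 'rb') as fh:
    header = np.fromfile(fh, dtype=header_type, count=1)[0]
    group_ids = np.fromfile(fh, dtype=np.float32, count=header['n_particles'])
    np.fromfile(fh, dtype=np.int32, count=1)  # trailing 4*N record marker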

I too found Python lacking when it comes to reading and writing binary files, so I wrote a small module (for Python 3.6+).

With binaryfile you'd do something like this (I'm guessing, since I don't know Fortran):

import binaryfile

def particle_file(f):
    f.array('group_ids')  # Declare group_ids to be an array (so we can use it in a loop)
    f.skip(4)  # Bytes 1-4
    num_particles = f.count('num_particles', 'group_ids', 4)  # Bytes 5-8
    f.int('num_groups', 4)  # Bytes 9-12
    f.skip(8)  # Bytes 13-20
    for i in range(num_particles):
        f.struct('group_ids', '>f')  # 4 bytes x num_particles
    f.skip(4)

with open('myfile.bin', 'rb') as fh:
    result = binaryfile.read(fh, particle_file)
print(result)

Which produces output like this:

{
    'group_ids': [(1.0,), (0.0,), (2.0,), (0.0,), (1.0,)],
    '__skipped': [b'\x00\x00\x00\x08', b'\x00\x00\x00\x08\x00\x00\x00\x14', b'\x00\x00\x00\x14'],
    'num_particles': 5,
    'num_groups': 3
}

I used skip() to skip the additional data Fortran adds, but you may want to add a utility to handle Fortran records properly instead. If you do, a pull request would be welcome.
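
One possible shape for such a utility, sketched here with the standard struct module rather than as part of binaryfile (it assumes the common case of 4-byte record markers in native byte order):

import struct

def read_fortran_record(fh):
    """Read one Fortran unformatted sequential record: a 4-byte length,
    the payload, then the same 4-byte length repeated."""
    header = fh.read(4)
    if not header:
        return None  # end of file
    (length,) = struct.unpack('i', header)
    payload = fh.read(length)
    (trailer,) = struct.unpack('i', fh.read(4))
    if trailer != length:
        raise ValueError('mismatched Fortran record markers')
    return payload

Each call returns the raw bytes of one record, which can then be passed to struct.unpack as shown in the other answers.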

If the data is array-like, I like to use numpy.memmap to load it.

Here's an example that loads 1000 samples from 64 channels, stored as two-byte integers.

import numpy as np
mm = np.memmap(filename, np.int16, 'r', shape=(1000, 64))

You can then slice the data along either axis:

mm[5, :] # sample 5, all channels
mm[:, 5] # all samples, channel 5

All the usual formats are available, including C- and Fortran-order, various dtypes and endianness, etc.
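
For instance, a sketch of the same mapping for a file written as big-endian 16-bit integers in Fortran (column-major) order would be:

mm_f = np.memmap(filename, dtype='>i2', mode='r', shape=(1000, 64), order='F')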

Some advantages of this approach:

  • No data is loaded into memory until you actually use it (that's what a memmap is for).
  • More intuitive syntax (no need to generate a struct.unpack string consisting of 64,000 characters).
  • Data can be given any shape that makes sense for your application.

For non-array data (e.g., compiled code), heterogeneous formats ("10 chars, then 3 ints, then 5 floats, ..."), or similar, one of the other approaches given above probably makes more sense.

#!/usr/bin/python

# Read five float values ('f' items) from a binary file into an array.array.
import array

data = array.array('f')
with open('c:\\code\\c_code\\no1.dat', 'rb') as f:
    data.fromfile(f, 5)
print(data)

# If the file contains pickled Python objects, load them one at a time
# until the end of the file is reached.
import pickle

with open("filename.dat", "rb") as f:
    try:
        while True:
            x = pickle.load(f)
            print(x)
    except EOFError:
        pass
