Reading a binary file with Python
I find reading binary files with Python particularly difficult. Can you give me a hand? I need to read this file, which in Fortran 90 is easily read by
int*4 n_particles, n_groups
real*4 group_id(n_particles)
read (*) n_particles, n_groups
read (*) (group_id(j),j=1,n_particles)
In detail, the file format is:
Bytes 1-4 -- The integer 8.
Bytes 5-8 -- The number of particles, N.
Bytes 9-12 -- The number of groups.
Bytes 13-16 -- The integer 8.
Bytes 17-20 -- The integer 4*N.
Next many bytes -- The group ID numbers for all the particles.
Last 4 bytes -- The integer 4*N.
How can I read this with Python? I tried everything but it never worked. Is there any chance I could call an f90 program from Python to read this binary file and then save the data that I need?
Read the binary file content like this:
with open(fileName, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()
Then "unpack" the binary data using struct.unpack:
The start bytes:
struct.unpack("iiiii", fileContent[:20])
The body: ignore the 20 heading bytes and the 4 trailing bytes (24 bytes in total). The remaining part forms the body; an integer division of its length by 4 gives the number of 4-byte values it holds. Multiply the string 'i' by that quotient to create the correct format for the unpack method (use 'f' instead of 'i' if the IDs were written as real*4, as the Fortran declaration in the question suggests):
struct.unpack("i" * ((len(fileContent) - 24) // 4), fileContent[20:-4])
The end bytes:
struct.unpack("i", fileContent[-4:])
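Putting the three steps above together, here is a minimal sketch rather than a definitive reader. The function name parse_group_file is made up for illustration, the sample bytes are built in memory instead of coming from a real file, and it assumes little-endian data with the group IDs stored as 32-bit floats (following the real*4 declaration in the question). Swap the "f" for "i" if your IDs are integers.

```python
import struct

def parse_group_file(raw: bytes, byteorder: str = "<"):
    """Parse the two-record layout from the question (hypothetical helper)."""
    # Record 1: marker (8), N, number of groups, marker (8).
    head, n_particles, n_groups, tail = struct.unpack_from(byteorder + "4i", raw, 0)
    if head != 8 or tail != 8:
        raise ValueError("unexpected record markers; the byte order may be wrong")

    # Record 2: marker (4*N), N values, marker (4*N).
    (size,) = struct.unpack_from(byteorder + "i", raw, 16)
    if size != 4 * n_particles:
        raise ValueError("record length does not match the particle count")
    group_ids = struct.unpack_from(f"{byteorder}{n_particles}f", raw, 20)
    (end,) = struct.unpack_from(byteorder + "i", raw, 20 + size)
    if end != size:
        raise ValueError("trailing record marker does not match")

    return n_particles, n_groups, list(group_ids)

# Made-up sample: 3 particles, 2 groups, little-endian.
sample = struct.pack("<4i", 8, 3, 2, 8) + struct.pack("<i3fi", 12, 1.0, 0.0, 2.0, 12)
print(parse_group_file(sample))
```

Checking the record markers costs almost nothing and catches the two most common mistakes, a wrong byte order and a wrong offset, before they turn into garbage values.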
In general, I would recommend that you look into using Python's struct module for this. It's standard with Python, and it should be easy to translate your question's specification into a formatting string suitable for struct.unpack().
Do note that if there's "invisible" padding between/around the fields, you will need to figure that out and include it in the unpack() call, or you will read the wrong bits.
Reading the contents of the file in order to have something to unpack is pretty trivial:
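import struct
with open("from_fortran.bin", "rb") as f:
    data = f.read()
# struct.unpack needs a buffer of exactly the format's size, so pass only the first 8 bytes
(eight, N) = struct.unpack("@II", data[:8])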
This unpacks the first two fields, assuming they start at the very beginning of the file (no padding or extraneous data), and also assuming native byte order (the @ symbol). Each I in the formatting string means "unsigned integer, 32 bits".
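If you are not sure which byte order the file was written in, the leading record marker can tell you, since the format in the question says it must equal 8. The sketch below is only an illustration: detect_byteorder is a hypothetical helper, and the sample bytes are made up.

```python
import struct

def detect_byteorder(raw: bytes) -> str:
    """Return '<' or '>' depending on which order reads the first marker as 8."""
    for order in ("<", ">"):
        (marker,) = struct.unpack_from(order + "I", raw, 0)
        if marker == 8:
            return order
    raise ValueError("first four bytes are not the expected record marker")

# Made-up headers: the same three integers written in both byte orders.
big = struct.pack(">III", 8, 5, 3)
little = struct.pack("<III", 8, 5, 3)
print(detect_byteorder(big), detect_byteorder(little))

# With an explicit "<" or ">" prefix struct uses standard sizes and no padding,
# so the size of the format is predictable on every platform.
print(struct.calcsize("<II"))
```

Once the order is known, prefixing every format string with it avoids depending on the machine the script happens to run on.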
To read a binary file into a bytes object:
from pathlib import Path
data = Path('/path/to/file').read_bytes() # Python 3.5+
To create an int from bytes 0-3 of the data:
i = int.from_bytes(data[:4], byteorder='little', signed=False)
To unpack multiple ints from the data:
import struct
ints = struct.unpack('iiii', data[:16])
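For comparison, here are the three snippets above run against the same bytes. This is a small sketch under stated assumptions: the file is a temporary one created just for the demonstration, its contents are made up, and the byte order is written explicitly as little-endian so the results do not depend on the machine.

```python
import struct
import tempfile
from pathlib import Path

# Made-up sample: four little-endian 32-bit integers in a temporary file.
path = Path(tempfile.mkdtemp()) / "sample.bin"
path.write_bytes(struct.pack("<4i", 8, 5, 3, 8))

data = path.read_bytes()

# One integer via int.from_bytes ...
first = int.from_bytes(data[:4], byteorder="little", signed=True)
# ... four integers in one call via struct.unpack ...
ints = struct.unpack("<4i", data[:16])
# ... and a stream of 1-tuples via struct.iter_unpack, handy for long runs.
streamed = [value for (value,) in struct.iter_unpack("<i", data)]

print(first, ints, streamed)
```

int.from_bytes is convenient for a single field, while struct scales better once there are several fields or mixed types.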
You could use numpy.fromfile, which can read data from both text and binary files. You would first construct a data type that represents your file format using numpy.dtype, and then read this type from the file using numpy.fromfile.
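As a rough sketch of that idea applied to the layout in the question: the field names in header_dtype and the function read_groups are invented for illustration, the data is assumed to be little-endian, and the group IDs are assumed to be 32-bit floats (per the real*4 declaration). The sample file is written to a temporary directory so the example runs on its own.

```python
import os
import struct
import tempfile

import numpy as np

# Structured dtype for the first 20 bytes: record 1 plus the opening marker of record 2.
header_dtype = np.dtype([
    ("rec1_start", "<i4"),
    ("n_particles", "<i4"),
    ("n_groups", "<i4"),
    ("rec1_end", "<i4"),
    ("rec2_start", "<i4"),
])

def read_groups(path):
    """Read the header, then exactly N float32 group IDs (hypothetical helper)."""
    with open(path, "rb") as fh:
        header = np.fromfile(fh, dtype=header_dtype, count=1)[0]
        n = int(header["n_particles"])
        # fromfile continues from the current file position, right after the header.
        group_ids = np.fromfile(fh, dtype="<f4", count=n)
    return n, int(header["n_groups"]), group_ids

# Made-up sample file: 3 particles, 2 groups.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as fh:
    fh.write(struct.pack("<5i3fi", 8, 3, 2, 8, 12, 1.0, 0.0, 2.0, 12))

n, n_groups, group_ids = read_groups(path)
print(n, n_groups, group_ids)
```

Passing count=n stops the read before the trailing record marker, so it never ends up in the array.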
I too found Python lacking when it comes to reading and writing binary files, so I wrote a small module (for Python 3.6+).
With binaryfile you'd do something like this (I'm guessing, since I don't know Fortran):
import binaryfile

def particle_file(f):
    f.array('group_ids')  # Declare group_ids to be an array (so we can use it in a loop)
    f.skip(4)  # Bytes 1-4
    num_particles = f.count('num_particles', 'group_ids', 4)  # Bytes 5-8
    f.int('num_groups', 4)  # Bytes 9-12
    f.skip(8)  # Bytes 13-20
    for i in range(num_particles):
        f.struct('group_ids', '>f')  # 4 bytes x num_particles
    f.skip(4)

with open('myfile.bin', 'rb') as fh:
    result = binaryfile.read(fh, particle_file)
print(result)
Which produces an output like this:
{
'group_ids': [(1.0,), (0.0,), (2.0,), (0.0,), (1.0,)],
'__skipped': [b'\x00\x00\x00\x08', b'\x00\x00\x00\x08\x00\x00\x00\x14', b'\x00\x00\x00\x14'],
'num_particles': 5,
'num_groups': 3
}
I used skip() to skip the additional data Fortran adds, but you may want to add a utility to handle Fortran records properly instead. If you do, a pull request would be welcome.
If the data is array-like, I like to use numpy.memmap to load it.
Here's an example that loads 1000 samples from 64 channels, stored as two-byte integers.
import numpy as np
mm = np.memmap(filename, np.int16, 'r', shape=(1000, 64))
You can then slice the data along either axis:
mm[5, :] # sample 5, all channels
mm[:, 5] # all samples, channel 5
All the usual formats are available, including C- and Fortran-order, various dtypes and endianness, etc.
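Some advantages of this approach: the file is memory-mapped rather than read up front, so only the parts you index are actually loaded from disk, and the result can be sliced like an ordinary NumPy array.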
For non-array data (e.g., compiled code), heterogeneous formats ("10 chars, then 3 ints, then 5 floats, ..."), or similar, one of the other approaches given above probably makes more sense.
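To try the memmap approach without a real recording, here is a small self-contained sketch. The file name, the 4 x 3 shape, and the 8-byte header are all made up for the demonstration. The offset argument is how you would skip a fixed-size header, such as the leading bytes of the Fortran file in the question.

```python
import os
import tempfile

import numpy as np

# Made-up file: an 8-byte header followed by 4 samples x 3 channels of int16.
path = os.path.join(tempfile.mkdtemp(), "samples.dat")
with open(path, "wb") as fh:
    fh.write(b"\x00" * 8)  # placeholder header bytes
    np.arange(12, dtype="<i2").reshape(4, 3).tofile(fh)

# Map the payload only; offset is given in bytes from the start of the file.
mm = np.memmap(path, dtype="<i2", mode="r", offset=8, shape=(4, 3))

row = mm[1, :].tolist()  # sample 1, all channels
col = mm[:, 2].tolist()  # all samples, channel 2
print(row, col)
```

Spelling the dtype as "<i2" instead of np.int16 pins the byte order, which matters when the file was produced on a different machine.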
#!/usr/bin/python
import array
data = array.array('f')
with open('c:\\code\\c_code\\no1.dat', 'rb') as f:
    data.fromfile(f, 5)  # read 5 single-precision floats
print(data)
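Here is a runnable variant of the array approach that does not depend on an existing file. It is only a sketch: the file name and the five values are made up, and the data is written in the machine's native byte order.

```python
import array
import os
import tempfile

# Made-up sample: five native-order single-precision floats in a temporary file.
path = os.path.join(tempfile.mkdtemp(), "no1.dat")
with open(path, "wb") as fh:
    array.array("f", [1.0, 0.0, 2.0, 0.0, 1.0]).tofile(fh)

data = array.array("f")
with open(path, "rb") as fh:
    # fromfile raises EOFError if fewer than 5 items are available.
    data.fromfile(fh, 5)

# For a file written on a machine with the opposite byte order,
# data.byteswap() would convert the values in place.
print(data.tolist())
```

The array module always reads in native byte order, so byteswap() is the only adjustment available. For anything more involved, struct or NumPy is the better fit.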
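import pickle
# pickle.load only understands streams written by pickle.dump, not arbitrary binary files
with open("filename.dat", "rb") as f:
    try:
        while True:
            x = pickle.load(f)
            print(x)
    except EOFError:
        pass  # reached the end of the file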