简体   繁体   中英

Parsing binary files with Python

As a side project I would like to try to parse binary files (Mach-O files specifically). I know tools exist for this already (otool) so consider this a learning exercise.

The problem I'm hitting is that I don't understand how to convert the binary elements found into a python representation. For example, the Mach-O file format starts with a header which is defined by a C Struct. The first item is a uint_32 'magic number' field. When i do

magic = f.read(4)

I get

b'\xcf\xfa\xed\xfe'

This is starting to make sense to me. It's literally a byte array of 4 bytes. However I want to treat this like a 4-byte int that represents the original magic number. Another example is the numberOfSections field. I just want the number represented by 4-byte field, not an array of literal bytes.

Perhaps I'm thinking about this all wrong. Has anybody worked on anything similar? Do I need to write functions to look these 4-byte byte arrays and shift and combine their values to produce the number I want? Is endienness going to screw me here? Any pointers would be most helpful.

Take a look at the struct module:

In [1]: import struct

In [2]: magic = b'\xcf\xfa\xed\xfe'

In [3]: decoded = struct.unpack('<I', magic)[0]

In [4]: hex(decoded)
Out[4]: '0xfeedfacf'

There's Kaitai Struct project that solves exactly that problem. First, you describe a certain file format using a .ksy spec, then you compile it into a Python library (or, actually, a library in any other major programming language), import it, and, voila, parsing boils down to:

from mach_o import MachO
my_file = MachO.from_file("/path/to/your/file")
my_file.magic # => 0xfeedface
my_file.num_of_sections # => some other integer
my_file.sections # => list of objects that represent sections

They have a growing repository of file format specs . It doesn't have Mach-O file format spec (yet?), but there are complex formats like Java .class or Microsoft's PE executable described there, so I guess it shouldn't be a major problem to write spec for Mach-O format as well.

It is actually better than Construct or Hachoir , because it's compiled (as opposed to interpreted), thus it's faster, and it includes tons of other helpful tools like visualizer or format diagram maker. For example, this is a generated explanation diagram for PE executable format:

PE可执行格式

I would sugest the Construct module. It offeres a very high level interface.

I wrote a code recipe a while back that aims to simplify this syntax. Check it out and see if it helps:

http://code.activestate.com/recipes/577610-decoding-binary-files/?in=user-4175703

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM