
How do I unpack a byte array from a binary file using Python?

I'm giving myself a crash course in reading a binary file using Python. I'm new to both, so please bear with me.

The file format's documentation tells me that the first 16 bytes are a GUID and further reading tells me that this GUID is formatted thus:

typedef struct {
  unsigned long Data1;
  unsigned short Data2;
  unsigned short Data3;
  byte Data4[8];
} GUID, UUID, *PGUID;

I've got as far as being able to unpack the first three entries in the struct, but I'm stumped on #4. I think it's an array of 8 bytes, but I'm not sure how to unpack it.

import struct

fp = open("./file.bin", mode='rb')

Data1 = struct.unpack('<L', fp.read(4)) # unsigned long, little-endian
Data2 = struct.unpack('<H', fp.read(2)) # unsigned short, little-endian 
Data3 = struct.unpack('<H', fp.read(2)) # unsigned short, little-endian
Data4 = struct.unpack('<s', bytearray(fp.read(8))) # byte array with 8 entries?

struct.error: unpack requires a bytes object of length 1

What am I doing wrong for Data4? (I'm using Python 3.2 BTW)

Data1 through Data3 are OK: if I use hex() on them I get the data I'd expect to see (woohoo). I'm just falling over on the syntax for this byte array.

Edit: Answer

I'm reading a GUID as defined in MS-DTYP and this nailed it:

data = uuid.UUID(bytes_le=fp.read(16))

If you want an 8-byte string, you need to put the number 8 in there:

struct.unpack('<8s', bytearray(fp.read(8)))

From the docs:

A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.

For the 's' format character, the count is interpreted as the length of the bytes, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1. For packing, the string is truncated or padded with null bytes as appropriate to make it fit. For unpacking, the resulting bytes object always has exactly the specified number of bytes. As a special case, '0s' means a single, empty string (while '0c' means 0 characters).
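As a minimal sketch of the distinction the docs draw, the snippet below compares a repeat count with the length count used by 's'. The `sample` bytes are made up for illustration and do not come from the question's file.

```python
import struct

# Hypothetical sample data, not taken from the question's file.
sample = bytes(range(8))

# A count before 'h' repeats the field: '4h' and 'hhhh' are the same format.
four_shorts = struct.unpack('<4h', sample)
same_shorts = struct.unpack('<hhhh', sample)

# A count before 's' is a length: '8s' yields ONE 8-byte bytes object.
one_string = struct.unpack('<8s', sample)

# A count before 'c' repeats the field: '8c' yields EIGHT 1-byte bytes objects.
eight_chars = struct.unpack('<8c', sample)

# With no count, 's' means '1s', which is why '<s' asked for exactly 1 byte.
size_without_count = struct.calcsize('<s')
```

Note that `one_string` is still a one-element tuple, so the bytes themselves are `one_string[0]`.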


However, I'm not sure why you're doing this in the first place.

fp.read(8) gives you an 8-byte bytes object. You want an 8-byte bytes object. So, just do this:

Data4 = fp.read(8)

Converting the bytes to a bytearray has no effect except to make a mutable copy. Unpacking it just gives you back a copy of the same bytes you started with. So… why?


Well, actually, struct.unpack returns a tuple whose one value is a copy of the same bytes you started with, but you can do that with:

Data4 = (fp.read(8),)

Which raises the question of why you want four single-element tuples in the first place. You're going to be doing Data1[0], etc. all over the place for no good reason. Why not this?

Data1, Data2, Data3, Data4 = struct.unpack('<LHH8s', fp.read(16))
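To try the one-call version without the original file, here is a sketch that reads from an in-memory stream instead. The `header` bytes and the `guid_format` name are assumptions made up for the example. `io.BytesIO` stands in for the opened file.

```python
import io
import struct

# Hypothetical 16-byte header standing in for the start of file.bin.
header = bytes.fromhex('78563412' 'bc9a' 'f0de' '0102030405060708')
fp = io.BytesIO(header)

# '<' selects little-endian with standard sizes and no padding:
# L (4 bytes) + H (2) + H (2) + 8s (8) = 16 bytes.
guid_format = struct.Struct('<LHH8s')

# One read, one unpack, four plain values (no single-element tuples).
Data1, Data2, Data3, Data4 = guid_format.unpack(fp.read(guid_format.size))
```

With this sample, Data1 through Data3 come back as plain integers and Data4 as an 8-byte bytes object, so there is no need for `[0]` indexing.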

Of course if this is meant to read a UUID, it's always better to use the "batteries included" than to try to build your own batteries from nickel and cadmium ore. As icktoofay says, just use the uuid module:

data = uuid.UUID(bytes_le=fp.read(16))

But keep in mind that Python's uuid uses the 4-2-2-1-1-6 format, not the 4-2-2-8 format. If you really need exactly that format, you'll need to convert it, which means either struct or bit twiddling anyway. (Microsoft's GUID makes things even more fun by using a 4-2-2-2-6 format, which is not the same as either, and representing the first 3 in native-endian and the last two in big-endian, because they like to make things easier…)
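To see how these layouts relate, the following sketch parses a made-up 16-byte value with the uuid module and then recovers the 4-2-2-8 view with struct. The `raw` bytes are hypothetical. The sketch assumes the file stores the GUID in the little-endian layout that `bytes_le` expects.

```python
import struct
import uuid

# Hypothetical on-disk GUID bytes (first three fields little-endian).
raw = bytes.fromhex('78563412' 'bc9a' 'f0de' '0102030405060708')

guid = uuid.UUID(bytes_le=raw)

# Python's 4-2-2-1-1-6 view: time_low, time_mid, time_hi_version,
# clock_seq_hi_variant, clock_seq_low, node.
fields = guid.fields

# The 4-2-2-8 view from the typedef: bytes_le reproduces the on-disk
# layout, so the struct format from the question applies to it directly.
data1, data2, data3, data4 = struct.unpack('<LHH8s', guid.bytes_le)

# The familiar textual form groups the same bytes as 4-2-2-2-6.
text = str(guid)
```

Here `guid.bytes_le` round-trips to `raw`, so going through uuid loses nothing if the struct-style fields are needed later.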

UUIDs are supported by Python with the uuid module. Do something like this:

import uuid

my_uuid = uuid.UUID(bytes_le=fp.read(16))
