
Parsing Binary Data in C?

Are there any libraries or guides for how to read and parse binary data in C?

I am looking at some functionality that will receive TCP packets on a network socket and then parse that binary data according to a specification, turning the information into a form that the rest of the code can use more easily.

Are there any libraries out there that do this, or even a primer on performing this type of thing?

I have to disagree with many of the responses here. I strongly suggest you avoid the temptation to cast a struct onto the incoming data. It seems compelling and might even work on your current target, but if the code is ever ported to another target/environment/compiler, you'll run into trouble. A few reasons:

Endianness : The architecture you're using right now might be big-endian, but your next target might be little-endian, or vice versa. You can overcome this with macros (ntoh and hton, for example), but it's extra work and you have to make sure you call those macros every time you reference the field.

Alignment : The architecture you're using might be capable of loading a multi-byte word at an odd-addressed offset, but many architectures cannot. If a 4-byte word straddles a 4-byte alignment boundary, the load may pull garbage. Even if the protocol itself doesn't have misaligned words, sometimes the byte stream itself is misaligned. (For example, although the IP header definition puts all 4-byte words on 4-byte boundaries, often the Ethernet header pushes the IP header itself onto a 2-byte boundary.)

Padding : Your compiler might choose to pack your struct tightly with no padding, or it might insert padding to deal with the target's alignment constraints. I've seen this change between two versions of the same compiler. You could use #pragmas to force the issue, but #pragmas are, of course, compiler-specific.

Bit Ordering : The ordering of bits inside C bitfields is compiler-specific. Plus, the bits are hard to "get at" for your runtime code. Every time you reference a bitfield inside a struct, the compiler has to use a set of mask/shift operations. Of course, you're going to have to do that masking/shifting at some point, but best not to do it at every reference if speed is a concern. (If space is the overriding concern, then use bitfields, but tread carefully.)
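A minimal sketch of portable alternatives for three of the pitfalls above (alignment, padding, bit ordering). All names here (load_u32_unaligned, WireHeader, struct_matches_wire_layout, ipv4_version, ipv4_ihl) are hypothetical and chosen only for illustration; the 5-byte header layout is an invented example, while the IPv4 version/IHL split of the first header byte is the real one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Alignment: memcpy has no alignment requirement, so this is safe even when
   p is not 4-byte aligned. The result is still in host byte order. */
static uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Padding: a hypothetical wire header of 1 type byte + 4 length bytes.
   Many compilers insert padding after `type`, so the struct is not
   guaranteed to overlay the 5-byte wire format. */
struct WireHeader {
    uint8_t  type;
    uint32_t length;
};
enum { WIRE_HEADER_SIZE = 5 };

/* Returns 1 only if the compiler happened to lay the struct out exactly
   like the wire format, 0 otherwise. */
static int struct_matches_wire_layout(void)
{
    return sizeof(struct WireHeader) == WIRE_HEADER_SIZE &&
           offsetof(struct WireHeader, length) == 1;
}

/* Bit ordering: explicit shifts and masks give the same answer with every
   compiler. The first byte of an IPv4 header holds the version in the high
   nibble and the header length (IHL) in the low nibble. */
static unsigned ipv4_version(uint8_t first_byte)
{
    return (unsigned)(first_byte >> 4) & 0x0Fu;
}

static unsigned ipv4_ihl(uint8_t first_byte)
{
    return first_byte & 0x0Fu;
}
```

On a typical target struct_matches_wire_layout() returns 0, which is exactly why the cast is fragile; the result is implementation-defined, though, so check it on your own compiler.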

All this is not to say "don't use structs." My favorite approach is to declare a friendly native-endian struct of all the relevant protocol data without any bitfields and without concern for the issues, then write a set of symmetric pack/parse routines that use the struct as a go-between.

typedef struct MyProtocolData  // No leading underscore: _Uppercase names are reserved in C.
{
    Bool myBitA;  // Using a "Bool" type wastes a lot of space, but it's fast.
    Bool myBitB;
    Word32 myWord;  // You have a list of base types like Word32, right?
} MyProtocolData;

Void myProtocolParse(const Byte *pProtocol, MyProtocolData *pData)
{
    // Somewhere, your code has to pick out the bits.  Best to just do it one place.
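    // Parenthesize the mask: >> binds tighter than &, so without the parentheses
    // the mask itself would be shifted before being applied.
    pData->myBitA = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_A_MASK) >> MY_BIT_A_SHIFT;
    pData->myBitB = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_B_MASK) >> MY_BIT_B_SHIFT;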

    // Endianness and Alignment issues go away when you fetch byte-at-a-time.
    // Here, I'm assuming the protocol is big-endian.
    // You could also write a library of "word fetchers" for different sizes and endiannesses.
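    // Cast before shifting so the shift is done in Word32, not in a promoted (signed) int.
    pData->myWord  = (Word32)*(pProtocol + MY_WORD_OFFSET + 0) << 24;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 1) << 16;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 2) << 8;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 3);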

    // You could return something useful, like the end of the protocol or an error code.
}

Void myProtocolPack(const MyProtocolData *pData, Byte *pProtocol)
{
    // Exercise for the reader!  :)
}

Now, the rest of your code just manipulates data inside the friendly, fast struct objects and only calls the pack/parse when you have to interface with a byte stream. There's no need for ntoh or hton, and no bitfields to slow down your code.
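One possible shape for the pack routine left as an exercise above. This is only a sketch: the offsets, masks and total size are invented for illustration (the answer never defines them), and uint8_t, uint32_t and int stand in for the answer's Byte, Word32 and Bool types.

```c
#include <stdint.h>

/* Hypothetical layout: one flags byte followed by a big-endian 32-bit word. */
enum {
    MY_BITS_OFFSET   = 0,
    MY_BIT_A_MASK    = 0x80,
    MY_BIT_B_MASK    = 0x40,
    MY_WORD_OFFSET   = 1,
    MY_PROTOCOL_SIZE = 5
};

typedef struct MyProtocolData
{
    int      myBitA;
    int      myBitB;
    uint32_t myWord;
} MyProtocolData;

/* Writes MY_PROTOCOL_SIZE bytes to pProtocol; the caller owns the buffer. */
static void myProtocolPack(const MyProtocolData *pData, uint8_t *pProtocol)
{
    /* Rebuild the flags byte from the individual booleans. */
    uint8_t bits = 0;
    if (pData->myBitA) bits = (uint8_t)(bits | MY_BIT_A_MASK);
    if (pData->myBitB) bits = (uint8_t)(bits | MY_BIT_B_MASK);
    pProtocol[MY_BITS_OFFSET] = bits;

    /* Store the word most significant byte first (big-endian),
       one byte at a time, so host endianness and alignment do not matter. */
    pProtocol[MY_WORD_OFFSET + 0] = (uint8_t)(pData->myWord >> 24);
    pProtocol[MY_WORD_OFFSET + 1] = (uint8_t)(pData->myWord >> 16);
    pProtocol[MY_WORD_OFFSET + 2] = (uint8_t)(pData->myWord >> 8);
    pProtocol[MY_WORD_OFFSET + 3] = (uint8_t)(pData->myWord);
}
```

Keeping pack and parse next to each other, driven by the same offset and mask constants, makes it easy to round-trip a struct in a unit test.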

The standard way to do this in C/C++ really is casting to structs, as 'gwaredd' suggested.

It is not as unsafe as one would think. You first cast to the struct that you expected, as in his/her example, then you test that struct for validity. You have to test for max/min values, termination sequences, etc.
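A minimal sketch of the "read, then validate" step. The header format, the MSG_TYPE_MAX limit and the function name are all hypothetical. Note that the sketch copies the bytes into the struct with memcpy instead of casting the pointer, which sidesteps alignment problems; both fields are single bytes, so byte order and padding do not come into play in this particular example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MSG_TYPE_MAX = 3 };  /* hypothetical: valid types are 0..3 */

/* Hypothetical two-byte header: message type, then payload length in bytes. */
struct MsgHeader {
    uint8_t type;
    uint8_t payload_len;
};

/* Returns 0 and fills *out if the buffer holds a plausible message,
   -1 if it is too short or a field is out of range. */
static int msg_header_read(const unsigned char *buf, size_t len,
                           struct MsgHeader *out)
{
    if (len < sizeof *out)
        return -1;                         /* not even a full header */
    memcpy(out, buf, sizeof *out);
    if (out->type > MSG_TYPE_MAX)
        return -1;                         /* unknown message type */
    if (out->payload_len > len - sizeof *out)
        return -1;                         /* payload would run off the buffer */
    return 0;
}
```

The length check comes first so that nothing is read beyond the bytes actually received.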

Whatever platform you are on, you must read Unix Network Programming, Volume 1: The Sockets Networking API. Buy it, borrow it, steal it (the victim will understand, it's like stealing food or something...), but do read it.

After reading the Stevens, most of this will make a lot more sense.

Let me restate your question to see if I understood properly. You are looking for software that will take a formal description of a packet and then will produce a "decoder" to parse such packets?

If so, the reference in that field is PADS. A good article introducing it is "PADS: A Domain-Specific Language for Processing Ad Hoc Data". PADS is very complete but unfortunately under a non-free licence.

There are possible alternatives (I did not mention non-C solutions). Apparently, none can be regarded as completely production-ready:

If you read French, I summarized these issues in "Génération de décodeurs de formats binaires".

In my experience, the best way is to first write a set of primitives, to read/write a single value of some type from a binary buffer. This gives you high visibility, and a very simple way to handle any endianness-issues: just make the functions do it right.

Then, you can for instance define structs for each of your protocol messages, and write pack/unpack (some people call them serialize/deserialize) functions for each.

As a base case, a primitive to extract a single 8-bit integer could look like this (assuming an 8-bit char on the host machine; you could add a layer of custom types to ensure that too, if needed):

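const void * read_uint8(const void *buffer, unsigned char *value)
{
  /* A void pointer cannot be dereferenced or incremented, so work through
     a byte pointer and return the advanced position. */
  const unsigned char *vptr = buffer;
  *value = *vptr++;
  return vptr;
}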

Here, I chose to return the value by reference, and return an updated pointer. This is a matter of taste; you could of course return the value and update the pointer by reference instead. It is a crucial part of the design that the read function updates the pointer, which makes the calls chainable.
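For comparison, a minimal sketch of the alternative signature mentioned above, where the function returns the value and advances the caller's pointer through a pointer-to-pointer. The name read_uint8_alt is hypothetical and not part of the original answer.

```c
/* Returns the byte at *cursor and advances *cursor past it.
   The caller is responsible for making sure a byte is available. */
static unsigned char read_uint8_alt(const unsigned char **cursor)
{
    unsigned char value = **cursor;
    (*cursor)++;
    return value;
}
```

With this shape, successive calls on the same cursor variable read successive bytes, so reads chain just as well; the trade-off is that the call site has to pass the address of its pointer.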

Now, we can write a similar function to read a 16-bit unsigned quantity:

const void * read_uint16(const void *buffer, unsigned short *value)
{
  unsigned char lo, hi;

  buffer = read_uint8(buffer, &hi);
  buffer = read_uint8(buffer, &lo);
  *value = (hi << 8) | lo;
  return buffer;
}

Here I assumed incoming data is big-endian, which is common in networking protocols (mainly for historical reasons). You could of course get clever and do some pointer arithmetic and remove the need for the temporaries, but I find this way makes it clearer and easier to understand. Having maximal transparency in this kind of primitive can be a good thing when debugging.
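The write side can mirror these readers. A minimal sketch, assuming the same big-endian wire order; write_uint8 and write_uint16 are hypothetical names chosen to match the readers, not functions from the original answer.

```c
/* Stores one byte and returns the advanced write position. */
static void *write_uint8(void *buffer, unsigned char value)
{
    unsigned char *vptr = buffer;
    *vptr++ = value;
    return vptr;
}

/* Stores a 16-bit value most significant byte first (big-endian). */
static void *write_uint16(void *buffer, unsigned short value)
{
    buffer = write_uint8(buffer, (unsigned char)((value >> 8) & 0xFFu));
    buffer = write_uint8(buffer, (unsigned char)(value & 0xFFu));
    return buffer;
}
```

Because each writer returns the updated pointer, writes chain in the same way as the reads do.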

The next step would be to start defining your protocol-specific messages, and write read/write primitives to match. At that level, think about code generation; if your protocol is described in some general, machine-readable format, you can generate the read/write functions from that, which saves a lot of grief. This is harder if the protocol format is clever enough, but often doable and highly recommended.
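A minimal sketch of what a message-level reader built from the primitives might look like. The LoginMessage layout and read_login_message are invented for illustration; the two primitives are repeated so the block stands alone, and big-endian wire order is assumed as before.

```c
static const void *read_uint8(const void *buffer, unsigned char *value)
{
    const unsigned char *vptr = buffer;
    *value = *vptr++;
    return vptr;
}

static const void *read_uint16(const void *buffer, unsigned short *value)
{
    unsigned char hi, lo;

    buffer = read_uint8(buffer, &hi);
    buffer = read_uint8(buffer, &lo);
    *value = (unsigned short)((hi << 8) | lo);
    return buffer;
}

/* Hypothetical message: 1-byte version, 2-byte user id, 2-byte session id. */
struct LoginMessage {
    unsigned char  version;
    unsigned short user_id;
    unsigned short session;
};

/* Decodes one message and returns the position just past it.
   The caller must have checked that at least 5 bytes are available. */
static const void *read_login_message(const void *buffer,
                                      struct LoginMessage *msg)
{
    buffer = read_uint8(buffer, &msg->version);
    buffer = read_uint16(buffer, &msg->user_id);
    buffer = read_uint16(buffer, &msg->session);
    return buffer;
}
```

A function of this shape is mechanical enough that it is a natural target for code generation from a protocol description.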

You might be interested in Google Protocol Buffers, which is basically a serialization framework. It's primarily for C++/Java/Python (those are the languages supported by Google) but there are ongoing efforts to port it to other languages, including C. (I haven't used the C port at all, but I'm responsible for one of the C# ports.)

You don't really need to parse binary data in C, just cast some pointer to whatever you think it should be.

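struct SomeDataFormat
{
    ....
};

/* In C (unlike C++), the struct keyword is required here unless you add a typedef. */
struct SomeDataFormat *pParsedData = (struct SomeDataFormat *) pBuffer;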

Just be wary of endian issues, type sizes, reading off the end of buffers, etc.

Parsing/formatting binary structures is one of the very few things that is easier to do in C than in higher-level/managed languages. You simply define a struct that corresponds to the format you want to handle and the struct is the parser/formatter. This works because a struct in C represents a precise memory layout (which is, of course, already binary). See also kervin's and gwaredd's replies.

I don't really understand what kind of library you are looking for. A generic library that will take any binary input and parse it into an unknown format? I'm not sure such a library can exist in any language. I think you need to elaborate on your question a little.

Edit:
OK, so after reading Jon's answer, it seems there is such a library, although it is more of a code generation tool. But as many have stated, if you just cast the data to the appropriate data structure with appropriate care (i.e. using packed structures and taking care of endian issues), you are good. Using such a tool with C is just overkill.

Basically, the suggestions about casting to a struct work, but please be aware that numbers can be represented differently on different architectures.

To deal with endian issues, network byte order was introduced: the common practice is to convert numbers from host byte order to network byte order before sending the data, and to convert them back to host order on receipt. See the functions htonl, htons, ntohl and ntohs.
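A minimal sketch of that convention using htonl and ntohl. It assumes a POSIX system, where these functions are declared in <arpa/inet.h> (on Windows they come from <winsock2.h>); the helper names put_u32_net and get_u32_net are hypothetical. memcpy is used to move the value in and out of the buffer so the buffer does not need to be aligned.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Sender side: convert to network (big-endian) order, then copy out. */
static void put_u32_net(unsigned char *out, uint32_t host_value)
{
    uint32_t net = htonl(host_value);
    memcpy(out, &net, sizeof net);
}

/* Receiver side: copy in, then convert back to host order. */
static uint32_t get_u32_net(const unsigned char *in)
{
    uint32_t net;
    memcpy(&net, in, sizeof net);
    return ntohl(net);
}
```

Whatever the host's endianness, the bytes on the wire come out most significant first, and the receiver recovers the original value.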

And really consider kervin's advice: read UNP. You won't regret it!
