
Parsing Binary Data in C?

Are there any libraries or guides for how to read and parse binary data in C?

I am looking at some functionality that will receive TCP packets on a network socket and then parse that binary data according to a specification, turning the information into a form that is more usable by the code.

Are there any libraries out there that do this, or even a primer on performing this type of thing?

I have to disagree with many of the responses here. I strongly suggest you avoid the temptation to cast a struct onto the incoming data. It seems compelling and might even work on your current target, but if the code is ever ported to another target/environment/compiler, you'll run into trouble. A few reasons:

Endianness: The architecture you're using right now might be big-endian, but your next target might be little-endian. Or vice-versa. You can overcome this with macros (ntoh and hton, for example), but it's extra work and you have to make sure you call those macros every time you reference the field.

Alignment: The architecture you're using might be capable of loading a multi-byte word at an odd-addressed offset, but many architectures cannot. If a 4-byte word straddles a 4-byte alignment boundary, the load may pull garbage. Even if the protocol itself doesn't have misaligned words, sometimes the byte stream itself is misaligned. (For example, although the IP header definition puts all 4-byte words on 4-byte boundaries, often the Ethernet header pushes the IP header itself onto a 2-byte boundary.)

Padding: Your compiler might choose to pack your struct tightly with no padding, or it might insert padding to deal with the target's alignment constraints. I've seen this change between two versions of the same compiler. You could use #pragmas to force the issue, but #pragmas are, of course, compiler-specific.

Bit Ordering: The ordering of bits inside C bitfields is compiler-specific. Plus, the bits are hard to "get at" for your runtime code. Every time you reference a bitfield inside a struct, the compiler has to use a set of mask/shift operations. Of course, you're going to have to do that masking/shifting at some point, but best not to do it at every reference if speed is a concern. (If space is the overriding concern, then use bitfields, but tread carefully.)
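The alignment and padding points above can be made concrete with a small sketch. The struct `Padded` and the helpers `padding_after_tag` and `load_u32_unaligned` are hypothetical names invented for illustration; the amount of padding you observe depends on your compiler and target, which is exactly the portability problem being described.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: one byte followed by a 32-bit word.
 * Many compilers insert padding after `tag` so that `word` is aligned;
 * how much is implementation-defined. */
struct Padded {
    uint8_t  tag;
    uint32_t word;
};

/* Number of padding bytes the compiler placed between tag and word. */
static size_t padding_after_tag(void)
{
    return offsetof(struct Padded, word) - sizeof(uint8_t);
}

/* Read a 32-bit value from any address, aligned or not.
 * memcpy avoids the misaligned load that *(const uint32_t *)p could cause.
 * The result is in host byte order; endianness is a separate concern. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On a typical 32-bit or 64-bit target `padding_after_tag()` is likely to report 3, but nothing in the language guarantees that, so wire formats should not be defined in terms of it.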

All this is not to say "don't use structs." My favorite approach is to declare a friendly native-endian struct of all the relevant protocol data without any bitfields and without concern for the issues above, then write a set of symmetric pack/parse routines that use the struct as a go-between.

typedef struct _MyProtocolData
{
    Bool myBitA;  // Using a "Bool" type wastes a lot of space, but it's fast.
    Bool myBitB;
    Word32 myWord;  // You have a list of base types like Word32, right?
} MyProtocolData;

Void myProtocolParse(const Byte *pProtocol, MyProtocolData *pData)
{
    // Somewhere, your code has to pick out the bits.  Best to just do it one place.
    // The mask must be parenthesized: >> binds tighter than &.
    pData->myBitA = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_A_MASK) >> MY_BIT_A_SHIFT;
    pData->myBitB = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_B_MASK) >> MY_BIT_B_SHIFT;

    // Endianness and Alignment issues go away when you fetch byte-at-a-time.
    // Here, I'm assuming the protocol is big-endian.
    // You could also write a library of "word fetchers" for different sizes and endiannesses.
    // Cast each byte to Word32 before shifting so the shift is not done in a signed int.
    pData->myWord  = (Word32)*(pProtocol + MY_WORD_OFFSET + 0) << 24;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 1) << 16;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 2) << 8;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 3);

    // You could return something useful, like the end of the protocol or an error code.
}

Void myProtocolPack(const MyProtocolData *pData, Byte *pProtocol)
{
    // Exercise for the reader!  :)
}

Now, the rest of your code just manipulates data inside the friendly, fast struct objects and only calls the pack/parse when you have to interface with a byte stream. There's no need for ntoh or hton, and no bitfields to slow down your code.
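For readers who want to see the pack side as well, here is one possible way to fill in both routines. This is only a sketch under assumptions of mine: the 5-byte wire layout, the offsets and the masks are invented for illustration, and the standard `bool`, `uint8_t` and `uint32_t` types stand in for the `Bool`, `Byte` and `Word32` base types mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wire layout (not from any real protocol):
 *   byte 0     : bit 7 = A, bit 6 = B, remaining bits reserved (zero)
 *   bytes 1..4 : 32-bit word, big-endian */
enum {
    MY_BITS_OFFSET   = 0,
    MY_WORD_OFFSET   = 1,
    MY_PROTOCOL_SIZE = 5,
    MY_BIT_A_MASK = 0x80, MY_BIT_A_SHIFT = 7,
    MY_BIT_B_MASK = 0x40, MY_BIT_B_SHIFT = 6
};

typedef struct {
    bool     myBitA;
    bool     myBitB;
    uint32_t myWord;
} MyProtocolData;

/* Decode MY_PROTOCOL_SIZE bytes at p into *d. */
void myProtocolParse(const uint8_t *p, MyProtocolData *d)
{
    d->myBitA = (p[MY_BITS_OFFSET] & MY_BIT_A_MASK) >> MY_BIT_A_SHIFT;
    d->myBitB = (p[MY_BITS_OFFSET] & MY_BIT_B_MASK) >> MY_BIT_B_SHIFT;
    d->myWord = ((uint32_t)p[MY_WORD_OFFSET + 0] << 24)
              | ((uint32_t)p[MY_WORD_OFFSET + 1] << 16)
              | ((uint32_t)p[MY_WORD_OFFSET + 2] << 8)
              |  (uint32_t)p[MY_WORD_OFFSET + 3];
}

/* Encode *d into MY_PROTOCOL_SIZE bytes at p; the mirror image of parse. */
void myProtocolPack(const MyProtocolData *d, uint8_t *p)
{
    p[MY_BITS_OFFSET] = (uint8_t)((d->myBitA ? MY_BIT_A_MASK : 0)
                                | (d->myBitB ? MY_BIT_B_MASK : 0));
    p[MY_WORD_OFFSET + 0] = (uint8_t)(d->myWord >> 24);
    p[MY_WORD_OFFSET + 1] = (uint8_t)(d->myWord >> 16);
    p[MY_WORD_OFFSET + 2] = (uint8_t)(d->myWord >> 8);
    p[MY_WORD_OFFSET + 3] = (uint8_t)(d->myWord);
}
```

Because both routines work a byte at a time, a parse followed by a pack should reproduce the original bytes on any host, regardless of its endianness or alignment rules.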

The standard way to do this in C/C++ is really casting to structs, as 'gwaredd' suggested.

It is not as unsafe as one would think. You first cast to the struct that you expected, as in his/her example, then you test that struct for validity. You have to test for max/min values, termination sequences, etc.
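A rough sketch of that "interpret, then validate" step might look like the following. The `Hello` message, its fields and its limits are all invented for illustration. Note one substitution of mine: the bytes are copied into the struct with `memcpy` rather than reached through a cast pointer, which sidesteps alignment trouble; the struct uses only byte-sized members, so padding and endianness should not come into play here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { HELLO_VERSION = 1, HELLO_NAME_CAP = 16 };

/* Hypothetical fixed-size message; field names and limits are invented. */
typedef struct {
    uint8_t version;               /* must equal HELLO_VERSION */
    uint8_t name_len;              /* must be < HELLO_NAME_CAP */
    char    name[HELLO_NAME_CAP];  /* must have '\0' at index name_len */
} Hello;

/* Returns true and fills *out only if buf holds a plausible Hello. */
bool hello_from_bytes(const void *buf, size_t len, Hello *out)
{
    Hello tmp;

    if (len < sizeof tmp)
        return false;                   /* too short to contain the message */
    memcpy(&tmp, buf, sizeof tmp);      /* copy instead of casting the pointer */

    if (tmp.version != HELLO_VERSION)
        return false;                   /* unknown version */
    if (tmp.name_len >= HELLO_NAME_CAP)
        return false;                   /* length out of range */
    if (tmp.name[tmp.name_len] != '\0')
        return false;                   /* terminator not where the length says */

    *out = tmp;
    return true;
}
```

The caller only ever sees a `Hello` that passed every check, so the rest of the code can use `name` as an ordinary C string.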

Whatever platform you are on, you must read Unix Network Programming, Volume 1: The Sockets Networking API. Buy it, borrow it, steal it (the victim will understand, it's like stealing food or something...), but do read it.

After reading the Stevens, most of this will make a lot more sense.

Let me restate your question to see if I understood properly. You are looking for software that will take a formal description of a packet and then will produce a "decoder" to parse such packets?

If so, the reference in that field is PADS. A good article introducing it is "PADS: A Domain-Specific Language for Processing Ad Hoc Data". PADS is very complete but unfortunately under a non-free licence.

There are possible alternatives (I did not mention non-C solutions). Apparently, none can be regarded as completely production-ready:

If you read French, I summarized these issues in "Génération de décodeurs de formats binaires".

In my experience, the best way is to first write a set of primitives, to read/write a single value of some type from a binary buffer. This gives you high visibility, and a very simple way to handle any endianness issues: just make the functions do it right.

Then, you can for instance define structs for each of your protocol messages, and write pack/unpack (some people call them serialize/deserialize) functions for each.

As a base case, a primitive to extract a single 8-bit integer could look like this (assuming an 8-bit char on the host machine; you could add a layer of custom types to ensure that too, if needed):

const void * read_uint8(const void *buffer, unsigned char *value)
{
  const unsigned char *vptr = buffer;
  *value = *vptr++;  /* read through the typed pointer; a void pointer cannot be dereferenced */
  return vptr;
}

Here, I chose to return the value by reference, and return an updated pointer. This is a matter of taste; you could of course return the value and update the pointer by reference. It is a crucial part of the design that the read-function updates the pointer, to make these chainable.
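For comparison, the other shape mentioned above (return the value, update the pointer by reference) could be sketched like this. The name `get_uint8` is hypothetical, and the sketch assumes the caller has already checked that at least one byte remains in the buffer.

```c
#include <stdint.h>

/* Alternative shape for the same primitive: return the value and advance
 * the caller's cursor through a pointer-to-pointer.
 * Assumes the caller has verified that a byte is available at *cursor. */
uint8_t get_uint8(const unsigned char **cursor)
{
    uint8_t v = **cursor;  /* fetch the byte under the cursor */
    *cursor += 1;          /* step the caller's cursor past it */
    return v;
}
```

Successive calls still chain, since each one leaves the cursor at the next unread byte; the trade-off is that the value can be used directly in an expression, at the cost of the slightly less familiar double pointer.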

Now, we can write a similar function to read a 16-bit unsigned quantity:

const void * read_uint16(const void *buffer, unsigned short *value)
{
  unsigned char lo, hi;

  buffer = read_uint8(buffer, &hi);
  buffer = read_uint8(buffer, &lo);
  *value = (hi << 8) | lo;
  return buffer;
}

Here I assumed incoming data is big-endian, which is common in networking protocols (mainly for historical reasons). You could of course get clever and do some pointer arithmetic and remove the need for a temporary, but I find this way makes it clearer and easier to understand. Having maximal transparency in this kind of primitive can be a good thing when debugging.
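The same pattern works in the other direction. As a sketch, here are matching write primitives; the names `write_uint8` and `write_uint16` are my own, chosen to mirror the read functions, and they make the same big-endian assumption.

```c
/* Write one byte and return the position just past it. */
void *write_uint8(void *buffer, unsigned char value)
{
    unsigned char *p = buffer;
    *p++ = value;
    return p;
}

/* Write a 16-bit value big-endian (high byte first), mirroring read_uint16. */
void *write_uint16(void *buffer, unsigned short value)
{
    buffer = write_uint8(buffer, (unsigned char)((value >> 8) & 0xFF));
    buffer = write_uint8(buffer, (unsigned char)(value & 0xFF));
    return buffer;
}
```

Keeping each writer an exact mirror of its reader makes round-trip tests (write a value, read it back, compare) cheap to add.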

The next step would be to start defining your protocol-specific messages, and write read/write primitives to match. At that level, think about code generation; if your protocol is described in some general, machine-readable format, you can generate the read/write functions from that, which saves a lot of grief. This is harder if the protocol format is clever enough, but often doable and highly recommended.
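To illustrate what such a message-level function might look like, whether hand-written or emitted by a generator, here is a minimal sketch. The `status_msg` layout is invented for illustration, and the two read primitives are repeated in compact form only so that the example stands on its own.

```c
#include <stddef.h>

/* Primitives in the style shown earlier (big-endian on the wire). */
static const void *read_uint8(const void *buffer, unsigned char *value)
{
    const unsigned char *p = buffer;
    *value = *p++;
    return p;
}

static const void *read_uint16(const void *buffer, unsigned short *value)
{
    unsigned char hi, lo;
    buffer = read_uint8(buffer, &hi);
    buffer = read_uint8(buffer, &lo);
    *value = (unsigned short)((hi << 8) | lo);
    return buffer;
}

/* Hypothetical message: 1-byte type, 16-bit id, 16-bit length. */
struct status_msg {
    unsigned char  type;
    unsigned short id;
    unsigned short length;
};

enum { STATUS_MSG_WIRE_SIZE = 5 };  /* 1 + 2 + 2 bytes on the wire */

/* Decode one status_msg. Returns the position just past it,
 * or NULL if fewer than STATUS_MSG_WIRE_SIZE bytes are available. */
const void *status_msg_unpack(const void *buffer, size_t avail,
                              struct status_msg *msg)
{
    if (avail < STATUS_MSG_WIRE_SIZE)
        return NULL;
    buffer = read_uint8(buffer, &msg->type);
    buffer = read_uint16(buffer, &msg->id);
    buffer = read_uint16(buffer, &msg->length);
    return buffer;
}
```

The body is just one chained read per field in wire order, which is the kind of repetitive code that is easy to generate from a field list.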

You might be interested in Google Protocol Buffers, which is basically a serialization framework. It's primarily for C++/Java/Python (those are the languages supported by Google) but there are ongoing efforts to port it to other languages, including C. (I haven't used the C port at all, but I'm responsible for one of the C# ports.)

You don't really need to parse binary data in C, just cast some pointer to whatever you think it should be.

struct SomeDataFormat
{
    ....
};

struct SomeDataFormat *pParsedData = (struct SomeDataFormat *) pBuffer;

Just be wary of endian issues, type sizes, reading off the end of buffers, etc.

Parsing/formatting binary structures is one of the very few things that is easier to do in C than in higher-level/managed languages. You simply define a struct that corresponds to the format you want to handle and the struct is the parser/formatter. This works because a struct in C represents a precise memory layout (which is, of course, already binary). See also kervin's and gwaredd's replies.

I don't really understand what kind of library you are looking for. A generic library that will take any binary input and parse it into an unknown format? I'm not sure such a library can exist in any language. I think you need to elaborate on your question a little bit.

Edit:
OK, so after reading Jon's answer it seems there is a library, or rather a kind of library; it's more like a code generation tool. But as many stated, if you just cast the data to the appropriate data structure with appropriate care (i.e. using packed structures and taking care of endian issues), you are good. Using such a tool with C is just overkill.

Basically, the suggestions about casting to a struct work, but please be aware that numbers can be represented differently on different architectures.

To deal with endian issues, network byte order was introduced: common practice is to convert numbers from host byte order to network byte order before sending the data and to convert back to host order on receipt. See functions htonl, htons, ntohl and ntohs.
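As a small sketch of that convention for a single 32-bit field, assuming a POSIX system where `htonl` and `ntohl` are declared in `<arpa/inet.h>`; the helper names `put_u32_net` and `get_u32_net` are hypothetical.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Store a host-order value into a byte buffer in network (big-endian) order. */
void put_u32_net(unsigned char *out, uint32_t host_value)
{
    uint32_t net = htonl(host_value);
    memcpy(out, &net, sizeof net);  /* memcpy avoids alignment assumptions */
}

/* Load a network-order value from a byte buffer and convert it to host order. */
uint32_t get_u32_net(const unsigned char *in)
{
    uint32_t net;
    memcpy(&net, in, sizeof net);
    return ntohl(net);
}
```

On a big-endian host both conversions are no-ops, and on a little-endian host they swap the bytes, so the bytes on the wire come out the same either way.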

And really consider kervin's advice: read UNP. You won't regret it!

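Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.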

 