Reading binary file defined by a struct

Could somebody point me in the right direction on how to read a binary file that is defined by a C struct? It has a few #define directives inside the struct, which makes me think that it will complicate things.
The structure looks something like this (although it's larger and more complicated than this):

struct Format {
    unsigned long str_totalstrings;
    unsigned long str_name;
    #define STR_ORDERED 0x2
    #define STR_ROT13 0x4
    unsigned char stuff[4];
    #define str_delimiter stuff[0]
};

I would really appreciate it if somebody could point me in the right direction on how to do this, or to a tutorial that covers this topic.

Thanks a lot in advance for your help.

There are some bad ideas and some good ideas.

It's a bad idea to:

  • Typecast a raw buffer into a struct
    • There are endianness issues (little-endian vs big-endian) when parsing integers longer than 1 byte, or floats.
    • There are byte alignment issues in structures, which are very compiler-dependent. One can try to disable alignment (or enforce some manual alignment), but that's generally a bad idea too. At the very least, you'll hurt performance by making the CPU access unaligned integers: the internal RISC core would have to do 3-4 operations instead of 1 (i.e. "do part 1 in the first word", "do part 2 in the second word", "merge the result") on every access. Or worse, the compiler pragmas that control alignment will be ignored and your code will break.
    • There are no exact size guarantees for the regular int, long, short, etc. types in C/C++. You can use types like int16_t, but these are available only on modern compilers.
    • Of course, this approach breaks completely with structures that reference other structures: one has to unroll them all manually.
  • Write parsers manually: it's much harder than it seems at first glance.
    • A good parser needs to do lots of sanity checking at every stage. It's easy to miss something, and even easier if you don't use exceptions.
    • Using exceptions makes you prone to failure if your parsing code is not exception-safe (i.e. written so that it can be interrupted at some point without leaking memory or forgetting to finalize some objects).
    • There could be performance issues (i.e. doing lots of unbuffered I/O instead of one OS read syscall followed by parsing the buffer, or vice versa, reading the whole thing at once instead of more granular, lazy reads where applicable).
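
To make the alignment point above concrete, here is a minimal sketch. The struct `Record` and its fields are hypothetical and not taken from the question; the amount of padding is implementation-defined, so the numbers in the comments are only what is commonly seen, not a guarantee.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical record: a 1-byte tag followed by a 4-byte value.
struct Record {
    std::uint8_t  tag;
    std::uint32_t value;
};

// Bytes the two fields occupy when stored back to back in a file.
constexpr std::size_t packed_size = sizeof(std::uint8_t) + sizeof(std::uint32_t);

// Padding bytes the compiler inserted (commonly 3, but implementation-defined).
constexpr std::size_t padding_bytes() {
    return sizeof(Record) - packed_size;
}

// Where `value` really starts in memory (commonly 4 rather than 1).
constexpr std::size_t value_offset() {
    return offsetof(Record, value);
}
```

If `padding_bytes()` is non-zero on your compiler, a file that stores the fields back to back cannot be overlaid with this struct by a plain cast.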

It's a good idea to:

  • Go cross-platform. Pretty much self-explanatory, with all the mobile devices, routers and IoT stuff booming in recent years.
  • Go declarative. Consider using a declarative spec to describe your structure, and then use a parser generator to generate a parser.

There are several tools available to do that:

  • Kaitai Struct: my favorite so far, cross-platform and cross-language, i.e. you describe your structure once and then compile it into a parser in C++, C#, Java, Python, Ruby, PHP, etc.
  • binpac: pretty dated but still usable, C++ only; similar to Kaitai in ideology, but unsupported since 2013.
  • Spicy: said to be a "modern rewrite" of binpac, AKA "binpac++", but still in early stages of development; can be used for smaller tasks, also C++ only.

Reading a binary defined by a struct is easy.

Format myFormat;
fread(&myFormat, sizeof(Format), 1, fp);

The #defines don't affect the structure at all. (Inside the struct is an odd place to put them, though.)
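
A small sketch of why that holds: the preprocessor handles `#define` lines before the compiler sees the struct, so they add no members. The names `WithMacros`, `WithoutMacros` and `delimiter_of` are made up for this illustration; the fields are the ones from the question.

```cpp
// The struct from the question, macros included.
struct WithMacros {
    unsigned long str_totalstrings;
    unsigned long str_name;
    #define STR_ORDERED 0x2
    #define STR_ROT13 0x4
    unsigned char stuff[4];
    #define str_delimiter stuff[0]
};

// The same members with the macros stripped out.
struct WithoutMacros {
    unsigned long str_totalstrings;
    unsigned long str_name;
    unsigned char stuff[4];
};

// Both structs have identical members, so they have the same size.
static_assert(sizeof(WithMacros) == sizeof(WithoutMacros),
              "#define lines do not change the struct layout");

// `f.str_delimiter` is plain text substitution for `f.stuff[0]`.
inline unsigned char delimiter_of(const WithMacros& f) {
    return f.str_delimiter;
}
```

The flag macros (`STR_ORDERED`, `STR_ROT13`) are just named constants, and `str_delimiter` is an alias for the first byte of `stuff`.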

However, this is not cross-platform safe. It is the simplest thing that could possibly work, in situations where you are assured the reader and the writer are using the same platform.

A better way would be to redefine your structure like this:

struct Format {
    Uint32 str_totalstrings;  //assuming unsigned long was 32 bits on the writer.
    Uint32 str_name;
    unsigned char stuff[4];
};

and then have a "platform_types.h" which typedefs Uint32 correctly for your compiler. Now you can read directly into the structure, but for endianness issues you still need to do something like this:

myFormat.str_totalstrings = FileToNative32(myFormat.str_totalstrings);
myFormat.str_name = FileToNative32(myFormat.str_name);

where FileToNative32 is either a no-op or a byte reverser, depending on the platform.
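
One possible shape for such a helper is sketched below. It assumes the file stores integers in little-endian order and that `Uint32` maps to `std::uint32_t`; both are assumptions for illustration, not something the question states. Instead of testing the host's byte order, it reassembles the value from its bytes, which acts as a no-op on little-endian hosts and as a byte reversal on big-endian ones.

```cpp
#include <cstdint>
#include <cstring>

typedef std::uint32_t Uint32;  // assumption: what platform_types.h would provide

// Assumption: the file is little-endian.
// `raw` holds the four bytes exactly as fread() left them in memory.
inline Uint32 FileToNative32(Uint32 raw) {
    unsigned char b[4];
    std::memcpy(b, &raw, sizeof b);  // recover the bytes in file order
    return  static_cast<Uint32>(b[0])
         | (static_cast<Uint32>(b[1]) << 8)
         | (static_cast<Uint32>(b[2]) << 16)
         | (static_cast<Uint32>(b[3]) << 24);
}
```

For a big-endian file, the shift amounts would simply be reversed.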

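You can also use a union to do this parsing if the data you want to parse is already in memory. The union has to overlay the struct with a byte array of the same size (not with a pointer), and the same alignment and endianness caveats apply. Reading one union member after writing another is well-defined in C, but in C++ it relies on a compiler extension.

union A {
    unsigned char buffer[sizeof(Format)];
    Format format;
};

A a;
memcpy(a.buffer, stuff_you_want_to_parse, sizeof a.buffer);  // needs <cstring>

// You can now access the members of the struct through the union.
if (a.format.str_totalstrings > 0)
    // do stuff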

Also remember that long can have different sizes on different platforms. If you depend on long being a certain size, consider using the types defined in stdint.h, such as uint32_t.
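
As a rough sketch of that advice, the struct from the question could be restated with fixed-width types. The name `FormatFixed` is made up here, and the 32-bit field width is an assumption about the platform that wrote the file.

```cpp
#include <cstdint>

struct FormatFixed {
    std::uint32_t str_totalstrings;  // assumption: 32 bits in the file
    std::uint32_t str_name;          // assumption: 32 bits in the file
    unsigned char stuff[4];
};

// Fails at compile time if the compiler adds padding, which would mean
// the struct no longer matches a 12-byte record in the file.
static_assert(sizeof(FormatFixed) == 12, "unexpected padding in FormatFixed");
```

The `static_assert` turns a silent layout mismatch into a build error, which is usually easier to diagnose than corrupted field values at run time.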

Using the C++ I/O library:

#include <fstream>
using namespace std;

ifstream ifs("file.dat", ios::binary);
Format f;
// read() copies raw bytes; get() is meant for delimited character input.
ifs.read(reinterpret_cast<char*>(&f), sizeof f);

Using the C I/O library:

#include <cstdio>
using namespace std;

FILE *fin = fopen("file.dat", "rb");
Format f;
fread(&f, sizeof f, 1, fin);

You have to find out the endianness of the machine where the file was written so you can interpret integers properly. Look out for an ILP32 vs LP64 mismatch (unsigned long is 4 bytes on the former and 8 bytes on the latter). The original structure packing/alignment might also be important.
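
Once the writer's byte order and field widths are known, one way to sidestep host differences is to decode each field from a byte buffer at an explicit offset. The sketch below assumes an ILP32 writer (4-byte unsigned long) and no padding, so a record is 12 bytes; `ByteOrder`, `read_u32`, `ParsedFormat`, `kRecordSize` and `parse_format` are all names invented for this example.

```cpp
#include <cstddef>
#include <cstdint>

enum class ByteOrder { Little, Big };

// Assemble a 32-bit value from 4 bytes in the given file byte order.
// Behaves the same on any host because it never reinterprets memory.
inline std::uint32_t read_u32(const unsigned char* p, ByteOrder order) {
    const std::uint32_t b0 = p[0], b1 = p[1], b2 = p[2], b3 = p[3];
    return order == ByteOrder::Little
        ? (b0 | (b1 << 8) | (b2 << 16) | (b3 << 24))
        : ((b0 << 24) | (b1 << 16) | (b2 << 8) | b3);
}

struct ParsedFormat {
    std::uint32_t str_totalstrings;
    std::uint32_t str_name;
    unsigned char stuff[4];
};

// Assumption: ILP32 writer, no padding, so one record is 12 bytes on disk.
const std::size_t kRecordSize = 12;

// `buf` must point at at least kRecordSize bytes read from the file.
inline ParsedFormat parse_format(const unsigned char* buf, ByteOrder order) {
    ParsedFormat f;
    f.str_totalstrings = read_u32(buf + 0, order);
    f.str_name         = read_u32(buf + 4, order);
    for (std::size_t i = 0; i < 4; ++i) f.stuff[i] = buf[8 + i];
    return f;
}
```

If the writer turns out to be LP64, the two integer fields would be 8 bytes each and the offsets and record size would have to change accordingly.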
