
Save and load data using ANSI C on any platform

Say I have 1 million structs, each containing integers, doubles, strings, and other structs, something like:

struct s1 {
    int f1;
    long f2;
    char* f3;
};

struct s2 {
    struct s1* f1;
    double f2;
};

How can I save these to a file in binary format, then look up and load them from that file, on platforms different from the one the executable was compiled on, without worrying about endianness, float representation and other platform-specific gotchas?

The reason for preferring a binary format is mainly the size of the resulting file. If integers alone look like "32435" and I have millions of them, the extra 3 bytes per integer would add quite a bit of size to the file.

Write them as ASCII text, XML or some similar non-binary format.

"platforms different from the one the executable was compiled on"

How different from the one the executable was compiled on? Do you need to support platforms that use non-IEEE floats? Platforms that use non-ASCII characters? Platforms that use non-8-bit bytes?

If you insist on binary, and insist on doing it yourself, probably your best bet is to define that in the storage format, int and long will each be stored as a sequence of 4 bytes, little-endian (or big-endian, but pick one and stick to it regardless of platform), containing exactly 8 significant bits per byte. double will likewise be stored as an IEEE double. Pointers introduce a whole world of hurt: the storage format must attach a unique identifier to each instance of s1, and then a pointer to s1 can be stored as an id value and looked up as part of deserialization.
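
As a minimal sketch of the fixed 4-byte little-endian integer encoding described above (the function names put_u32_le, get_u32_le, put_i32_le and get_i32_le are made up for illustration, and negative values are assumed to be stored in two's complement):

```c
#include <assert.h>
#include <stddef.h>

/* Store the low 32 bits of v in buf[0..3], least significant byte first.
   Only shifts and masks on values are used, so the host's byte order and
   the host's sizeof(long) do not affect the bytes written. */
static void put_u32_le(unsigned char *buf, unsigned long v)
{
    assert(buf != NULL);
    buf[0] = (unsigned char)(v & 0xFFUL);
    buf[1] = (unsigned char)((v >> 8) & 0xFFUL);
    buf[2] = (unsigned char)((v >> 16) & 0xFFUL);
    buf[3] = (unsigned char)((v >> 24) & 0xFFUL);
}

/* Rebuild the 32-bit value; the & 0xFFu masks honor the
   "exactly 8 significant bits per byte" rule on wide-byte hosts. */
static unsigned long get_u32_le(const unsigned char *buf)
{
    assert(buf != NULL);
    return  (unsigned long)(buf[0] & 0xFFu)
         | ((unsigned long)(buf[1] & 0xFFu) << 8)
         | ((unsigned long)(buf[2] & 0xFFu) << 16)
         | ((unsigned long)(buf[3] & 0xFFu) << 24);
}

/* Signed values: converting to unsigned long is defined as reduction
   modulo 2^N, which yields the two's complement bit pattern. */
static void put_i32_le(unsigned char *buf, long v)
{
    put_u32_le(buf, (unsigned long)v);
}

/* Decode without relying on implementation-defined narrowing casts. */
static long get_i32_le(const unsigned char *buf)
{
    unsigned long u = get_u32_le(buf);
    if (u <= 0x7FFFFFFFUL)
        return (long)u;
    return -(long)(0xFFFFFFFFUL - u) - 1L;
}
```

The caller would pass these 4-byte buffers to fwrite and fread on a stream opened in binary mode ("wb" / "rb"). Values outside the 32-bit range are silently truncated by this sketch, so a real format would want a range check before writing.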

Different platforms can then decide what types they want to use for each of the storage types. For example, if int is only 16 bits on a given platform, it will just have to use long for both the int and long types; for this reason, you should give the stored types domain-specific pseudonyms. Beware that it's impossible to avoid loss of precision in double values when converting to and from incompatible representations, since they might not have the same number of significant bits.
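
One way to produce an IEEE double in the file without assuming the host's own double layout is to take the value apart with frexp and put it back together with ldexp. The following is only a sketch under stated assumptions: put_f64_le and get_f64_le are hypothetical names, the 8 bytes are written little-endian to match the integer convention, and only zero and finite normal values are handled (no NaN, infinity or subnormals, and -0.0 is written as +0.0). On a host whose double carries more than 53 significant bits, the extra bits are truncated, which is the precision loss mentioned above.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Encode d as IEEE 754 binary64, little-endian, using arithmetic only. */
static void put_f64_le(unsigned char *buf, double d)
{
    unsigned long hi = 0UL, lo = 0UL;   /* upper and lower 32 bits */
    int e, i;
    double m;

    assert(buf != NULL);
    if (d != 0.0) {
        if (d < 0.0) { hi = 0x80000000UL; d = -d; }   /* sign bit */
        m = frexp(d, &e);          /* d == m * 2^e with 0.5 <= m < 1 */
        e += 1022;                 /* exponent of the 1.f form plus bias 1023 */
        assert(e > 0 && e < 2047); /* normal range only in this sketch */
        m = (m * 2.0 - 1.0) * 1048576.0;           /* fraction * 2^20 */
        hi |= ((unsigned long)e << 20) | (unsigned long)m;
        m = (m - (double)(unsigned long)m) * 4294967296.0; /* next 32 bits */
        lo = (unsigned long)m;
    }
    for (i = 0; i < 4; i++) {
        buf[i]     = (unsigned char)((lo >> (8 * i)) & 0xFFUL);
        buf[4 + i] = (unsigned char)((hi >> (8 * i)) & 0xFFUL);
    }
}

/* Decode 8 little-endian bytes of IEEE 754 binary64 into a host double. */
static double get_f64_le(const unsigned char *buf)
{
    unsigned long hi = 0UL, lo = 0UL;
    int e, i;
    double v;

    assert(buf != NULL);
    for (i = 3; i >= 0; i--) {
        lo = (lo << 8) | (unsigned long)(buf[i] & 0xFFu);
        hi = (hi << 8) | (unsigned long)(buf[4 + i] & 0xFFu);
    }
    e = (int)((hi >> 20) & 0x7FFUL);
    if (e == 0 && (hi & 0xFFFFFUL) == 0UL && lo == 0UL)
        return 0.0;
    assert(e != 0 && e != 2047);   /* subnormal, infinity, NaN unsupported */
    v = 1.0 + (double)(hi & 0xFFFFFUL) / 1048576.0
            + (double)lo / 4503599627370496.0;      /* lo / 2^52 */
    v = ldexp(v, e - 1023);
    return (hi & 0x80000000UL) ? -v : v;
}
```

Some toolchains need the math library linked explicitly (for example -lm) for frexp and ldexp.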

For text, non-ASCII platforms will have to include code to serialize their own text format to ASCII, and to deserialize ASCII to native text. Strictly speaking, you should also avoid using any characters in the text that aren't in the C basic character set, since they might not be representable at all on the target. You can make a similar decision about whether you're willing to count on target platforms to support Unicode in some way; if so, then UTF-8 is a reasonable interchange format for text.

Finally, for each struct on each platform, you can either:

  1. write (or perhaps auto-generate) code to serialize it, and code to deserialize it, or:
  2. make yourself a domain-specific language to define the structures, and a parser/interpreter that will serialize and deserialize according to that definition.
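
For option 1, hand-written code for struct s1 from the question might look roughly like the sketch below. The record layout (a 4-byte id, then f1, f2 and the string length as 4-byte little-endian fields, then the string bytes) and the names s1_write and s1_read are assumptions made for illustration, not an established format.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct s1 { int f1; long f2; char *f3; };

static void put_u32_le(unsigned char *p, unsigned long v)
{
    int i;
    for (i = 0; i < 4; i++)
        p[i] = (unsigned char)((v >> (8 * i)) & 0xFFUL);
}

static unsigned long get_u32_le(const unsigned char *p)
{
    unsigned long v = 0UL;
    int i;
    for (i = 3; i >= 0; i--)
        v = (v << 8) | (unsigned long)(p[i] & 0xFFu);
    return v;
}

/* Map a stored 32-bit two's complement pattern back to a signed value. */
static long u32_to_long(unsigned long u)
{
    return (u <= 0x7FFFFFFFUL) ? (long)u : -(long)(0xFFFFFFFFUL - u) - 1L;
}

/* Write one record into out; returns bytes written, or 0 if it won't fit. */
static size_t s1_write(unsigned char *out, size_t cap,
                       unsigned long id, const struct s1 *s)
{
    size_t n;
    assert(out != NULL && s != NULL && s->f3 != NULL);
    n = strlen(s->f3);
    if (cap < 16 || n > cap - 16)
        return 0;
    put_u32_le(out, id);                      /* stands in for the pointer */
    put_u32_le(out + 4, (unsigned long)s->f1);
    put_u32_le(out + 8, (unsigned long)s->f2);
    put_u32_le(out + 12, (unsigned long)n);   /* length prefix, no '\0' */
    memcpy(out + 16, s->f3, n);
    return 16 + n;
}

/* Read one record; returns bytes consumed, or 0 on a short or bad record.
   On success s->f3 is malloc'ed and the caller must free it. */
static size_t s1_read(const unsigned char *in, size_t len,
                      unsigned long *id, struct s1 *s)
{
    size_t n;
    assert(in != NULL && id != NULL && s != NULL);
    if (len < 16)
        return 0;
    n = (size_t)get_u32_le(in + 12);
    if (n > len - 16)
        return 0;
    s->f3 = (char *)malloc(n + 1);
    if (s->f3 == NULL)
        return 0;
    *id = get_u32_le(in);
    s->f1 = (int)u32_to_long(get_u32_le(in + 4));
    s->f2 = u32_to_long(get_u32_le(in + 8));
    memcpy(s->f3, in + 16, n);
    s->f3[n] = '\0';
    return 16 + n;
}
```

A struct s2 record would then store the id of its s1 in place of the pointer, and the loader would map ids back to pointers once the s1 records have been read. The string bytes are copied as-is here, so this sketch assumes both sides agree on the text encoding.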

Sounds like a lot of work to me, though, to do something that's been done before.

If you want to avoid the headaches that you've described, DON'T use binary. Use text, the universal* format.

*until you start getting into locales.
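
A minimal sketch of the text approach for the numeric fields, assuming one record per line and the default "C" locale (so the decimal point is always '.'). The names rec_to_text and rec_from_text are invented for this example, and strings are left out because they would need quoting or escaping rules of their own. The %.17g conversion prints enough digits for an IEEE double to survive the round trip.

```c
#include <assert.h>
#include <stdio.h>

/* Format one record as the line "f1 f2 d\n".
   out must have room for about 80 characters.
   Returns the number of characters written. */
static int rec_to_text(char *out, int f1, long f2, double d)
{
    assert(out != NULL);
    return sprintf(out, "%d %ld %.17g\n", f1, f2, d);
}

/* Parse a line produced by rec_to_text.
   Returns 1 if all three fields were read, 0 otherwise. */
static int rec_from_text(const char *in, int *f1, long *f2, double *d)
{
    assert(in != NULL && f1 != NULL && f2 != NULL && d != NULL);
    return sscanf(in, "%d %ld %lf", f1, f2, d) == 3;
}
```

With files, the same format strings would go to fprintf and fscanf. A program that calls setlocale would need to switch LC_NUMERIC back to "C" around these calls, which is the locale caveat in the footnote.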
