Cross platform UTF-8 char file data encoding/decoding
I encode a binary file with data structures where one of the properties is of type wchar_t, for UTF-8 support. Each struct looks like this:
struct DataBlock{
    wchar_t charcode;
    int width;
    int height;
    // etc.
};
The encoding happens on Windows, where wchar_t is 2 bytes. The decoding happens on Linux, where it is 4 bytes, so the values read for charcode are wrong on the Linux side. What is the best way to fix that difference without using third-party UTF libraries? Is it OK to encode charcode into, for example, an 'int' on Windows and then cast it to wchar_t on Linux?
Writing binary structures is inherently non-portable. Bad things can happen almost everywhere: for any type larger than char you can have endianness problems, and for any type smaller than 8 bytes you can have alignment problems - even if the latter can be mitigated with #pragma on architectures and compilers that support it. You should avoid that and instead use a kind of marshalling, that is, serialization in a definite and architecture-independent way. For example:
wchar_t charcode - assuming that your charcode will never use more than 2 bytes, you explicitly convert it to a char[2] (in fact, this forces a 2-byte big-endian representation):

code[0] = (charcode >> 8) & 0xFF;
code[1] = charcode & 0xFF;
int - you know whether you need 2, 4, or 8 bytes to represent any value for width and height; assuming it is 4 (int32_t or uint32_t):

code[0] = (width >> 24) & 0xFF;
code[1] = (width >> 16) & 0xFF;
code[2] = (width >> 8) & 0xFF;
code[3] = width & 0xFF;
So you explicitly define a conversion of your struct DataBlock into a char array of a definite size. Now you do have something portable across any network, architecture, or compiler. Of course, you do have to explicitly write the two routines for encoding and decoding, but it is the only way I know to get portable binary structures.
Happily, there are the htonx functions, which can help you. They explicitly take 16- or 32-bit integers and force a conversion into network (big-endian) order. From the Linux man page:
#include <arpa/inet.h>
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
The htonl() function converts the unsigned integer hostlong from host byte order to network byte order.
The htons() function converts the unsigned short integer hostshort from host byte order to network byte order.
The ntohl() function converts the unsigned integer netlong from network byte order to host byte order.
The ntohs() function converts the unsigned short integer netshort from network byte order to host byte order.
That way, you directly write the fields of your struct:
uint32_t l = htonl(data.charcode); // or htons/uint16_t if 16 bits are enough
fwrite(&l, sizeof(l), 1, fdout);   // fixed width, unlike long, which is 8 bytes on 64-bit Linux
and the same for reading:
uint32_t l;
fread(&l, sizeof(l), 1, fdin);
data.charcode = ntohl(l);
These functions have been defined for a long time on Unix-like systems, and also seem to be defined in recent versions of Windows compilers. Of course, if you are absolutely sure that you will only ever use little-endian architectures, you could skip the endianness conversion. But be sure to write that in your documentation, preferably in a red flashing font...
The full Unicode character set currently requires 32 bits to represent all possible values:

UTF-32 encoding stores each character as 4 bytes, i.e. one uint32_t.
UTF-16 encoding stores each Unicode character in one or two uint16_t.
UTF-8 encoding stores each Unicode character in one to four uint8_t.

Typically, Windows uses wchar_t to store Unicode text in UTF-16 encoding. At the time this was decided, UTF-16 was able to hold the whole Unicode character set, which is no longer true today. Linux uses UTF-8 encoding; most implementations use char to store Unicode text in UTF-8.
The standard gives you some tools to cope with encoding conversions:
You can use wbuffer_convert together with a codecvt facet to convert between wchar_t UTF-16 and UTF-8 encodings when reading/writing streams.
You can also use wstring_convert to convert strings that are already loaded in memory between UTF-16 and UTF-8.
If you just want a cross-system data structure in a binary file without making conversions, just use:

struct DataBlock{
    uint16_t charcode; // if you assign it from a wchar_t on Windows, no problem
    ...
};