Cross platform UTF-8 char file data encoding/decoding
I encode a binary file with data structures where one of the properties is of type wchar_t, for UTF-8 support. Each struct looks like this:
struct DataBlock{
    wchar_t charcode;
    int width;
    int height;
    // etc.
};
The encoding happens on Windows, where wchar_t is 2 bytes. The decoding happens on Linux, where it is 4 bytes, so the values read for charcode are wrong on the Linux side. What is the best way to fix that difference without using third-party UTF libraries? Is it OK to encode charcode into, for example, an 'int' on Windows and then cast it to wchar_t on Linux?
Writing binary structures is inherently non-portable. Bad things can happen almost everywhere: for any type larger than char you can have endianness problems, and for any type smaller than 8 bytes you can have alignment problems - even if the latter can be mitigated with #pragma on architectures and compilers that support it. You should avoid that and instead use a kind of marshalling, that is, serialization in a definite and architecture-independent way. For example:
wchar_t charcode - assuming that your charcode will never use more than 2 bytes, you explicitly convert it to a char[2] (in fact, this forces a 2-byte big-endian representation):

code[0] = (charcode >> 8) & 0xFF;
code[1] = charcode & 0xFF;
int - you know whether you need 2, 4, or 8 bytes to represent any value for width and height; assuming it is 4 (int32_t or uint32_t):

code[0] = (width >> 24) & 0xFF;
code[1] = (width >> 16) & 0xFF;
code[2] = (width >> 8) & 0xFF;
code[3] = width & 0xFF;
So you explicitly define a conversion of your struct DataBlock into a char array of a definite size. Now you do have something portable across any network, architecture, or compiler. Of course, you do have to explicitly write the two routines for encoding and decoding, but it is the only way I know to get portable binary structures.
Happily, there are the htonx functions, which can help you. They explicitly take 16- or 32-bit integers and force a conversion into network (big-endian) order. From the Linux man page:
#include <arpa/inet.h>
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
The htonl() function converts the unsigned integer hostlong from host byte order to network byte order.
The htons() function converts the unsigned short integer hostshort from host byte order to network byte order.
The ntohl() function converts the unsigned integer netlong from network byte order to host byte order.
The ntohs() function converts the unsigned short integer netshort from network byte order to host byte order.
That way, you directly write the fields of your struct:
uint32_t l = htonl(data.charcode); // or htons/uint16_t if 16 bits are enough
fwrite(&l, sizeof(l), 1, fdout);   // fixed width, unlike long, which is 8 bytes on 64-bit Linux
and the same for reading:
uint32_t l;
fread(&l, sizeof(l), 1, fdin);
data.charcode = ntohl(l);
These functions have been defined for a long time on Unix-like systems, and also seem to be defined in recent versions of Windows compilers. Of course, if you are absolutely sure that you will only ever use little-endian architectures, you could skip the endianness conversion. But be sure to write that in your documentation, preferably in a red flashing font...
The full Unicode character set currently requires 32 bits to represent all possible values:

UTF-32 encoding stores each character as 4 bytes, i.e. one uint32_t.
UTF-16 encoding stores each Unicode character in one or two uint16_t.
UTF-8 encoding stores each Unicode character in one to four uint8_t.

Typically, Windows uses wchar_t to store Unicode text in UTF-16 encoding. At the time this was decided, UTF-16 was able to hold the whole Unicode character set, which is no longer true today. Linux uses UTF-8 encoding; most implementations use char to store Unicode text in UTF-8.
The standard gives you some tools to cope with encoding conversions:
You can use wbuffer_convert together with a codecvt facet to convert between wchar_t UTF-16 and UTF-8 encodings when reading/writing streams.
You can also use wstring_convert to convert strings that are already loaded in memory between UTF-16 and UTF-8.
If you just want a cross-system data structure in a binary file without making conversions, just use:

struct DataBlock{
    uint16_t charcode; // if you assign it from a wchar_t on Windows, no problem
    ...
};