I encode a binary file with data structures where one of the properties is of wchar_t type for UTF-8 support.
Each struct looks like this:
struct DataBlock{
    wchar_t charcode;
    int width;
    int height;
    // etc.
};
The encoding happens on Windows, where the size of wchar_t is 2 bytes.
The decoding of the file happens on Linux, where the size is 4 bytes. So the values read back for charcode are wrong on the Linux side.
What is the best way to fix that difference without using third-party UTF libraries? Is it OK to encode charcode as, for example, an int on Windows and then cast it to wchar_t on Linux?
Writing binary structures is inherently non-portable. Bad things can happen almost everywhere:
- the size of basic types other than char can differ between platforms
- you can have endianness problems
- the struct layout (padding) can differ, unless you control it with #pragmas on architectures and compilers that support them

You should avoid that and instead use a kind of marshalling, that is, serialization in a well-defined, architecture-independent way. For example:
- wchar_t charcode: assuming that your charcode will never use more than 2 bytes, you explicitly convert it to a char[2] (in fact I'm forcing a 2-byte big-endian representation):
    code[0] = (charcode >> 8) & 0xFF;
    code[1] = charcode & 0xFF;
- int: you know whether you need 2, 4 or 8 bytes to represent any value for width and height; assuming it is 4 (int32_t or uint32_t):
    code[0] = (width >> 24) & 0xFF;
    code[1] = (width >> 16) & 0xFF;
    code[2] = (width >> 8) & 0xFF;
    code[3] = width & 0xFF;
So you explicitly define a conversion of your struct DataBlock into a char array with a definite size. Now you do have something portable across any network, architecture or compiler. Of course, you have to explicitly write the two routines for encoding and decoding, but it is the only way I know to have portable binary structures.
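The two routines could look roughly like the sketch below. It assumes a fixed-width variant of DataBlock (charcode limited to 16 bits, width and height stored as 32 bits); the names encode, decode, put_u32, get_u32 and kEncodedSize are made up for illustration and are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-width mirror of the question's DataBlock.
struct DataBlock {
    std::uint16_t charcode;  // assumption: code values fit in 16 bits
    std::int32_t width;
    std::int32_t height;
};

// Size of one record on disk: 2 + 4 + 4 bytes, with no padding.
constexpr std::size_t kEncodedSize = 10;

// Store a 32-bit value as 4 big-endian bytes.
static void put_u32(unsigned char* out, std::uint32_t v) {
    out[0] = (v >> 24) & 0xFF;
    out[1] = (v >> 16) & 0xFF;
    out[2] = (v >> 8) & 0xFF;
    out[3] = v & 0xFF;
}

// Rebuild a 32-bit value from 4 big-endian bytes.
static std::uint32_t get_u32(const unsigned char* in) {
    return (std::uint32_t(in[0]) << 24) | (std::uint32_t(in[1]) << 16) |
           (std::uint32_t(in[2]) << 8) | std::uint32_t(in[3]);
}

// Serialize d into exactly kEncodedSize bytes at out.
void encode(const DataBlock& d, unsigned char* out) {
    assert(out != nullptr);
    out[0] = (d.charcode >> 8) & 0xFF;
    out[1] = d.charcode & 0xFF;
    put_u32(out + 2, static_cast<std::uint32_t>(d.width));
    put_u32(out + 6, static_cast<std::uint32_t>(d.height));
}

// Rebuild a DataBlock from kEncodedSize bytes at in.
// The unsigned-to-signed casts assume a two's complement platform.
DataBlock decode(const unsigned char* in) {
    assert(in != nullptr);
    DataBlock d;
    d.charcode = static_cast<std::uint16_t>((in[0] << 8) | in[1]);
    d.width = static_cast<std::int32_t>(get_u32(in + 2));
    d.height = static_cast<std::int32_t>(get_u32(in + 6));
    return d;
}
```

The writer then calls encode and writes the 10-byte buffer with fwrite; the reader reads 10 bytes and calls decode. Neither side depends on sizeof(wchar_t), sizeof(int), struct padding or the host byte order.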
Fortunately, the htonx functions (htonl and htons, with their ntohl and ntohs counterparts) can help you. They explicitly take 16- or 32-bit integers and force a conversion to network (big-endian) order. From the Linux man page:
#include <arpa/inet.h>
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
The htonl() function converts the unsigned integer hostlong from host byte order to network byte order.
The htons() function converts the unsigned short integer hostshort from host byte order to network byte order.
The ntohl() function converts the unsigned integer netlong from network byte order to host byte order.
The ntohs() function converts the unsigned short integer netshort from network byte order to host byte order.
That way, you write the fields of your struct directly:
// use uint32_t, not long: long is 4 bytes on Windows but 8 bytes on 64-bit Linux
uint32_t l = htonl(data.charcode); // or uint16_t and htons if you only need 16 bits
fwrite(&l, sizeof(l), 1, fdout);
and the same for reading:
uint32_t l;
fread(&l, sizeof(l), 1, fdin);
data.charcode = ntohl(l);
These functions have been defined for a long time on Unix-like systems, and are also available on Windows through <winsock2.h> (linking against Ws2_32.lib).
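Applied to the whole struct, the field-by-field approach could look like the sketch below. It assumes a POSIX system (on Windows the include would be <winsock2.h> instead of <arpa/inet.h>) and a fixed-width variant of DataBlock; write_block and read_block are hypothetical helper names.

```cpp
#include <arpa/inet.h>  // htonl, htons, ntohl, ntohs (POSIX header)
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical fixed-width mirror of the question's DataBlock.
struct DataBlock {
    std::uint16_t charcode;  // assumption: code values fit in 16 bits
    std::int32_t width;
    std::int32_t height;
};

// Write each field separately in network byte order, so struct padding
// never reaches the file. Returns false if any write fails.
bool write_block(std::FILE* out, const DataBlock& d) {
    assert(out != nullptr);
    const std::uint16_t c = htons(d.charcode);
    const std::uint32_t w = htonl(static_cast<std::uint32_t>(d.width));
    const std::uint32_t h = htonl(static_cast<std::uint32_t>(d.height));
    return std::fwrite(&c, sizeof c, 1, out) == 1 &&
           std::fwrite(&w, sizeof w, 1, out) == 1 &&
           std::fwrite(&h, sizeof h, 1, out) == 1;
}

// Read the fields back in the same order and convert to host byte order.
// Returns false on a short read; d is left untouched in that case.
bool read_block(std::FILE* in, DataBlock& d) {
    assert(in != nullptr);
    std::uint16_t c;
    std::uint32_t w;
    std::uint32_t h;
    if (std::fread(&c, sizeof c, 1, in) != 1 ||
        std::fread(&w, sizeof w, 1, in) != 1 ||
        std::fread(&h, sizeof h, 1, in) != 1) {
        return false;
    }
    d.charcode = ntohs(c);
    d.width = static_cast<std::int32_t>(ntohl(w));   // assumes two's complement
    d.height = static_cast<std::int32_t>(ntohl(h));
    return true;
}
```

Both files should be opened in binary mode ("wb" / "rb"), which matters on Windows. Each record takes 10 bytes on disk regardless of the platform that wrote it.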
Of course, if you are absolutely sure that you will only ever use little-endian architectures, you could even skip the endianness conversion. But be sure to write that in your documentation, preferably in a red flashing font...
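Unicode code points currently go up to U+10FFFF, so a 32-bit type (uint32_t) is needed to hold any code point in a single unit. UTF-16 is built on 16-bit units (uint16_t) and UTF-8 on 8-bit units (uint8_t); both use several units for the characters that do not fit in one.
Typically, Windows uses wchar_t to store Unicode text in UTF-16 encoding. At the time this was decided, 16 bits were enough to hold the whole Unicode character set, which is no longer true today: characters outside the Basic Multilingual Plane take two wchar_t units (a surrogate pair). Linux typically uses UTF-8, and most implementations use char to store Unicode text in UTF-8.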
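To see why one 16-bit unit is not always enough, here is a small sketch of how a code point maps to UTF-16 units. The function name to_utf16 is made up for this example; the arithmetic is the standard surrogate-pair rule.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as one or two UTF-16 code units.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    // Valid code points are 0..0x10FFFF, excluding the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: fits in a single 16-bit unit.
        return {static_cast<std::uint16_t>(cp)};
    }
    // Supplementary planes: split the remaining 20 bits into two halves.
    cp -= 0x10000;
    return {static_cast<std::uint16_t>(0xD800 | (cp >> 10)),     // high surrogate
            static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF))};  // low surrogate
}
```

For example, U+20AC (the euro sign) stays a single unit 0x20AC, while U+1F600 becomes the pair 0xD83D 0xDE00. A single 2-byte charcode field can therefore only represent the first kind of character.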
The standard library gives you some tools to cope with encoding conversions:
- You can use wbuffer_convert together with a codecvt facet to convert between wchar_t UTF-16 and UTF-8 encoding when reading/writing streams.
- You can also use wstring_convert to convert strings that are already loaded in memory between UTF-16 and UTF-8.
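A minimal sketch of the in-memory conversion follows. It uses char16_t and std::u16string rather than wchar_t so that the code unit is 16 bits on both Windows and Linux; the helper names utf16_to_utf8 and utf8_to_utf16 are made up for this example. Note that wstring_convert and the <codecvt> facets were deprecated in C++17, so newer compilers may emit deprecation warnings.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-16 string to UTF-8 bytes.
// Throws std::range_error if the input is not valid UTF-16.
std::string utf16_to_utf8(const std::u16string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(s);
}

// Convert UTF-8 bytes to a UTF-16 string.
// Throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_utf16(const std::string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(s);
}
```

With this, utf16_to_utf8(u"\u20AC") yields the three bytes E2 82 AC, and a character outside the Basic Multilingual Plane round-trips as two UTF-16 units.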
If you just want to use a cross-system data structure in a binary file without making conversions, just use:
struct DataBlock{
    uint16_t charcode; // if you assign it from a wchar_t on Windows, no problem
    ...
};