
Cross platform UTF-8 char file data encoding/decoding

I encode a binary file with data structures where one of the properties is of type wchar_t, for UTF-8 support.

Each struct looks like this:

 struct DataBlock {
     wchar_t charcode;
     int width;
     int height;
     // etc.
 };

The encoding happens on Windows where wchar_t size is 2 bytes.

The decoding of the file happens on Linux, where the size is 4 bytes, so the read-out values for charcode are wrong on the Linux side.

What is the best way to fix that difference without using 3rd-party libs for UTF? Is it OK to encode charcode into, for example, an 'int' data type on Windows and then cast it to wchar_t on Linux?

Writing binary structures is inherently non-portable. Bad things can happen almost everywhere:

  • for any type larger than a char you can have endianness problems
  • for any type shorter than 8 bytes you can have alignment problems, even if this can be mitigated with #pragmas on architectures and compilers that support them (see the sketch after this list)
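
As a quick illustration of both points, here is a minimal sketch; the numbers it prints depend on your compiler and target, and the struct simply mirrors the one from the question:

#include <cstddef>
#include <cstdint>
#include <cstdio>

struct DataBlock {
    wchar_t charcode;   // 2 bytes on Windows, 4 bytes on Linux
    int     width;
    int     height;
};

int main() {
    // sizeof and member offsets change with wchar_t size and padding rules,
    // so the raw bytes of the struct mean different things on each platform.
    std::printf("sizeof(wchar_t)   = %zu\n", sizeof(wchar_t));
    std::printf("sizeof(DataBlock) = %zu\n", sizeof(DataBlock));
    std::printf("offsetof(width)   = %zu\n", offsetof(DataBlock, width));

    // Endianness check: on a little-endian CPU the first byte of a
    // multi-byte integer holds its least significant byte.
    std::uint32_t probe = 1;
    std::printf("little endian     = %d\n",
                *reinterpret_cast<unsigned char*>(&probe) == 1);
    return 0;
}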

You should avoid that and instead use a kind of marshalling, that is, serialization in a definite and architecture-independent way. For example:

  • wchar_t charcode - assuming that your charcode will never use more than 2 bytes, you explicitly convert it to a char[2] (in fact, forcing a 2-byte big-endian representation):

     code[0] = (charcode >> 8) & 0xFF;
     code[1] = charcode & 0xFF;
  • int - you know whether you need 2, 4 or 8 bytes to represent any value of width and height; assuming it is 4 (int32_t or uint32_t):

     code[0] = (width >> 24) & 0xFF;
     code[1] = (width >> 16) & 0xFF;
     code[2] = (width >> 8) & 0xFF;
     code[3] = width & 0xFF;

So you explicitly define a conversion of your struct DataBlock into a char array with a definite size. Now you have something portable over any network, architecture or compiler. Of course, you do have to explicitly write the two routines for encoding and decoding, but it is the only way I know to get portable binary structures.
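
Here is a minimal sketch of such a pair of routines for the DataBlock above; the wire layout (2 bytes for charcode, 4 bytes each for width and height, all big-endian) and the names encode, decode, put16, put32 and kWireSize are just choices made for this example:

#include <cstdint>

struct DataBlock { wchar_t charcode; int width; int height; };  // as in the question

// Fixed on-disk layout: 2 bytes charcode + 4 bytes width + 4 bytes height,
// all big-endian, independent of compiler padding and host byte order.
enum { kWireSize = 10 };

static void put16(unsigned char* p, std::uint16_t v) {
    p[0] = (v >> 8) & 0xFF;
    p[1] =  v       & 0xFF;
}

static void put32(unsigned char* p, std::uint32_t v) {
    p[0] = (v >> 24) & 0xFF;
    p[1] = (v >> 16) & 0xFF;
    p[2] = (v >>  8) & 0xFF;
    p[3] =  v        & 0xFF;
}

void encode(const DataBlock& d, unsigned char out[kWireSize]) {
    put16(out,     static_cast<std::uint16_t>(d.charcode));   // assumes charcode fits in 2 bytes
    put32(out + 2, static_cast<std::uint32_t>(d.width));
    put32(out + 6, static_cast<std::uint32_t>(d.height));
}

DataBlock decode(const unsigned char in[kWireSize]) {
    DataBlock d;
    d.charcode = static_cast<wchar_t>((in[0] << 8) | in[1]);
    d.width  = static_cast<int>((std::uint32_t(in[2]) << 24) | (in[3] << 16) | (in[4] << 8) | in[5]);
    d.height = static_cast<int>((std::uint32_t(in[6]) << 24) | (in[7] << 16) | (in[8] << 8) | in[9]);
    return d;
}

The resulting 10-byte buffer can be written and read with fwrite/fread and will decode to the same values on both platforms.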

Fortunately, the htonX family of functions can help you. They take 16- or 32-bit integers and force a conversion to network (big-endian) byte order. From the Linux man page:

#include <arpa/inet.h>
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);

The htonl() function converts the unsigned integer hostlong from host byte order to network byte order.

The htons() function converts the unsigned short integer hostshort from host byte order to network byte order.

The ntohl() function converts the unsigned integer netlong from network byte order to host byte order.

The ntohs() function converts the unsigned short integer netshort from network byte order to host byte order.

That way, you directly write the fields of your struct:

uint32_t l = htonl(data.charcode);       // or htons/uint16_t if you only need 16 bits
fwrite(&l, sizeof(uint32_t), 1, fdout);  // sizeof(uint16_t) if you used 16 bits

and same for reading :

uint32_t l;
fread(&l, sizeof(uint32_t), 1, fdin);
data.charcode = ntohl(l);

These functions have been defined for a long time on Unix-like systems, and they are also available with recent versions of Windows compilers.
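
Putting it together, here is a minimal sketch of writing and reading the struct field by field with those functions; the names write_block and read_block and the field order are choices made for this example:

#include <arpa/inet.h>   // htons/htonl/ntohs/ntohl; on Windows use <winsock2.h>
#include <cstdint>
#include <cstdio>

struct DataBlock { wchar_t charcode; int width; int height; };  // as in the question

bool write_block(std::FILE* out, const DataBlock& d) {
    std::uint16_t c = htons(static_cast<std::uint16_t>(d.charcode)); // 2 bytes on disk
    std::uint32_t w = htonl(static_cast<std::uint32_t>(d.width));    // 4 bytes on disk
    std::uint32_t h = htonl(static_cast<std::uint32_t>(d.height));
    return std::fwrite(&c, sizeof c, 1, out) == 1
        && std::fwrite(&w, sizeof w, 1, out) == 1
        && std::fwrite(&h, sizeof h, 1, out) == 1;
}

bool read_block(std::FILE* in, DataBlock& d) {
    std::uint16_t c;
    std::uint32_t w, h;
    if (std::fread(&c, sizeof c, 1, in) != 1) return false;
    if (std::fread(&w, sizeof w, 1, in) != 1) return false;
    if (std::fread(&h, sizeof h, 1, in) != 1) return false;
    d.charcode = static_cast<wchar_t>(ntohs(c));
    d.width    = static_cast<int>(ntohl(w));
    d.height   = static_cast<int>(ntohl(h));
    return true;
}

Writing the fields one at a time also sidesteps whatever padding the compiler may insert into the struct itself.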

Of course, if you are absolutely sure that you will only ever use little-endian architectures, you could skip the endianness conversion entirely. But be sure to write that in your documentation, preferably in a red flashing font...

The full Unicode character set currently requires up to 21 bits per code point (up to U+10FFFF), so any single code point fits in 32 bits:

  • The UTF-32 encoding stores each character in 4 bytes, i.e. one uint32_t.
  • The UTF-16 encoding stores each Unicode character in one or two uint16_t code units.
  • The UTF-8 encoding stores each Unicode character in one to four uint8_t code units, as illustrated below.
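
For example, a character outside the BMP such as U+1F600 needs four UTF-8 bytes and a UTF-16 surrogate pair, but only one UTF-32 unit; a small sketch to check the code-unit counts:

#include <cstdio>

int main() {
    // U+1F600 in the three encodings; the char16_t/char32_t literals are
    // standard C++11, the UTF-8 bytes are spelled out by hand.
    const unsigned char utf8[]  = { 0xF0, 0x9F, 0x98, 0x80 };  // 4 bytes
    const char16_t      utf16[] = u"\U0001F600";               // D83D DE00 + NUL
    const char32_t      utf32[] = U"\U0001F600";               // 0001F600 + NUL

    std::printf("UTF-8: %zu bytes, UTF-16: %zu units, UTF-32: %zu unit\n",
                sizeof utf8,
                sizeof utf16 / sizeof utf16[0] - 1,
                sizeof utf32 / sizeof utf32[0] - 1);
    return 0;
}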

Typically, Windows uses wchar_t to store Unicode text in UTF-16 encoding. When this was decided, a single 16-bit unit was able to hold the whole Unicode character set, which is no longer true today. Linux uses the UTF-8 encoding; most implementations use char to store Unicode text as UTF-8.

The standard gives you some tools to cope with encoding conversions:

  • You can use std::wbuffer_convert together with a codecvt facet to convert between the wchar_t (UTF-16) and UTF-8 encodings when reading/writing streams.

  • You can also use std::wstring_convert to convert strings that are already loaded in memory between UTF-16 and UTF-8, as in the sketch below.
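
A minimal sketch of the in-memory case (these facets are deprecated since C++17 but still ship with the major standard libraries, and the question rules out third-party libraries); char16_t is used instead of wchar_t so the code means the same thing on Windows and Linux:

#include <codecvt>
#include <locale>
#include <string>

int main() {
    // UTF-16 <-> UTF-8 round trip on in-memory strings.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::u16string utf16 = u"h\u00E9llo";          // "héllo" as UTF-16
    std::string    utf8  = conv.to_bytes(utf16);   // encode to UTF-8
    std::u16string back  = conv.from_bytes(utf8);  // decode back to UTF-16

    return back == utf16 ? 0 : 1;
}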

If you just want a cross-system data structure in a binary file, without making any conversions, just use:

struct DataBlock{
    uint16_t charcode;   // if you assign from a wchar_t on Windows, no problem
    ...
};
