
UTF-8, Unicode, and how does a machine interpret bytes?

Here is what I have understood so far:

  1. A Unicode character can be represented as a sequence of up to 4 bytes. So if a character takes two or more bytes, byte order matters, depending on whether the machine is big-endian (BEM) or little-endian (LEM).
  2. UTF-8 writes to a file or network stream one byte at a time (no multibyte reads or writes). So even if a character takes two or more bytes, each byte is written individually while encoding. It therefore does not matter whether the machine is BEM or LEM: decoding always reads the bytes back in the same order, and nothing is swapped on write or read.
  3. UTF-16 and UTF-32 work in units of two and four bytes respectively, so LEM versus BEM really does matter here because of the multibyte reads and writes.
  4. In addition, I understand how UTF-8 knows how to interpret bytes as a character when reading from a file (decoding).
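To make points 1 to 3 concrete, here is a minimal sketch, not a production encoder. The helper names `encode_utf8` and `utf16_unit_bytes` are made up for this illustration, and the input is assumed to be a valid code point. It shows that UTF-8 fixes the order of its bytes by definition, while a single 16-bit UTF-16 code unit can be serialized in two different byte orders.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (<= U+10FFFF, not a surrogate).
// The order of the resulting bytes is fixed by UTF-8 itself,
// independent of the machine's endianness.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: serialize one 16-bit UTF-16 code unit.
// Here the caller has to pick a byte order, which is exactly where
// endianness enters for UTF-16 (and likewise for UTF-32).
std::string utf16_unit_bytes(char16_t unit, bool big_endian) {
    char hi = static_cast<char>(unit >> 8);
    char lo = static_cast<char>(unit & 0xFF);
    return big_endian ? std::string{hi, lo} : std::string{lo, hi};
}
```

For Ф (U+0424), `encode_utf8` yields the bytes D0 A4 in that order on any machine, whereas `utf16_unit_bytes` yields 04 24 for big-endian and 24 04 for little-endian.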

So, here is the example:

I declared and initialized a String variable as "ANФГ" in C++.
Questions:

  1. In C++, char is a one-byte character data type. Is the String class based on char[] in C++?
  2. Can I declare a String variable this way? Is UTF-8 the default encoding?
  3. Suppose I write this string to a file. It should be represented as A (one byte), N (one byte), Ф (a two-byte sequence) and Г (a two-byte sequence). How will it be stored in the String and in the file? What addresses will these 6 bytes have?
  4. How will it be read back from the file with regard to BEM and LEM? Does C++ know the order of the memory addresses where these bytes are stored?

EDIT_1: There is one thing I don't understand. Say I have three bytes: 1000 1111, 1100 0000 and 0100 0000. The first and second together represent one character in UTF-8, and the third represents another. The bytes are in the order I wrote above. Every byte has its own address, right? But when a multibyte write happens, are two bytes stored in one place? I mean, does any output stream write data in left-to-right order, and is it then read back left-to-right as well? LEM or BEM swaps bytes, but only for multibyte writes. When we write only one byte at a time, does each byte keep its correct left-to-right order?
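The ordering question in EDIT_1 can be checked with a small sketch. It uses an in-memory `std::stringstream` as a stand-in for a file, and the function names `round_trip` and `memory_bytes` are invented for this illustration. The first function writes bytes one at a time and reads them back; the second exposes how a 16-bit integer is laid out in memory, which is the only place endianness shows up.

```cpp
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

// Write each byte individually to a stream, then read the stream back.
// A byte stream preserves the order in which bytes were written,
// regardless of the machine's endianness.
std::string round_trip(const std::string& bytes) {
    std::stringstream stream;
    for (char b : bytes) {
        stream.put(b);
    }
    std::string back;
    char b;
    while (stream.get(b)) {
        back += b;
    }
    return back;
}

// Copy the in-memory representation of a 16-bit integer into a string.
// The result depends on the machine: a little-endian machine stores the
// low byte at the lower address, a big-endian machine the high byte.
std::string memory_bytes(std::uint16_t value) {
    std::string out(sizeof value, '\0');
    std::memcpy(&out[0], &value, sizeof value);
    return out;
}
```

Passing the three bytes 8F C0 40 to `round_trip` returns them in the same order. By contrast, `memory_bytes(0x0424)` returns 24 04 on a little-endian machine and 04 24 on a big-endian one, because there a single two-byte value is being split into bytes.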

  1. Yes, std::string (or rather, std::basic_string<char>) uses char to store its data. It is encoding-agnostic, so if you call size(), for instance, you get the number of chars representing the string, not the number of characters or code points.
  2. No, the encoding of string literals is implementation-defined. Since C++11, you can use the u8 prefix to get UTF-8 string literals (e.g. u8"ANФГ"). Note that from C++20 onward such literals have type const char8_t[] rather than const char[].
  3. If you've used UTF-8 string literals throughout, the std::string will contain UTF-8, and UTF-8 will be written to the file if you're using, e.g., operator<<().
  4. C++ does not keep track of whatever character encoding your file happens to be in (and therefore does not keep track of its endianness either). If you happen to be using UTF-8 end-to-end, endianness is irrelevant, since UTF-8 is endianness-independent.
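The points above can be sketched as follows. To avoid depending on the source file's encoding or on the C++20 change to u8 literals, this sketch spells "ANФГ" as explicit UTF-8 bytes; the function names are invented for the illustration, and an `std::ostringstream` stands in for a file stream.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// "ANФГ" as explicit UTF-8 bytes: 'A', 'N', D0 A4 (Ф), D0 93 (Г).
std::string sample_utf8() {
    return "AN\xD0\xA4\xD0\x93";
}

// Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8; std::string itself does no such check.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// operator<< copies the chars to the stream unchanged and in order,
// so whatever encoding is in the string is what ends up in the output.
std::string written_bytes(const std::string& s) {
    std::ostringstream out;
    out << s;
    return out.str();
}
```

Here `sample_utf8().size()` is 6 (chars), while `count_code_points(sample_utf8())` is 4, and `written_bytes` returns exactly the same 6 bytes that the string holds.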
