
Cannot understand the 32-bit encoding of the “Python” string

I am reading the Unicode HOWTO in the Python docs to start really understanding Unicode. In the Encodings section, it shows the string "Python" represented as an array of 32-bit integers.

I don't understand why each character has so many 00s. For example, the character "P" is represented by 0x50 (which I understand, being the hex equivalent of the ASCII ordinal 80). But then it is followed by three pairs of 00s. What are those? How should I read this representation?

An array of 32-bit integers consists of, well, 32-bit integers.

A byte is 8 bits, so each 32-bit integer, and therefore each character in this representation, necessarily occupies 4 bytes.

For "P", the number is 0x00000050, which is stored as four bytes. You could order them 0x50 0x00 0x00 0x00 (least significant byte first, "little endian") or 0x00 0x00 0x00 0x50 (most significant byte first, "big endian"). Different CPUs make different choices for the order, as noted in the paragraph you link to.
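To make the two byte orders concrete, here is a minimal sketch using Python's built-in int.to_bytes. The names value, little, big and the helper spaced_hex are made up for this illustration and are not from the HOWTO.

```python
# Code point of "P", the same number the question asks about.
value = ord("P")  # 80, i.e. 0x50

# Store the same 32-bit number (4 bytes) in both byte orders.
little = value.to_bytes(4, byteorder="little")
big = value.to_bytes(4, byteorder="big")


def spaced_hex(data):
    """Hypothetical helper: show each byte as two hex digits."""
    return " ".join(f"{byte:02x}" for byte in data)


print(spaced_hex(little))  # 50 00 00 00
print(spaced_hex(big))     # 00 00 00 50

# The UTF-32 codecs with an explicit byte order produce the same bytes.
print("P".encode("utf-32-le") == little)  # True
print("P".encode("utf-32-be") == big)     # True
```

The number is identical in both cases. Only the order in which its four bytes are written out differs.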

If you think this is impractical: that paragraph is trying to explain exactly why it is, and why another encoding is typically preferred.
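A rough way to see the waste is to compare sizes. This sketch assumes the "other encoding" meant here is UTF-8, and the variable names are only for illustration.

```python
text = "Python"

# Fixed-width encoding: 4 bytes per character (big-endian, no byte order mark).
utf32 = text.encode("utf-32-be")
# Variable-width encoding: 1 byte per character for ASCII-range text.
utf8 = text.encode("utf-8")

print(len(utf32))      # 24
print(len(utf8))       # 6
print(utf32.count(0))  # 18 of the 24 bytes are zero
```

For text in the ASCII range, three quarters of the 32-bit representation is zero bytes.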

Instead of starting with that article, try "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", which manages to live up to its title pretty well.

There are so many zeroes because all of those letters are in the ASCII set, so each code point fits in one byte (two hexadecimal digits). Unicode is compatible with ASCII in this sense: the first 128 code points have the same values as in ASCII.

The remaining three bytes are just zero padding.
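As a quick check of this, the sketch below prints each letter's code point first as two hex digits and then padded to eight hex digits (32 bits). The name rows is hypothetical.

```python
text = "Python"

# For each letter: the character, its code point as one byte (two hex
# digits), and the same value zero-padded to 32 bits (eight hex digits).
rows = [(ch, f"{ord(ch):#04x}", f"{ord(ch):#010x}") for ch in text]

for ch, narrow, wide in rows:
    print(ch, narrow, wide)

# Every letter is in the ASCII range, so the upper three bytes are all zero.
print(all(ord(ch) < 0x80 for ch in text))  # True
```

The first row reads P 0x50 0x00000050. The extra digits carry no information for these letters.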

It is kind of like taking a variable declared as an (unsigned) byte and copying it to an (unsigned) int32: you will get a lot of zeroes in the latter, because it is a bigger type.
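Python has no fixed-width integer types of its own, so this sketch imitates the analogy with the standard struct module. It assumes little-endian layout ("<"), and the names small and wide are illustrative.

```python
import struct

value = 0x50

# "B" is an unsigned 8-bit integer, "I" an unsigned 32-bit integer;
# the "<" prefix selects little-endian byte order with standard sizes.
small = struct.pack("<B", value)
wide = struct.pack("<I", value)

print(len(small), small)  # 1 b'P'
print(len(wide), wide)    # 4 b'P\x00\x00\x00'

# Reading the wide form back gives the same number: the zeroes are padding.
print(struct.unpack("<I", wide)[0] == value)  # True
```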

