简体   繁体   中英

Python uses three unicode characters to represent an asian fullstop? This is weird?

The python file:

# -*- coding: utf-8 -*-

print u"。" 
print [u"。".encode('utf8')]

Produces:

。
['\xe3\x80\x82']

Why does python use 3 characters to store my 1 fullstop? This is really strange, if you print each one out individually, they are all different as well. Any ideas?

In UTF-8, three bytes (not really characters) are used to represent code points between U+07FF and U+FFFF, such as this character, IDEOGRAPHIC FULL STOP (U+3002).

Try dumping the script file with od -x . You should find the same three bytes used to represent the character there.

UTF-8是一种多字节字符表示形式,因此非ASCII字符将占用一个以上的字节。

Looks correctly UTF-8 encoded to me. See here for an explanation about UTF-8 encoding.

The latest version of Unicode supports more than 109,000 characters in 93 different scripts. Mathematically, the minimum number of bytes you'd need to encode that number of code points is 3, since this is 17 bits' worth of information. (Unicode actually reserves a 21-bit range, but this still fits in 3 bytes.) You might therefore reasonably expect every character to need 3 bytes in the most straightforward imaginable encoding, in which each character is represented as an integer using the smallest possible whole number of bytes. (In fact, as pointed out by dan04, you need 4 bytes to get all of Unicode's functionality.)

A common data compression technique is to use short tokens to represent frequently-occurring elements, even though this means that infrequently-occurring elements will need longer tokens than they otherwise might. UTF-8 is a Unicode encoding that uses this approach to store text written in English and other European languages in fewer bytes, at the cost of needing more bytes for text written in other languages. In UTF-8, the most common Latin characters need only 1 byte (UTF-8 overlaps with ASCII for the convenience of English users), and other common characters need only 2 bytes. But some characters need 3 or even 4 bytes, which is more than they'd need in a "naive" encoding. The particular character you're asking about needs 3 bytes in UTF-8 by definition.

In UTF-16, it happens, this code point would need only 2 bytes, though other characters will need 4 (there are no 3-byte characters in UTF-16). If you are truly concerned with space efficiency, do as John Machin suggests in his comment and use an encoding that is designed to be maximally space-efficient for your language.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM