
Python uses three Unicode characters to represent an Asian full stop? Is this weird?

The Python file:

# -*- coding: utf-8 -*-

print u"。" 
print [u"。".encode('utf8')]

Produces:

。
['\xe3\x80\x82']

Why does Python use 3 characters to store my 1 full stop? This is really strange; if you print each one out individually, they are all different as well. Any ideas?

In UTF-8, three bytes (not really characters) are used to represent code points from U+0800 through U+FFFF, such as this character, IDEOGRAPHIC FULL STOP (U+3002).
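To make the byte layout concrete, here is a minimal sketch in Python 3 (the question's snippet is Python 2, where `u"..."` literals and the `print` statement behave differently). The helper name `encode_three_byte` is made up for illustration; it hand-builds the three-byte pattern `1110xxxx 10xxxxxx 10xxxxxx` and compares the result with the built-in encoder.

```python
def encode_three_byte(code_point):
    """Hand-encode a code point in U+0800..U+FFFF as UTF-8.

    Hypothetical helper for illustration only; real code should use
    str.encode. Surrogates (U+D800..U+DFFF) are not rejected here.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point is outside the three-byte range")
    byte1 = 0b11100000 | (code_point >> 12)          # top 4 bits
    byte2 = 0b10000000 | ((code_point >> 6) & 0x3F)  # middle 6 bits
    byte3 = 0b10000000 | (code_point & 0x3F)         # low 6 bits
    return bytes([byte1, byte2, byte3])


full_stop = "\u3002"  # IDEOGRAPHIC FULL STOP
encoded = full_stop.encode("utf-8")

print(len(full_stop))       # 1 (one character / code point)
print(hex(ord(full_stop)))  # 0x3002
print(len(encoded))         # 3 (three bytes once encoded)
print(encode_three_byte(ord(full_stop)) == encoded)  # True
```

The string itself holds one character; the three values only appear once it is encoded to bytes, which is what `.encode('utf8')` in the question does.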

Try dumping the script file with `od -x`. You should find the same three bytes used to represent the character there.
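If `od` is not at hand, the same check can be sketched in Python 3. The temporary file and its contents below are stand-ins for the real script, not the asker's actual file. One caveat when reading `od -x` output: it groups bytes into two-byte words in host byte order, so on a little-endian machine adjacent bytes appear swapped; `od -t x1` prints them one at a time.

```python
import os
import tempfile

# Stand-in for the script from the question, saved as UTF-8.
source = '# -*- coding: utf-8 -*-\nprint("\u3002")\n'

with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".py", delete=False
) as handle:
    handle.write(source)
    path = handle.name

try:
    # Read the file back as raw bytes, with no decoding.
    with open(path, "rb") as raw_file:
        raw = raw_file.read()
finally:
    os.remove(path)

# Hex dump, one byte at a time.
print(" ".join(f"{byte:02x}" for byte in raw))
# The three bytes from the question appear in the file itself.
print(b"\xe3\x80\x82" in raw)
```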

UTF-8 is a multibyte character representation, so non-ASCII characters take up more than one byte.

This looks like correct UTF-8 encoding to me. See here for an explanation of UTF-8 encoding.

The latest version of Unicode (at the time of writing) supports more than 109,000 characters in 93 different scripts. Mathematically, the minimum number of bytes you'd need to encode that number of code points is 3, since this is 17 bits' worth of information. (Unicode actually reserves a 21-bit range, but this still fits in 3 bytes.) You might therefore reasonably expect every character to need 3 bytes in the most straightforward imaginable encoding, in which each character is represented as an integer using the smallest possible whole number of bytes. (In fact, as pointed out by dan04, you need 4 bytes to get all of Unicode's functionality.)
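The arithmetic can be checked with a short Python 3 sketch. The figure of 109,000 comes from the paragraph above; the helper `bytes_needed` and the three-byte "naive" encoding are hypothetical illustrations, not a real Unicode encoding form.

```python
def bytes_needed(bits):
    """Smallest whole number of bytes that can hold `bits` bits."""
    return (bits + 7) // 8


character_count = 109000   # figure quoted in the answer
max_code_point = 0x10FFFF  # top of the Unicode code space

bits_for_count = character_count.bit_length()
bits_for_range = max_code_point.bit_length()

print(bits_for_count, bytes_needed(bits_for_count))  # 17 3
print(bits_for_range, bytes_needed(bits_for_range))  # 21 3

# A made-up fixed-width encoding: each code point as a 3-byte integer.
naive = ord("\u3002").to_bytes(3, "big")
print(naive.hex())                             # 003002
print(int.from_bytes(naive, "big") == 0x3002)  # True
```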

A common data compression technique is to use short tokens to represent frequently occurring elements, even though this means that infrequently occurring elements will need longer tokens than they otherwise might. UTF-8 is a Unicode encoding that uses this approach to store text written in English and other European languages in fewer bytes, at the cost of needing more bytes for text written in other languages. In UTF-8, the most common Latin characters need only 1 byte (UTF-8 overlaps with ASCII for the convenience of English users), and other common characters need only 2 bytes. But some characters need 3 or even 4 bytes, which is more than they'd need in a "naive" encoding. The particular character you're asking about needs 3 bytes in UTF-8 by definition.
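A quick way to see the variable-length behaviour is to encode one character from each length class. This is a Python 3 sketch; the sample characters are chosen here for illustration and do not come from the question.

```python
samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
    "\u3002",      # U+3002, IDEOGRAPHIC FULL STOP
    "\U0001F600",  # U+1F600, an emoji outside the 16-bit range
]

utf8_lengths = [len(character.encode("utf-8")) for character in samples]

for character, length in zip(samples, utf8_lengths):
    print(f"U+{ord(character):04X} -> {length} byte(s) in UTF-8")

# ASCII characters encode to the same single byte in UTF-8 and ASCII.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```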

As it happens, in UTF-16 this code point needs only 2 bytes, though other characters need 4 (there are no 3-byte characters in UTF-16). If you are truly concerned with space efficiency, do as John Machin suggests in his comment and use an encoding designed to be maximally space-efficient for your language.
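The size comparison can be sketched in Python 3 as follows. GBK is used purely as an assumed example of a language-specific encoding; the thread does not say which encoding John Machin had in mind, and the sample text is made up.

```python
full_stop = "\u3002"         # IDEOGRAPHIC FULL STOP
outside_bmp = "\U0001F600"   # needs a surrogate pair in UTF-16

# "utf-16-be" is used instead of "utf-16" so that no byte order mark
# is prepended and the lengths reflect the character alone.
print(len(full_stop.encode("utf-16-be")))    # 2
print(len(outside_bmp.encode("utf-16-be")))  # 4

# Assumed example of a legacy encoding aimed at Chinese text.
print(len(full_stop.encode("gbk")))          # 2

sample_text = "\u4f60\u597d\u3002"  # three CJK characters, made up here
sizes = {
    name: len(sample_text.encode(name))
    for name in ("utf-8", "utf-16-be", "gbk")
}
print(sizes)
```

A legacy encoding like this only covers its own repertoire, so the saving comes at the cost of not being able to represent arbitrary Unicode text.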

