Python Strings, Default Encoding and Decoding (UTF-8?)

Based on my own reading (including this article), it seems that Python encodes with UTF-8 by default. Strings are read in under the assumption that they are UTF-8 encoded (more source).

Those strings are then translated to plain Unicode, using Latin-1, UCS-2, or UCS-4 for the entire string depending on the highest code point it encounters. This seems to match what I've seen in the terminal. The character Ǧ has Unicode code point 486 (U+01E6), so the narrowest representation that can hold it is UCS-2.

import sys

string1 = "Ǧ"
print(sys.getsizeof(string1))  # This prints 76
string1 = "Ǧa"
print(sys.getsizeof(string1))  # This prints 78, as if 'a' takes two bytes

string2 = "a"
print(sys.getsizeof(string2))  # This prints 50
string2 = "aa"
print(sys.getsizeof(string2))  # This prints 51, as if 'a' takes one byte

I have two questions. First, when printing to the terminal, how are strings encoded and decoded? If we call print(), are the strings first encoded to UTF-8 (from UCS-2 or Latin-1 in our examples), which the system then decodes to display on screen? Second, what's with the large initial size? Why do strings represented with Latin-1 have an initial size of 49, while strings with UCS-2 have an initial size of 74?

Thanks!

Most of your points are related to PEP 393: Flexible String Representation. While UTF-8 is used (on Python 3) as the default source code encoding, the default encoding for file I/O is locale based. The internal representation is ASCII, Latin-1, UCS-2 or UCS-4 (fixed-width code units of 1, 2 or 4 bytes; the last two are often loosely called UTF-16 and UTF-32), depending on the largest code point, possibly with a cached UTF-8 representation and/or a cached wchar_t representation for use with specific C APIs (deprecated APIs in the case of the wchar_t representation).
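To make the distinction between these encodings concrete, here is a minimal sketch that inspects the defaults on your own machine and captures the bytes print() actually emits. The helper name printed_bytes is made up for illustration; it wraps an in-memory buffer in io.TextIOWrapper, which is the same kind of text layer that normally sits on top of a console stream.

```python
import io
import locale
import sys


def printed_bytes(text, encoding):
    """Return the bytes print() writes for `text` on a stream using `encoding`.

    Hypothetical helper: it stands in for a terminal by wrapping a BytesIO
    in the text layer that performs the str -> bytes encoding step.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    print(text, end="", file=stream)
    stream.flush()
    return raw.getvalue()


# Three different "default" encodings; the last two depend on the platform.
print(sys.getdefaultencoding())            # default for str.encode() / bytes.decode()
print(locale.getpreferredencoding(False))  # locale-based encoding for text I/O
print(sys.stdout.encoding)                 # what print() encodes to on this stream

# U+01E6 is the character from the question (code point 486).
print(printed_bytes("\u01e6", "utf-8"))      # b'\xc7\xa6'
print(printed_bytes("\u01e6", "utf-16-le"))  # b'\xe6\x01'
```

The same str object produces different bytes depending only on the stream's encoding, which suggests the internal representation (Latin-1, UCS-2 or UCS-4) does not leak into the output.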

So to answer your questions: 那么回答你的问题:

  1. The terminal encoding, as noted, is platform dependent; the internal representation is re-encoded to whatever your platform requires and output as bytes.

  2. The change in base size between ASCII and UCS-2 strings comes from the flexible string representation using a larger baseline struct for non-ASCII strings (it needs additional space to store, for instance, a pointer to the cached UTF-8 encoding required by some C-level APIs), as well as more bytes per character.
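The second point can be checked empirically. The sketch below assumes CPython 3.3 or later (where PEP 393 applies) and uses a hypothetical helper, size_profile, that separates the fixed overhead from the per-character cost by comparing two freshly built strings of different lengths. The overhead values vary between CPython versions and builds, so only the per-character widths are annotated.

```python
import sys


def size_profile(ch, n=8):
    """Return (fixed_overhead, bytes_per_char) for strings made only of `ch`.

    Hypothetical helper, CPython-specific: it relies on sys.getsizeof()
    growing linearly with length for a given internal representation.
    """
    short = sys.getsizeof(ch * 2)
    long_ = sys.getsizeof(ch * (n + 2))
    per_char = (long_ - short) // n
    overhead = short - 2 * per_char  # struct header plus the trailing NUL
    return overhead, per_char


for label, ch in [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),   # é
    ("UCS-2", "\u01e6"),     # Ǧ, the character from the question
    ("UCS-4", "\U0001f600"), # a code point outside the BMP
]:
    overhead, per_char = size_profile(ch)
    print(label, "overhead:", overhead, "bytes per char:", per_char)

# Per-character widths on CPython 3.3+: 1, 1, 2 and 4 bytes respectively.
```

On the build used in the question, the reported overheads should come out as 49 for ASCII and 74 for UCS-2, matching the "initial sizes" the asker observed. A non-ASCII Latin-1 string still costs one byte per character but already pays for the larger struct.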
