What is the internal representation of a string in Python 3.x?

Question

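In Python 3.x, a string consists of items of Unicode ordinal. (See the quotation from the language reference below.) What is the internal representation of a Unicode string? Is it UTF-16?

The items of a string object are Unicode code units. A Unicode code unit is represented by a string object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time). Surrogate pairs may be present in the Unicode object, and will be reported as two separate items.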
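
A quick way to see which case applies on a given interpreter is to look at sys.maxunicode and at how many items a character outside the BMP occupies. This is only a minimal sketch; the helper name describe_unicode_build is made up for illustration and is not part of any library.

```python
import sys

def describe_unicode_build():
    """Report how this interpreter exposes code points above U+FFFF."""
    non_bmp = "\U0001F600"  # one code point outside the Basic Multilingual Plane
    return {
        "maxunicode": sys.maxunicode,
        # 1 where a full code point is one item,
        # 2 where it is reported as a surrogate pair
        "items_for_non_bmp": len(non_bmp),
    }

print(describe_unicode_build())
```

If maxunicode is 0xFFFF, the quoted paragraph's remark about surrogate pairs being reported as two separate items applies to that interpreter.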

Answer 1

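The internal representation will change in Python 3.3, which implements PEP 393. The new representation will pick one or several of ascii, latin-1, utf-8, utf-16, utf-32, generally trying to get a compact representation.

Implicit conversions into surrogate pairs will only be done when talking to legacy APIs (those only exist on Windows, where wchar_t is two bytes); the Python string will be preserved. Here are the release notes.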

Answer 2

In Python 3.3 and above, the internal representation of the string will depend on the string, and can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393.
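
As a rough illustration of "depends on the string": under PEP 393 the width is chosen from the highest code point present. The sketch below only models that rule in pure Python; expected_width is a hypothetical helper, not a CPython API, and it does not inspect the real object layout.

```python
def expected_width(s):
    """Bytes per code point a PEP 393 style layout would need for s (1, 2 or 4)."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:      # fits in latin-1
        return 1
    if highest < 0x10000:    # fits in UCS-2
        return 2
    return 4                 # needs UCS-4

for sample in ("abc", "caf\u00e9", "\u20ac5", "hi \U0001F600"):
    print(repr(sample), expected_width(sample))
```

Note that a single wide character is enough to widen the whole string, since every code point in one string uses the same width.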

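For previous Pythons, the internal representation depends on the build flags of Python. Python can be built with flag values --enable-unicode=ucs2 or --enable-unicode=ucs4. ucs2 builds do in fact use UTF-16 as their internal representation, and ucs4 builds use UCS-4 / UTF-32.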
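
To make the UTF-16 point concrete, here is a small sketch of how a code point above U+FFFF splits into the two 16-bit units that a ucs2 build would store as separate items. The function name to_surrogate_pair is made up for this example; the arithmetic is the standard UTF-16 rule.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against the codec: the two units are exactly the UTF-16 bytes.
assert "\U0001F600".encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```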

Answer 3

Looking at the source code for CPython 3.1.5, in Include/unicodeobject.h:

/* --- Unicode Type ------------------------------------------------------- */

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;            /* Raw Unicode buffer */
    long hash;                  /* Hash value; -1 if not set */
    int state;                  /* != 0 if interned. In this case the two
                                 * references from the dictionary to this object
                                 * are *not* counted in ob_refcnt. */
    PyObject *defenc;           /* (Default) Encoded version as Python
                                   string, or NULL; this is used for
                                   implementing the buffer protocol */
} PyUnicodeObject;

The characters are stored as an array of Py_UNICODE. On most platforms, I believe Py_UNICODE is #defined as wchar_t.
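
Since wchar_t differs between platforms, its size can be checked from Python itself. This sketch only reports the platform's wchar_t through ctypes; whether Py_UNICODE is the same type on a particular build is the answer's assumption, not something the snippet verifies.

```python
import ctypes

# Size in bytes of the C wchar_t type on this platform
# (typically 2 on Windows and 4 on most Unix-like systems).
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
print("wchar_t is", wchar_bytes, "bytes wide")
```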

Answer 4

It depends: see here. This still applies to Python 3 as far as the internal representation goes.

Answer 5

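The internal representation varies from latin-1, UCS-2 to UCS-4. UCS means that the representation is 2 or 4 bytes long and the Unicode code units are numerically equal to the corresponding code points. We can check this by finding where the size of the code unit changes.

To show that they range from 1 byte for latin-1 to 4 bytes for UCS-4: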

>>> from sys import getsizeof
>>> getsizeof('')
49
>>> getsizeof('a')  #------------------ + 1 byte as the representation here is latin-1
50
>>> getsizeof('\U0010ffff')
80
>>> getsizeof('\U0010ffff\U0010ffff') # + 4 bytes as the representation here is UCS-4
84

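We can check that at the beginning the representation is indeed latin-1 and not UTF-8, as the change to a 2-byte code unit happens at the byte boundary and not at the '\U0000007f' - '\U00000080' boundary, as it would in UTF-8: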

>>> getsizeof('\U0000007f')  
50
>>> getsizeof('\U00000080') #----------The size of the string changes at \x7f - \x80 boundary but..
74
>>> getsizeof('\U00000080\U00000080') # ..the size of the code-unit is still one. so not UTF-8
75

>>> getsizeof('\U000000ff')  
74
>>> getsizeof('\U000000ff\U000000ff')# (+1 byte)    
75
>>> getsizeof('\U00000100')  
76
>>> getsizeof('\U00000100\U00000100') # Size change at byte boundary(+2 bytes). Rep is UCS-2.             
78

>>> getsizeof('\U0000ffff') 
76
>>> getsizeof('\U0000ffff\U0000ffff') # (+ 2 bytes)
78
>>> getsizeof('\U00010000')            
80
>>> getsizeof('\U00010000\U00010000') # (+ 4 bytes) The size of the code unit changes to 4 at byte boundary again.
84
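
The same measurement can be automated instead of comparing sizes by hand. A minimal sketch, assuming CPython 3.3 or later (sys.getsizeof is implementation-specific and the absolute sizes vary between versions, so only the growth per character is used); bytes_per_char is a name invented for this example.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate storage per character from how getsizeof grows with length."""
    short = ch * 2            # both strings are freshly built, so they carry
    long_ = ch * (n + 2)      # the same fixed header and differ only in payload
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n

# Probe each side of the boundaries examined above.
for ch in ("\x7f", "\x80", "\xff", "\u0100", "\uffff", "\U00010000"):
    print("U+%06X" % ord(ch), bytes_per_char(ch))
```

On such an interpreter the width should step up only at U+0100 and U+10000, which matches the latin-1 / UCS-2 / UCS-4 reading rather than UTF-8.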

Answer 6

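I think it is hard to judge the difference between UTF-16, which is just a sequence of 16-bit words, and Python's string object.

And if Python is compiled with the Unicode=UCS4 option, the comparison would be between UTF-32 and a Python string.

So it is better to consider them as belonging to different categories, although you can convert one into the other.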

Answer 7

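There has been NO CHANGE in the Unicode internal representation between Python 2.X and 3.X.

It is definitely NOT UTF-16. UTF-anything is a byte-oriented EXTERNAL representation.

Each code unit (character, surrogate, etc.) has been assigned a number from range(0, 2 ** 21). This is called its "ordinal".

Really, the documentation you quoted says it all. Most Python binaries use 16-bit ordinals, which restricts you to the Basic Multilingual Plane ("BMP") unless you want to muck about with surrogates (handy if you can't find your hair shirt and your bed of nails is off being de-rusted). For working with the full Unicode repertoire, you would prefer a "wide build" (32 bits wide).

Briefly, the internal representation in a unicode object is an array of 16-bit unsigned integers, or an array of 32-bit unsigned integers (using only 21 bits).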
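
The distinction between ordinals and external encodings can be seen directly. A short sketch (the sample string is arbitrary): the ordinals are plain integers, while each UTF encoding turns the same text into a different byte sequence.

```python
import sys

s = "A\u00e9\u20ac\U0001F600"  # 'A', 'é', '€' and one character outside the BMP

# Ordinals are integers attached to the items of the string,
# independent of any byte encoding.
ordinals = [ord(ch) for ch in s]
print([hex(o) for o in ordinals])
assert all(0 <= o <= sys.maxunicode < 2 ** 21 for o in ordinals)

# External representations: same text, three different byte sequences.
encoded = {name: s.encode(name) for name in ("utf-8", "utf-16-le", "utf-32-le")}
for name, data in encoded.items():
    print(name, len(data), data.hex())
    assert data.decode(name) == s  # each one round-trips to the same str
```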

Answer 8

>>> import array; s = 'Привет мир!'; b = array.array('u', s).tobytes(); print(b); print(len(s) * 4 == len(b))
b'\x1f\x04\x00\x00@\x04\x00\x008\x04\x00\x002\x04\x00\x005\x04\x00\x00B\x04\x00\x00 \x00\x00\x00<\x04\x00\x008\x04\x00\x00@\x04\x00\x00!\x00\x00\x00'
True
>>> import array; s = 'test'; b = array.array('u', s).tobytes(); print(b); print(len(s) * 4 == len(b))
b't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'
True
>>>
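
Note that the 4 bytes per character above come from array's 'u' type code, which corresponds to the platform's wchar_t, so the result reflects that C type rather than how the str itself is stored. As a hedged illustration, assuming a little-endian machine with a 4-byte wchar_t (which is what the output above suggests), an explicit UTF-32 encoding produces the same bytes:

```python
s = 'Привет мир!'

# Explicit conversion to little-endian UTF-32; this does not depend on
# wchar_t or on the interpreter's internal string layout.
b = s.encode('utf-32-le')
print(b)
print(len(s) * 4 == len(b))
```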

8 answers

Answer 1
32 2012-01-31 13:03:36

Answer 2
18 2015-02-28 20:32:55

Answer 3
9 2009-12-03 09:25:36

Answer 4
3 2009-12-03 07:12:22

Answer 5
1 2020-08-19 17:54:18

Answer 6
0 2009-12-03 07:18:44

Answer 7
0 accepted 2009-12-03 07:37:52

Answer 8
0 2018-08-18 13:29:24