Best way to convert string to bytes in Python 3?

Question

"TypeError: 'str' does not support the buffer interface" suggests two possible ways to convert a string to bytes:

b = bytes(mystring, 'utf-8')

b = mystring.encode('utf-8')

Which method is more Pythonic?

Answer 1

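If you look at the docs for bytes, they point you to bytearray:

bytearray([source[, encoding[, errors]]])

Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has; see Bytes and Byte Array Methods.

The optional source parameter can be used to initialize the array in a few different ways:

If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().

If it is an integer, the array will have that size and will be initialized with null bytes.

If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.

If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.

Without an argument, an array of size 0 is created.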
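The initialization modes listed above apply to the bytes constructor as well as to bytearray. The following is a minimal sketch of each mode; the sample values are illustrative and not from the original answer.

```python
# Each call exercises one of the source-argument forms described above.
from_text = bytes("héllo", "utf-8")      # str: an encoding must be given
from_size = bytes(3)                     # int: that many null bytes
from_buffer = bytes(bytearray(b"abc"))   # buffer-protocol object: contents are copied
from_iterable = bytes([104, 105])        # iterable of ints in range(256)
empty = bytes()                          # no argument: zero-length result

print(from_text, from_size, from_buffer, from_iterable, empty)
```

Calling bytes("héllo") without an encoding raises TypeError, which is why the string form always needs the second argument.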

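So bytes can do much more than just encode a string. It is Pythonic that it lets you call the constructor with any type of source parameter that makes sense.

For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting: "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding), since there is no explicit verb when you use the constructor.

Edit: I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you are just skipping a level of indirection if you call encode yourself.

Also, see Serdalis' comment: unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding), and symmetry is nice.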
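To make the equivalence and the symmetry concrete, here is a small sketch. The sample text is an arbitrary choice for illustration.

```python
text = "naïve café"

via_constructor = bytes(text, "utf-8")
via_method = text.encode("utf-8")

# Both spellings yield the same bytes object.
assert via_constructor == via_method

# decode() is the inverse of encode() when the same encoding is used.
round_trip = via_method.decode("utf-8")
assert round_trip == text

print(via_method)
```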

Answer 2

This is easier than you might think:

my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
type(my_str_as_bytes) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
type(my_decoded_str) # ensure it is string representation

Answer 3

The absolute best way is neither of the two, but a third. The first parameter to encode has defaulted to 'utf-8' ever since Python 3.0. Thus the best way is

b = mystring.encode()

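This will also be faster, because the default argument results not in the string "utf-8" in the C code, but in NULL, which is much faster to check!

Here are some timings: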

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

Despite the warning, the times were very stable across repeated runs; the deviation was only about 2 percent.
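The timings above come from IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The number and repeat values below are arbitrary choices, and the results depend on the interpreter version and the machine, so the gap reported above may not reproduce.

```python
import timeit

# Take the best of several repeats to reduce noise from other processes.
CALLS = 100_000
explicit = min(timeit.repeat("'abc'.encode('utf-8')", number=CALLS, repeat=5))
default = min(timeit.repeat("'abc'.encode()", number=CALLS, repeat=5))

print(f"encode('utf-8'): {explicit / CALLS * 1e9:.0f} ns per call")
print(f"encode():        {default / CALLS * 1e9:.0f} ns per call")
```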

Using encode() without an argument is not Python 2 compatible, because in Python 2 the default character encoding is ASCII.

>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Answer 4

You can simply convert a string to bytes using:

a_string.encode()

and you can simply convert bytes to a string using:

some_bytes.decode()

bytes.decode and str.encode have encoding='utf-8' as the default value.

The following functions (taken from Effective Python) might be useful for converting str to bytes and bytes to str:

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode() # uses 'utf-8' for encoding
    else:
        value = bytes_or_str
    return value # Instance of bytes


def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode() # uses 'utf-8' for decoding
    else:
        value = bytes_or_str
    return value # Instance of str
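A short usage sketch for these helpers follows. The functions are restated compactly so the snippet runs on its own, and the sample values are illustrative.

```python
def to_bytes(bytes_or_str):
    # Encode str as UTF-8; pass bytes through unchanged.
    return bytes_or_str.encode() if isinstance(bytes_or_str, str) else bytes_or_str


def to_str(bytes_or_str):
    # Decode bytes as UTF-8; pass str through unchanged.
    return bytes_or_str.decode() if isinstance(bytes_or_str, bytes) else bytes_or_str


# Either input type is accepted, and the output type is always the same.
print(to_bytes("héllo"), to_bytes(b"raw"))
print(to_str(b"h\xc3\xa9llo"), to_str("text"))
```

Helpers like these are handy at the boundary of a program, where input may arrive as either type and the rest of the code wants exactly one.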

Answer 5

Answering a slightly different problem:

You have a sequence of raw unicode that was saved into a str variable:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

You need to be able to get the byte literal of that unicode (for struct.unpack(), etc.)

s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

Solution:

s_new: bytes = bytes(s_str, encoding="raw_unicode_escape")

Reference (scroll up for standard encodings):

Python Specific Encodings
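The sketch below shows why the choice of encoding matters for this kind of data. The struct format string is a hypothetical layout chosen only for illustration; the original answer does not say how the bytes are structured.

```python
import struct

s_str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

raw = bytes(s_str, encoding="raw_unicode_escape")
latin1 = s_str.encode("latin-1")
utf8 = s_str.encode("utf-8")

# raw and latin1 keep one byte per character; UTF-8 expands "\xc0" to two bytes.
print(len(raw), len(latin1), len(utf8))

# Hypothetical layout: four big-endian uint16 values followed by one uint8.
fields = struct.unpack(">HHHHB", raw)
print(fields)
```

For strings whose code points are all below 256, as here, latin-1 produces the same bytes. The two differ above that range: raw_unicode_escape emits \uXXXX escape sequences, while latin-1 raises UnicodeEncodeError.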

Answer 6

so_string = 'stackoverflow'
so_bytes = so_string.encode()

Answer 7

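How about the Python 3 'memoryview' way?

memoryview is a sort of mishmash of bytes/bytearray and the struct module, with several benefits.

Not limited to text and bytes; it handles 16-bit and 32-bit words too
Copes with endianness
Provides a very low overhead interface to linked C/C++ functions and data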
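The low overhead comes from memoryview exposing an existing buffer rather than copying it. A minimal sketch, using a bytearray with illustrative contents:

```python
data = bytearray(b"hello world")
view = memoryview(data)

# Slicing a memoryview creates another view, not a copy of the bytes.
tail = view[6:]
tail[0] = ord("W")  # writes through to the underlying bytearray

print(data)
print(bytes(tail))
```

Because the slice shares memory with data, the change made through tail is visible in the original bytearray.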

The simplest example, for a byte array:

memoryview(b"some bytes").tolist()

[115, 111, 109, 101, 32, 98, 121, 116, 101, 115]

Or for a unicode string (which is converted to a byte array):

memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).tolist()

[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]

#Another way to do the same
memoryview("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020".encode("UTF-16")).tolist()

[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]

Perhaps you need words rather than bytes?

memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).cast("H").tolist()

[65279, 117, 110, 105, 99, 111, 100, 101, 32]

memoryview(b"some  more  data").cast("I").tolist()  # "I" is a 4-byte unsigned int on common platforms; "L" can be 8 bytes

[1701670771, 1869422624, 538994034, 1635017060]

A word of caution: beware of multiple interpretations of byte order when the data is wider than one byte:

txt = "\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020"
for order in ("", "BE", "LE"):
    mv = memoryview(bytes(txt, f"UTF-16{order}"))
    print(mv.cast("H").tolist())

[65279, 117, 110, 105, 99, 111, 100, 101, 32]
[29952, 28160, 26880, 25344, 28416, 25600, 25856, 8192]
[117, 110, 105, 99, 111, 100, 101, 32]

I am not sure whether that is intentional or a bug, but it caught me out!

The example uses UTF-16; for a full list of codecs, see the codec registry in Python 3.10.
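The surprising middle row in the output above is consistent with how cast() works: it accepts only native struct formats, so each 16-bit word is read in the machine's own byte order, whatever order the codec wrote. When the byte order of the data is known, one option is to unpack it with an explicit order using the struct module, as in this sketch:

```python
import struct
import sys

txt = "unicode "
raw_be = txt.encode("UTF-16BE")

# cast("H") reads each pair of bytes in the machine's native order.
native = memoryview(raw_be).cast("H").tolist()

# The ">" prefix makes struct read big-endian regardless of the machine.
explicit = list(struct.unpack(f">{len(raw_be) // 2}H", raw_be))

print(sys.byteorder, native, explicit)
```

On a little-endian machine the two lists differ, while on a big-endian machine they match. The plain "UTF-16" codec avoids the ambiguity by writing in native order and prepending a byte order mark, which is the 65279 seen in the earlier output.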


Answer 1: 735, accepted, 2011-09-28 15:27:58
Answer 2: 534, 2013-07-06 07:09:28
Answer 3: 214, 2017-07-23 20:35:05
Answer 4: 40, 2017-09-04 12:42:51
Answer 5: 30, 2021-01-24 18:38:13
Answer 6: 10, 2017-04-05 16:16:21
Answer 7: 2, 2022-03-25 17:28:05
