
How can I decode a utf-8 byte array to a string in Python2?

I have an array of bytes representing a utf-8 encoded string. I want to decode these bytes back into the string in Python2. I am relying on Python2 for my overall program, so I cannot switch to Python3.

array = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

-> Café Flora

Since a character in the string I want is not necessarily represented by exactly 1 byte in the array, I cannot use a solution like:

"".join(map(chr, array))
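To make the failure concrete, here is a small sketch (the names `signed`, `failed`, and `per_byte` are illustrative and not from the original post). `chr` rejects negative values outright, and even with the values mapped into 0..255 it yields one character per byte rather than one per encoded character.

```python
signed = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

# chr() only accepts non-negative values, so the signed bytes fail immediately.
try:
    "".join(map(chr, signed))
    failed = False
except ValueError:
    failed = True

# Even with the values mapped into 0..255, each byte becomes its own
# character, so the two-byte sequence for "é" turns into two characters.
per_byte = [chr(b & 0xFF) for b in signed]
```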

I tried to create a function that would step through the array and, whenever it encounters a number outside the range 0-127 (ASCII), create a new 16-bit int, shift the current bits 8 to the left, and then add the following byte using a bitwise OR. Finally, it would use unichr() to decode it.

def decode_bytes(byte_array):
    result = []
    for i in range(len(byte_array)):
        x = byte_array[i]
        if x < 0:
            b16 = x & 0xFFFF  # 16 bit
            b16 = b16 << 8
            b16 = b16 | byte_array[i+1]
            result.append(unichr(b16))
        else:
            result.append(chr(x))
    return "".join(result)

However, this was unsuccessful.
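For reference, a two-byte UTF-8 sequence does not pack its payload as 8 + 8 bits, so an 8-bit shift cannot recover the code point. The sketch below shows the bit layout for the two-byte case only, with illustrative names (`lead`, `cont`, `code_point`); longer sequences use different masks, so this is not a general decoder.

```python
# The two negative entries from the array, viewed as unsigned bytes.
lead, cont = -61 & 0xFF, -87 & 0xFF   # 0xC3, 0xA9

# A two-byte UTF-8 sequence has the form 110xxxxx 10yyyyyy: five payload
# bits in the lead byte and six in the continuation byte.
code_point = ((lead & 0x1F) << 6) | (cont & 0x3F)
```

The result is 0xE9, the code point of "é".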

The following article explains the issue very well, and includes a Node.js solution:

http://ixti.net/development/node.js/2011/10/26/get-utf-8-string-from-array-of-bytes-in-node-js.html

Use the little-used array module to convert your input to a bytestring, and then decode it with the UTF-8 codec:

import array
decoded = array.array('b', your_input).tostring().decode('utf-8')
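If the same snippet also has to run under Python 3, note that `tostring()` is the Python 2 spelling and newer Python 3 releases only provide `tobytes()`. A hedged sketch that picks whichever method exists (the sample list and the name `raw` are illustrative):

```python
import array

your_input = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
buf = array.array('b', your_input)   # 'b' = signed char, so -61 is accepted

# Python 2 arrays expose tostring(); Python 3.2+ arrays expose tobytes().
raw = buf.tobytes() if hasattr(buf, 'tobytes') else buf.tostring()
decoded = raw.decode('utf-8')
```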

Keep in mind that a "string" in Python2 is not proper text, just a sequence of bytes in memory, which happens to map to characters when you "print" it. If the encoding of the intended characters in the byte sequence matches the one used by the terminal, you will see properly formatted text.

If your terminal is not UTF-8, even if you get the proper byte string in memory, just printing it would show you the wrong results. That is why the extra "decode" step is needed at the end of the following expression:

text = b''.join(chr(i if i >= 0 else 256 + i) for i in array).decode('utf-8')

As your source encoded the numbers between 128 and 255 as negative numbers, the inline "if" expression renormalizes each value before calling "chr".
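An equivalent way to express the renormalization is a bit mask. The following is a minimal sketch, assuming the input uses the usual two's-complement signed-byte convention; it uses `bytearray`, which accepts a list of ints on both Python 2.6+ and Python 3 (the names `signed` and `unsigned` are illustrative).

```python
signed = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

# b & 0xFF maps -128..-1 onto 128..255 and leaves 0..127 unchanged.
unsigned = [b & 0xFF for b in signed]
text = bytearray(unsigned).decode('utf-8')
```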

Just to be clear: you say "Since every character in the string I want is not necessarily represented by exactly 1 byte in the array". With Python 2.x byte strings, what takes care of that is the terminal anyway. If you want to deal with proper text, then after joining your numbers into a (byte) string, use the "decode" method. This is the part that knows about UTF-8 multi-byte encoded characters and gives you back a (text) string object (a 'unicode' object in Python 2) that treats each character as an entity.
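The difference between the byte string and the decoded text shows up in their lengths. A small sketch with illustrative names (`raw`, `byte_count`, `char_count`):

```python
raw = bytearray([67, 97, 102, 0xC3, 0xA9, 32, 70, 108, 111, 114, 97])
text = raw.decode('utf-8')

byte_count = len(raw)    # counts bytes: "é" contributes two
char_count = len(text)   # counts characters: "é" contributes one
```

Here the byte sequence has 11 elements while the decoded text has 10 characters.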

You can use struct.pack for this:

>>> import struct
>>> a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
>>> struct.pack("b"*len(a), *a)
'Caf\xc3\xa9 Flora'
>>> print struct.pack("b"*len(a), *a).decode('utf8')
Café Flora
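The format string can also carry a repeat count instead of repeating the "b" character. A sketch of the same idea as a script rather than an interactive session (the names `packed` and `text` are illustrative):

```python
import struct

a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

# "11b" means eleven signed chars; it is equivalent to "b" * 11.
packed = struct.pack("%db" % len(a), *a)
text = packed.decode('utf8')
```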
