Encode Bytearray into UTF-8

So, in Python 2.7 I have a string:

Python 2.7.8 (default, Apr 15 2015, 09:26:43) 
[GCC 4.9.2 20150212 (Red Hat 4.9.2-6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrypt
>>> s=scrypt.encrypt('somestring', 'test'.encode('ascii'), 0.1)
>>> s
'scrypt\x00\r\x00\x00\x00\x08\x00\x00\x00\x016 \xf2\xcc\xf9\xd2\xbe\xd4\xdbU!\xaf\xecKk{\x8b\r\x94\xe8\x11\xf2\x00\x1f\xd9\xceBhf$cM\x12{\xd8\x84\\\xf2j`\xba\xc5Xk\x196)\xf5\xd3\xe9\x15\xdd\xd3\xa0A_K\x00\x18\x03J\x85\xee\n\xcc\xea\x86\xda\xaa\xfd6E\xf4\x804\xfe\x04\xca\xec!\x94F\x84)B\tf\x07\xd9!@B,\x9e\xffc\xf2\xb6e\x8c\xa9HA\x98\x99\xa0\xe8\xcf\x85P2\x13\x0f\xa1\xf6\x90nO\x85Z\xb2\xc1'
>>> type(s)
<type 'str'>

(It's real ugly.)

I need to encode it into text - either a unicode object or a utf-8 string.

TypeError: You are required to pass either a unicode object or a utf-8 string here.
You passed a Python string object which contained non-utf-8:
'scrypt\x00\r\x00\x00\x00\x08\x00\x00\x00\x01\xce\xf5\xba\x19\xeb1z/5*`m\xec\xf6sgT4\xb5.\xf7^\x96\xfaMY6\xa0\xdb\t\xa3*<5A<\xfb\xbe\xfb>w\xa3,MjaX;\xc1r\xdc\xbd\x04W\xafq3O\x90\x19!\x13\xe8\x0c\x86\xf5\xc96\xf4K\x16\xe3^.v\x8a\xe0\xda\xdd>#\xa7\\\x1c\xc2\x11\x85\x01\xb5\xd4\x92\xef\xa1k\x05Z\xaey\xd7M`%5.\x9f\xb1\xc4\x11N\xdeY\xa2\xac=\r\n\xb4aM\xfd)\xcc$\xbbq\xaa\xfd\x9d \xa5\xd39|\x85\xc8\x95\xbc\xfa\x17\xa1\x8e\xb8\x81 \xb4\x9b>j'.
The UnicodeDecodeError that resulted from attempting to interpret it as utf-8 was:
'utf8' codec can't decode byte 0xce in position 20: invalid continuation byte

The problem is, it's outside of the range of UTF-8:

>>> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 18: ordinal not in range(128)

So: how should I go about encoding this string?

Bonus points if you can tell me why the ascii codec is the one raising an error there (and a UnicodeDecodeError, of all things) when I'm trying to encode a string.

(For the record, trying to encode as UTF-16 throws the exact same error.)

I've gotten it to work with base64 (which is, I suppose, what that's for), but I'm curious as to why I'm getting this error and what my options are.

You have binary data. Not text, and certainly not Unicode. You cannot encode this to UTF-8 as it is not a unicode (text) object.

Your UnicodeDecodeError is caused by Python trying to decode the data first; it is trying to be helpful, because normally you can only encode from Unicode to bytes. Since you tried to do this on bytes instead, it first needs to decode the bytes to Unicode, and it'll do that using the ASCII codec. But you don't have ASCII data, nor data in any other text encoding.
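To make the implicit step visible, here is a minimal sketch that performs the hidden ASCII decode explicitly. The `data` value is a shortened stand-in for the scrypt output, not the real ciphertext, and the snippet is written so that it should behave the same on Python 2.7 and Python 3:

```python
# Shortened stand-in for the scrypt output: arbitrary bytes, not text.
data = b'scrypt\x00\r\x00\xf2\xcc\xf9'

error = None
try:
    # Python 2 runs this decode implicitly when .encode() is called on a
    # byte string, before the requested target codec is ever used.
    data.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc

# Report which codec failed and where; the offending byte is the first
# one outside the 0-127 ASCII range.
print('%s codec failed at byte offset %d' % (error.encoding, error.start))
```

Because the failure happens in the decode step, the target encoding never matters, which would also explain why asking for UTF-16 produces the same error.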

You cannot make Unicode out of those bytes because they are not text. Your only option is to use a binary-to-text scheme like base64, which wraps binary data in a manner safe for transport through systems expecting text (and thus not supporting \x00 NUL bytes, \x0a newlines, or other bytes that have special meaning in text streams).

See the binascii module for the various binary-to-text schemes available in the Python standard library; base64 is the most widely used of these.
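As a rough illustration of two such schemes, the sketch below round-trips a byte string through Base64 (via the `base64` module) and through hexadecimal (via `binascii.hexlify`). The `raw` value is again a made-up stand-in for the encrypted blob, and the variable names are only for illustration:

```python
import base64
import binascii

# Made-up stand-in for the encrypted blob.
raw = b'scrypt\x00\r\x00\xf2\xcc\xf9'

# Both schemes emit only ASCII characters, so decoding the result as ASCII
# is safe and yields a text object.
b64_text = base64.b64encode(raw).decode('ascii')
hex_text = binascii.hexlify(raw).decode('ascii')

# Each scheme is reversible: the original bytes come back unchanged.
assert base64.b64decode(b64_text.encode('ascii')) == raw
assert binascii.unhexlify(hex_text.encode('ascii')) == raw

print('%d raw bytes -> %d Base64 chars, %d hex chars'
      % (len(raw), len(b64_text), len(hex_text)))
```

Hex doubles the size of the data while Base64 grows it by roughly a third, which is one reason Base64 tends to be preferred for larger payloads.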

The general answer is that you cannot: your generic binary data may contain byte sequences that are simply not valid UTF-8. However, depending on your application, maybe you could use a binary-to-text encoding such as Base64 to store the data wherever you need to, and then decode it upon retrieval?
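A minimal sketch of that store-then-retrieve idea follows. It assumes, purely for illustration, that the destination is a JSON document with a hypothetical `payload` field; the question does not say where the data actually needs to go:

```python
import base64
import json

# Made-up stand-in for the encrypted blob.
raw = b'scrypt\x00\r\x00\xf2\xcc\xf9'

# Store: Base64 turns the bytes into ASCII text that JSON can carry.
# The 'payload' key is a hypothetical field name.
record = json.dumps({'payload': base64.b64encode(raw).decode('ascii')})

# Retrieve: parse the JSON, then undo the Base64 step to get the bytes back.
loaded = json.loads(record)
restored = base64.b64decode(loaded['payload'].encode('ascii'))

assert restored == raw
```

The same pattern should apply to any text-only destination, such as a database text column or a form field: encode just before storing and decode just after reading.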

Refs:

https://en.wikipedia.org/wiki/Base64

https://docs.python.org/2/library/base64.html
