
Converting a Python 3 string of bytes, `str(utf8_encoded_str)`, back to unicode

Well, let me introduce the problem first.

I received some data via POST/GET requests. The data were UTF-8 encoded byte strings. I didn't realize that, and converted them with a plain str() call. Now I have a whole database of "nonsense data" and can't find a way back.

Example code:

unicode_str - this is the string I should obtain

encoded_str - this is the string I got from the POST/GET requests (the initial data)

bad_str - the data I currently have in the database, which I need to get unicode from

So I know how to convert in the forward direction: unicode_str =( encode )=> encoded_str =( str )=> bad_str

But I couldn't come up with a way back: bad_str =( ??? )=> encoded_str =( decode )=> unicode_str

In [1]: unicode_str = 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [2]: unicode_str
Out[2]: 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [3]: encoded_str = unicode_str.encode("UTF-8")

In [4]: encoded_str
Out[4]: b'P\xc5\x99\xc3\xadli\xc5\xa1 \xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd k\xc5\xaf\xc5\x88 \xc3\xbap\xc4\x9bl \xc4\x8f\xc3\xa1belsk\xc3\xa9 \xc3\xb3dy'

In [5]: bad_str = str(encoded_str)

In [6]: bad_str
Out[6]: "b'P\\xc5\\x99\\xc3\\xadli\\xc5\\xa1 \\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd k\\xc5\\xaf\\xc5\\x88 \\xc3\\xbap\\xc4\\x9bl \\xc4\\x8f\\xc3\\xa1belsk\\xc3\\xa9 \\xc3\\xb3dy'"

In [7]: new_encoded_str = some_magical_function_here(bad_str) ???

You turned a bytes object into a string, and that string is just the representation (repr) of the bytes object. You can recover the original bytes object with ast.literal_eval() (credit to Mark Tolonen for the suggestion), and then a simple decode() will do the job.

>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'
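
Since the question mentions a whole database of such values, here is a minimal sketch of applying the same idea row by row. The helper name `recover_text` and the policy of returning None for values that are not damaged bytes reprs are my own assumptions, not part of the original answer.

```python
import ast

def recover_text(bad_str):
    """Undo str(bytes): parse the bytes literal, then decode it as UTF-8.

    Returns None when the value is not a recoverable bytes repr
    (a hypothetical policy; adapt it to your own data).
    """
    try:
        value = ast.literal_eval(bad_str)
    except (ValueError, SyntaxError):
        return None  # not a Python literal at all
    if not isinstance(value, bytes):
        return None  # a valid literal, but not bytes (e.g. a number)
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        return None  # bytes, but not valid UTF-8

# Sample rows: one damaged value, one ordinary string, one number-like string.
rows = ["b'\\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd'", "plain text", "42"]
fixed = [recover_text(r) for r in rows]
```

Rows that come back as None can be left untouched, so values that were stored correctly are not mangled by the repair.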

Since you were the one who generated the strings, using eval() would be safe, but why not be safer?
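
To illustrate the difference, here is a small sketch: ast.literal_eval() only accepts Python literals, so a string containing an expression is rejected instead of being executed. The `suspicious` value is a made-up example of hostile input.

```python
import ast

# An expression rather than a literal (hypothetical hostile input).
suspicious = "__import__('os').getcwd()"

try:
    ast.literal_eval(suspicious)
    rejected = False
except ValueError:
    # literal_eval refuses function calls, attribute access, names, etc.
    rejected = True

# A genuine bytes literal is still accepted.
ok = ast.literal_eval("b'\\xc5\\xbe'")
```

With eval(), the first string would have been executed as code.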

Please do not use eval. Instead:

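import codecs
s = 'žluťoučký'
x = str(s.encode('utf-8'))

# strip the leading b' and the trailing quote
x = x[2:-1]

# unescape the \xNN sequences back into bytes, then decode them as UTF-8
# (escape_decode returns a tuple; the bytes are its first element)
x = codecs.escape_decode(x)[0].decode('utf-8')

# check the round trip
print(x == s)  # True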
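
As far as I know, codecs.escape_decode() is not listed in the codecs module documentation, so here is a sketch of the same unescaping using only the documented `unicode_escape` and `latin-1` codecs. It assumes the stored value really is the repr of a bytes object, which is always pure ASCII. The variable names are mine.

```python
s = 'žluťoučký'
bad = str(s.encode('utf-8'))   # the damaged form, as stored in the database

inner = bad[2:-1]              # drop the leading b' and the trailing quote

# unicode_escape turns each \xNN escape into the code point U+00NN;
# latin-1 then maps those code points one-to-one back to the bytes 0xNN.
raw = inner.encode('ascii').decode('unicode_escape').encode('latin-1')

restored = raw.decode('utf-8')
```

This is a swapped-in technique, not what the answer above uses; both should give the same bytes for any value produced by str() on a bytes object.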
