Converting a Python 3 bytes string passed through `str(utf8_encoded_str)` back to Unicode

Question

Let me introduce the problem first.

I received some data via POST/GET requests. The data were UTF-8 encoded byte strings. I did not realize that and converted them with a plain str() call. Now I have a whole database of "nonsense data" and cannot find a way back.

Example code:

unicode_str - the string I should obtain

encoded_str - the string I got from the POST/GET requests (the initial data)

bad_str - the data I currently have in the database, from which I need to recover the Unicode text

So I know how to convert in the forward direction: unicode_str =( encode )=> encoded_str =( str )=> bad_str

But I could not come up with a way back: bad_str =( ??? )=> encoded_str =( decode )=> unicode_str

In [1]: unicode_str = 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [2]: unicode_str
Out[2]: 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [3]: encoded_str = unicode_str.encode("UTF-8")

In [4]: encoded_str
Out[4]: b'P\xc5\x99\xc3\xadli\xc5\xa1 \xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd k\xc5\xaf\xc5\x88 \xc3\xbap\xc4\x9bl \xc4\x8f\xc3\xa1belsk\xc3\xa9 \xc3\xb3dy'

In [5]: bad_str = str(encoded_str)

In [6]: bad_str
Out[6]: "b'P\\xc5\\x99\\xc3\\xadli\\xc5\\xa1 \\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd k\\xc5\\xaf\\xc5\\x88 \\xc3\\xbap\\xc4\\x9bl \\xc4\\x8f\\xc3\\xa1belsk\\xc3\\xa9 \\xc3\\xb3dy'"

In [7]: new_encoded_str = some_magical_function_here(bad_str)  # ???

Answer 1

You converted a bytes object to a string, and the result is just the textual representation (repr) of that bytes object. You can recover the original bytes object with ast.literal_eval() (credit to Mark Tolonen for the suggestion); a simple decode() then does the job.

>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'

Since you generated the strings yourself, using eval() would be safe, but why not be safer?
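To illustrate the "safer" point, here is a minimal sketch that wraps the approach above in a helper and shows that ast.literal_eval() refuses anything that is not a plain literal. The name recover_text and its encoding parameter are assumptions made for this example, not part of the original answer.

```python
import ast


def recover_text(bad_str: str, encoding: str = "utf-8") -> str:
    """Undo str(bytes_obj): parse the bytes literal, then decode it.

    recover_text is a hypothetical helper name used for illustration.
    """
    value = ast.literal_eval(bad_str)
    if not isinstance(value, bytes):
        raise ValueError("input is not the repr of a bytes object")
    return value.decode(encoding)


original = "Příliš žluťoučký kůň"
bad_str = str(original.encode("utf-8"))
assert recover_text(bad_str) == original

# literal_eval only accepts literals, so a function call is rejected
# instead of being executed, which is what eval() would do.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as exc:
    print("rejected:", type(exc).__name__)
```

The isinstance check is a design choice for the database clean-up case: a row that happens to parse as some other literal (a number, say) raises an error rather than failing later in decode().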

Answer 2

Please do not use eval. Instead:

import codecs
s = 'žluťoučký'
x = str(s.encode('utf-8'))

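# strip the leading b' and the trailing quote
x = x[2:-1]

# unescape the \xNN sequences back to raw bytes, then decode as UTF-8
x = codecs.escape_decode(x)[0].decode('utf-8')

# check the round trip
assert x == s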
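The same strip-and-unescape idea can also be sketched with the documented latin-1 and unicode_escape codecs, which may be preferable if you would rather not rely on codecs.escape_decode (as far as I know it is not listed in the codecs module documentation). The helper name unmangle and the "looks like a bytes repr" guard are assumptions for this example; a value that legitimately starts with b' and ends with a quote would be misdetected.

```python
def unmangle(bad_str: str, encoding: str = "utf-8") -> str:
    """Reverse str(bytes_obj) without eval (hypothetical helper)."""
    # Only touch values that look like a bytes repr: b'...' or b"..."
    looks_mangled = (
        len(bad_str) >= 3
        and bad_str[0] == "b"
        and bad_str[1] in "'\""
        and bad_str[-1] == bad_str[1]
    )
    if not looks_mangled:
        return bad_str  # leave values that were never mangled untouched

    inner = bad_str[2:-1]
    # A bytes repr is pure ASCII, so latin-1 maps each character to one byte.
    # unicode_escape turns \xNN sequences into code points 0-255, and
    # latin-1 maps those code points back to the original raw bytes.
    raw = inner.encode("latin-1").decode("unicode_escape").encode("latin-1")
    return raw.decode(encoding)


rows = ["b'\\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd'", "already fine"]
print([unmangle(row) for row in rows])
```

Returning unmangled input unchanged makes the function safe to run over a whole column in which only some rows were damaged.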


2 answers

Answer 1
12, accepted, 2017-11-16 12:31:17

Answer 2
3, 2017-11-17 14:03:33
