Convert Unicode sequences to a string in Python 3

Question

While parsing an HTML response to extract data with Python 3.4 on Kubuntu 15.10 in the Bash CLI, print() gives me output that looks like this:

\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

How would I output the actual text itself in my application?

This is the code generating the string:

response = requests.get(url)
messages = json.loads( extract_json(response.text) )

for k,v in messages.items():
    for message in v['foo']['bar']:
        print("\nFoobar: %s" % (message['body'],))

Here is the function which returns the JSON from the HTML page:

def extract_json(input_):

    """
    Get the JSON out of a webpage.
    The line of interest looks like this:
    foobar = ["{\"name\":\"dotan\",\"age\":38}"]
    """

    for line in input_.split('\n'):
        if 'foobar' in line:
            return line[line.find('"')+1:-2].replace(r'\"',r'"')

    return None

In googling the issue, I've found quite a bit of information relating to Python 2; however, Python 3 has completely changed how strings, and especially Unicode, are handled.

How can I convert the example string (\u05ea) to characters (ת) in Python 3?

Addendum:

Here is some information regarding message['body']:

print(type(message['body']))
# Prints: <class 'str'>

print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(repr(message['body']))
# Prints: '\\u05ea\\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'

print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין

Note that the last line does work as expected, but it has a few issues:

Decoding string literals with unicode-escape is the wrong thing, as Python escapes differ from JSON escapes for many characters. (Thank you bobince)
encode() relies on the default encoding, which is a bad thing. (Thank you bobince)
The encode() fails on some newer Unicode characters, such as \?\?, with UnicodeEncodeError "surrogates not allowed".

Answer 1

It appears your input uses backslash as an escape character; you should unescape the text before passing it to json:

>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש
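
If the line of interest really has the shape shown in the question's docstring (an assignment whose right-hand side is a JSON array holding one JSON-encoded string), another option may be to let the json module do the unescaping by decoding twice. This is only a sketch under that assumption; extract_json_text and the sample page are made-up names for illustration, not part of the original code:

```python
import json


def extract_json_text(page, marker="foobar"):
    """Return the JSON text embedded in a line like: foobar = ["{...}"]

    Assumes the right-hand side of the assignment is itself valid JSON
    (an array containing one JSON-encoded string).
    """
    for line in page.splitlines():
        if marker in line and "=" in line:
            rhs = line.split("=", 1)[1].strip().rstrip(";")
            return json.loads(rhs)[0]  # first decode: array -> inner JSON text
    return None


# Hypothetical sample page, built to match the docstring in the question.
page = 'var x = 1;\nfoobar = ["{\\"body\\":\\"\\\\u05e9\\"}"]\n'
inner = extract_json_text(page)
print(json.loads(inner)["body"])  # second decode: inner JSON text -> dict
```

The first json.loads removes the outer layer of backslash escaping, and the second one interprets the \uXXXX escapes according to JSON rules, so no regular expression or 'unicode-escape' step is involved.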

Don't use 'unicode-escape' encoding on JSON text; it may produce different results:

>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'

'😂' == '\U0001F602' is U+1F602 (FACE WITH TEARS OF JOY).
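
To make the difference concrete, here is a small sketch comparing the two decoders. The sample strings are made up for illustration. It shows that json.loads joins a surrogate pair into a single code point, while 'unicode-escape' leaves two lone surrogates (which is where "surrogates not allowed" comes from) and also treats non-ASCII bytes as Latin-1:

```python
import json

json_text = '"\\ud83d\\ude02"'  # JSON string holding a surrogate pair as escapes

good = json.loads(json_text)
bad = json_text.encode("ascii").decode("unicode-escape").strip('"')

print(len(good), len(bad))   # one code point vs. two lone surrogates
print(good == "\U0001F602")  # json.loads combined the pair

# Lone surrogates cannot be encoded as UTF-8.
try:
    bad.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)

# Non-ASCII input is mangled as well: the codec reads the bytes as Latin-1.
mangled = "é".encode("utf-8").decode("unicode-escape")
print(mangled == "Ã©")
```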
