
Converting Unicode sequences to a string in Python 3

While parsing an HTML response to extract data with Python 3.4 on Kubuntu 15.10 (in the Bash CLI), print() gives me output that looks like this:

\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

How would I output the actual text itself in my application?

This is the code generating the string:

import json
import requests

response = requests.get(url)
messages = json.loads(extract_json(response.text))

for k,v in messages.items():
    for message in v['foo']['bar']:
        print("\nFoobar: %s" % (message['body'],))

Here is the function which returns the JSON from the HTML page:

def extract_json(input_):

    r"""
    Get the JSON out of a webpage.
    The line of interest looks like this:
    foobar = ["{\"name\":\"dotan\",\"age\":38}"]
    """

    for line in input_.split('\n'):
        if 'foobar' in line:
            return line[line.find('"')+1:-2].replace(r'\"',r'"')

    return None

In googling the issue, I've found quite a bit of information relating to Python 2; however, Python 3 has completely changed how strings, and especially Unicode, are handled.

How can I convert the example string ( \u05ea ) to characters ( ת ) in Python 3?

Addendum:

Here is some information regarding message['body']:

print(type(message['body']))
# Prints: <class 'str'>

print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(repr(message['body']))
# Prints: '\\u05ea\\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'

print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין

Note that the last line does work as expected, but it has a few issues:

  • Decoding string literals with unicode-escape is the wrong thing, as Python escapes differ from JSON escapes for many characters. (Thank you bobince)
  • encode() relies on the default encoding, which is a bad thing. (Thank you bobince)
  • encode() fails on some newer Unicode characters, such as \?\? , with UnicodeEncodeError "surrogates not allowed".
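The encoding pitfall in the list above can be reproduced in isolation. This is only a minimal sketch: the string `mixed` is a made-up example, not data from the question. In Python 3, str.encode() with no argument produces UTF-8, while the unicode-escape codec reads every non-escape byte as Latin-1, so any real non-ASCII character already present in the text gets mangled.

```python
# Hypothetical input: a real 'é' followed by a literal backslash-u escape.
mixed = 'caf\u00e9 \\u05ea'

# encode() yields UTF-8 bytes; unicode-escape then decodes them as Latin-1.
naive = mixed.encode().decode('unicode-escape')

print(naive)  # the escape is decoded, but 'é' has become two characters
```

The escape sequence itself is decoded correctly, which is why the approach appears to work on pure-ASCII input.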

It appears your input uses backslash as an escape character; you should unescape the text before passing it to json:

>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש
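Applied to the question's extract_json(), the same unescaping step might look like the sketch below. It assumes the page line has the shape described in the question's docstring; the `page` value is a hypothetical stand-in for the real response text.

```python
import json
import re


def extract_json(input_):
    """Return the JSON text from the 'foobar' line, one escaping level removed."""
    for line in input_.split('\n'):
        if 'foobar' in line:
            # Text between the first double quote and the trailing '"]'.
            escaped = line[line.find('"') + 1:-2]
            # Drop one level of backslash escaping: \" -> " and \\ -> \
            return re.sub(r'\\(.)', r'\1', escaped)
    return None


# Hypothetical page line, written as a Python literal.
page = 'foobar = ["{\\"body\\": \\"\\\\u05e9\\"}"]'
message = json.loads(extract_json(page))
print(message['body'])
```

Unlike replacing only \" with ", the regular expression also turns the doubled backslash in front of u05e9 into a single one, so json.loads() sees a real JSON escape rather than a literal backslash.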

Don't use 'unicode-escape' encoding on JSON text; it may produce different results:

>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'

'😂' == '\U0001F602' is U+1F602 (FACE WITH TEARS OF JOY).
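To make the difference concrete, the following sketch compares the two results for the same input. The variable names are illustrative only. The last step, a round trip through UTF-16 with the surrogatepass error handler, is one possible way to repair a string that already contains split surrogates; it is not part of the answer's recommendation, which is to let json decode the escapes in the first place.

```python
import json

json_text = '["\\ud83d\\ude02"]'

# json combines the surrogate pair into a single code point.
via_json = json.loads(json_text)[0]

# unicode-escape leaves two lone surrogates (slice off the '["' and '"]').
broken = json_text.encode('ascii').decode('unicode-escape')[2:-2]

print(len(via_json), len(broken))

# Lone surrogates cannot be encoded as UTF-8.
try:
    broken.encode('utf-8')
    reason = None
except UnicodeEncodeError as exc:
    reason = exc.reason
print(reason)

# Possible repair: pass the surrogates through UTF-16, which pairs them up.
repaired = broken.encode('utf-16', 'surrogatepass').decode('utf-16')
print(repaired == via_json)
```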

