Decoding a response with mixed UTF-8 encoding in Python

Question

I'm downloading data from a website using aiohttp and I get a bytes object as the response, but I'm having a hard time decoding it. This is an example of the response I get:

b'\\r\\nLocalit\xc3\xa0' # Località
b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n' # <div>12/09/2019</div>

From what I understand, it has normal UTF-8 encoded text, plus escaped Unicode for the HTML tags and line feeds. If I try to decode it using str(content, "utf-8"), I still have the HTML tags in this format:

\u003cdiv\u003e12/09/2019\u003c/div\u003e\r\n
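
For reference, the behavior can be reproduced with a short snippet. This is only a sketch: the sample bytes are an assumption that joins the two fragments shown earlier, and `content` stands in for the body returned by aiohttp.

```python
# Sample bytes combining the two fragments shown above (an assumed stand-in
# for the real response body).
content = b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\nLocalit\xc3\xa0'

# Plain UTF-8 decoding handles the two-byte sequence for "à" correctly...
text = str(content, "utf-8")

# ...but the backslash escapes are ordinary ASCII characters to the UTF-8
# codec, so they come through unchanged.
print(text)
```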

Should I just do a manual .replace("\\u003c", "<") for every tag, or is there a more elegant solution?

Answer 1

You could use the 'unicode-escape' codec to convert the escaped part, then re-encode transparently to bytes (latin-1 is convenient for this, as it provides a 1-to-1 correspondence between bytes and chars), then decode as 'utf-8':

b = b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n\\r\\nLocalit\xc3\xa0'
b.decode('unicode-escape').encode('latin1').decode('utf8')
# '<div>12/09/2019</div>\r\n\r\nLocalità'
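
The one-liner above can be unpacked into a small helper to make each step visible. This is a minimal sketch: the name `decode_mixed` and the variable names are hypothetical, and it assumes, as in the sample, that every escape sequence stands for an ASCII character.

```python
def decode_mixed(raw: bytes) -> str:
    """Decode bytes that mix raw UTF-8 with literal backslash escapes."""
    # Step 1: interpret \uXXXX, \r, \n and similar escapes. Every other
    # byte is mapped to the code point with the same value (Latin-1), so
    # the UTF-8 bytes for "à" temporarily become two separate characters.
    unescaped = raw.decode('unicode-escape')

    # Step 2: turn each character back into the single byte it came from.
    # This raises UnicodeEncodeError if an escape produced a code point
    # above U+00FF.
    as_bytes = unescaped.encode('latin-1')

    # Step 3: the bytes are now plain UTF-8 and decode normally.
    return as_bytes.decode('utf-8')


sample = b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n\\r\\nLocalit\xc3\xa0'
result = decode_mixed(sample)
print(repr(result))
```

One limitation worth knowing: the round trip only works while the escapes stay in the ASCII range. An escape such as \u20ac fails at the latin-1 step, and one such as \u00e0 yields a lone byte that is not valid UTF-8, so the final decode fails.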

Answer 1 (accepted 2020-04-26 09:53:15)
