Python unicode accent a (à) hex

Question

I have a string from bs4 that is

s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"

\u00c3\u00a0 should be an accented a (à). I have gotten it to show up partly correct in the console as

vinili-disponibili/311-canzoniere-del-lazio-lassa-stÃ -la-me-creatura.html

with

str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))

but it's decoding c3 and a0 separately, so I get a tilde A (Ã) instead of an accented a. I know that c3 a0 is the hex UTF-8 encoding of à. I have no idea what's going on; I got this far using Google and by combining the answers I found. This entire character encoding thing seems like a big mess to me.

The way it is supposed to be is

311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html

EDIT: Andrey's method worked when printing it out, but when I try to use urlopen with the string I get UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128)

After using unquote(str,":/") it gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128).

Answer 1

Assuming Python 2:

This is a byte string containing Unicode escapes. The escapes were generated incorrectly from UTF-8-encoded data:

>>> s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
>>> s.decode('unicode-escape')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'

Now it is a Unicode string, but it appears mis-decoded because the code points resemble UTF-8 bytes. It turns out the latin1 (also known as iso-8859-1) codec maps the first 256 code points directly to bytes 0-255, so use this trick to convert back to a byte string:

>>> s.decode('unicode-escape').encode('latin1')
'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
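The latin1 trick relies on that codec being a one-to-one map between code points U+0000 to U+00FF and byte values 0 to 255. A quick check of that property, written in Python 3 syntax (an assumption, since this answer targets Python 2; the variable names are illustrative):

```python
# Every byte value 0-255, as a bytes object.
all_bytes = bytes(range(256))

# Decoding with latin-1 yields one character per byte, with the same number.
as_text = all_bytes.decode('latin-1')
same_numbers = [ord(ch) for ch in as_text] == list(range(256))

# Encoding goes back to the identical bytes, so nothing is lost.
round_trips = as_text.encode('latin-1') == all_bytes

print(same_numbers, round_trips)
```

Because no information is lost in either direction, encoding the mis-decoded text as latin1 recovers exactly the UTF-8 bytes that were there before.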

Now it can be decoded correctly as UTF-8:

>>> s.decode('unicode-escape').encode('latin1').decode('utf8')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'

It is a Unicode string, so Python displays its repr() value, which shows code points above U+007F as escape codes. print it to see the actual value, assuming your terminal is correctly configured with an encoding that supports the characters printed:

>>> print(s.decode('unicode-escape').encode('latin1').decode('utf8'))
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html

Ideally, fix the problem that generated this string incorrectly in the first place instead of working around the mess.
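As an illustration of how such a string can arise (this is a guess at the cause, not something the question confirms): if UTF-8 bytes are decoded as latin1 somewhere upstream and the result is then serialized with ASCII-only escaping, for example by a JSON encoder, the output contains exactly the \u00c3\u00a0 sequence seen here. A Python 3 sketch, with hypothetical variable names:

```python
import json

original = 'st\xe0'  # 'stà', the intended text

# Hypothetical upstream bug: UTF-8 bytes read back with the wrong codec.
misdecoded = original.encode('utf-8').decode('latin-1')

# ASCII-only JSON output escapes each non-ASCII code point as \uXXXX.
serialized = json.dumps(misdecoded)
print(serialized)  # "st\u00c3\u00a0"

# Decoding the bytes as UTF-8 in the first place avoids the problem.
correct = original.encode('utf-8').decode('utf-8')
```

If the real pipeline looks anything like this, the fix belongs at the step that picks the wrong codec, not at the consumer.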

Answer 2

Transform the string back into bytes using .encode('latin-1'), then decode the \u unicode escapes with 'raw_unicode_escape', transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally decode "properly" as 'utf-8':

s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')

gives:

'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
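The four calls can also be unpacked into named steps to show what each one produces. This is a Python 3 sketch; the shortened raw string literal and the variable names are illustrative, chosen so that s holds a literal backslash followed by u00c3, which is what the question describes:

```python
# Raw literal: the backslashes are kept as characters, not interpreted.
s = r"lassa-st\u00c3\u00a0-la-me-creatura.html"

# Text -> bytes; the escapes are still spelled out as ASCII characters.
as_bytes = s.encode('latin-1')

# \uXXXX sequences -> code points U+00C3 and U+00A0.
unescaped = as_bytes.decode('raw_unicode_escape')

# Code points below 256 -> the bytes 0xC3 0xA0 with the same values.
utf8_bytes = unescaped.encode('latin-1')

# The bytes 0xC3 0xA0 are the UTF-8 encoding of U+00E0 (à).
fixed = utf8_bytes.decode('utf-8')

# ascii() keeps the output printable on any terminal.
print(ascii(unescaped))
print(utf8_bytes)
print(ascii(fixed))
```

The second latin-1 step is the "wrong" encoding the answer refers to: it is used only because it maps each code point below 256 to the byte with the same value.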

It works for the same reason as explained in this answer.
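The question's edit also mentions a UnicodeEncodeError from urlopen once the string contains à, which neither answer covers. One common approach, sketched here for Python 3 under the assumption that the server expects UTF-8 percent-encoding, is to quote the path with urllib.parse.quote before building the URL. The host name below is a placeholder, not taken from the question.

```python
from urllib.parse import quote

path = 'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'

# quote() encodes non-ASCII characters as UTF-8 and percent-escapes the bytes;
# '/' is in the default safe set, so the path separators are left alone.
quoted = quote(path)

url = 'http://example.com/' + quoted  # example.com is a placeholder host
print(url)
```

The resulting URL is pure ASCII, so it can be passed on without triggering the 'ascii' codec error.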

2 answers

Answer 1
1 2018-10-18 04:33:49

Answer 2
1 Accepted 2018-10-20 11:55:22
