
Python unicode accent a (à) hex

I have a string from bs4:

s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"

\u00c3\u00a0 should be accent a (à). I have gotten it to show up in the console partly correct as

vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html

with

str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))

but it's decoding c3 and a0 separately, so I get a tilde A (Ã) instead of an accent a. I know that c3 a0 is the hex UTF-8 for accent a. I have no idea what's going on; I got this far using Google and by combining the answers I found. This entire character encoding thing seems like a big mess to me.

The way it is supposed to be is

311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html

EDIT: Andrey's method worked when printing it out, but when I try to use urlopen with the string I get UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128)

After using unquote(str,":/") it gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128).

Assuming Python 2:

This is a byte string with Unicode escapes. The Unicode escapes were incorrectly generated for some UTF-8-encoded data:

>>> s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
>>> s.decode('unicode-escape')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'

Now it is a Unicode string, but it appears mis-decoded since the code points resemble UTF-8 bytes. It turns out the latin1 (also iso-8859-1) codec maps the first 256 code points directly to bytes 0-255, so use this trick to convert back to a byte string:

>>> s.decode('unicode-escape').encode('latin1')
'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
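The one-to-one mapping that makes this trick work can be checked directly. The following is a minimal sketch in Python 3 syntax (the answer itself uses Python 2); the variable names are only illustrative.

```python
# Python 3 sketch; names here are illustrative, not from the original answer.

# latin-1 maps code points U+0000..U+00FF one-to-one onto bytes 0x00..0xFF.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin-1')
assert [ord(ch) for ch in as_text] == list(range(256))
assert as_text.encode('latin-1') == all_bytes

# U+00E0 (a with grave accent) is two bytes in UTF-8: 0xC3 0xA0.
utf8_bytes = '\u00e0'.encode('utf-8')

# Reading those two bytes as latin-1 yields U+00C3 (A with tilde)
# followed by U+00A0 (no-break space), which is the garbled form.
mojibake = utf8_bytes.decode('latin-1')

# Encoding with latin-1 recovers the original bytes unchanged,
# and those bytes then decode properly as UTF-8.
repaired = mojibake.encode('latin-1').decode('utf-8')

print(ascii(utf8_bytes), ascii(mojibake), ascii(repaired))
```

Because no information is lost in the latin-1 round trip, the byte string obtained this way is exactly the UTF-8 data that was mis-decoded earlier.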

Now the byte string can be decoded correctly as UTF-8:

>>> s.decode('unicode-escape').encode('latin1').decode('utf8')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'

The result is a Unicode string, so Python displays its repr() value, which shows code points above U+007F as escape codes. print it to see the actual value, assuming your terminal is correctly configured with an encoding that supports the characters printed:

>>> print(s.decode('unicode-escape').encode('latin1').decode('utf8'))
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html

Ideally, fix the problem that generated this string incorrectly in the first place instead of working around the mess.

Transform the string back into bytes using .encode('latin-1'), then decode the \u unicode escapes, transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally decode "properly" as 'utf-8':

s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')

gives:

'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
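To see what each of the four steps does, here is a small Python 3 sketch that returns the intermediate values. The helper name repair is hypothetical, and the raw string is an assumption: it models the case where the scraped text contains a literal backslash followed by u00c3 rather than the already-interpreted characters.

```python
def repair(text):
    """Undo literal \\uXXXX escapes whose values are really UTF-8 bytes.

    Hypothetical helper for illustration; returns every intermediate value.
    """
    step1 = text.encode('latin-1')              # str -> bytes, one byte per character
    step2 = step1.decode('raw_unicode_escape')  # literal \uXXXX sequences -> characters
    step3 = step2.encode('latin-1')             # U+00C3, U+00A0 -> bytes 0xC3 0xA0
    step4 = step3.decode('utf-8')               # bytes 0xC3 0xA0 -> U+00E0
    return step1, step2, step3, step4


# Assumption: the input holds literal backslash escapes, hence the raw string.
raw = r"lassa-st\u00c3\u00a0-la-me-creatura.html"
for value in repair(raw):
    print(ascii(value))
```

The first step raises UnicodeEncodeError if the input contains any character above U+00FF, so this sketch only suits strings that are otherwise plain ASCII or latin-1.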

This chain works for the same reason as explained in this answer.
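Regarding the UnicodeEncodeError from urlopen mentioned in the question's edit: a URL passed to urlopen has to be ASCII, so one possible approach is to percent-encode the repaired path first. This is a rough sketch in Python 3, not something taken from the answers above. BASE_URL is a made-up placeholder, and it assumes the server expects UTF-8 percent-encoding.

```python
from urllib.parse import quote, urljoin

# Hypothetical base URL, for illustration only; substitute the real site.
BASE_URL = "http://example.com/"

# The repaired path, containing U+00E0 (a with grave accent).
path = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00e0-la-me-creatura.html"

# quote() percent-encodes non-ASCII characters as their UTF-8 bytes;
# "/" is left alone by default.
safe_path = quote(path)
url = urljoin(BASE_URL, safe_path)
print(url)

# The resulting ASCII-only URL can then be passed to urllib.request.urlopen(url).
```

Note that this uses quote rather than unquote: unquote turns percent-escapes back into characters, which is the opposite of what urlopen needs here.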
