Python unicode accent a (à) hex
I have a string from bs4 that is
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
\u00c3\u00a0 should be an accented a (à). I have gotten it to show up in the console partly correct as
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà -la-me-creatura.html
with
str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))
but it's decoding c3 and a0 separately, so I get a tilde A (Ã) instead of an accented a. I know that c3 a0 is the UTF-8 hex encoding of à. I have no idea what's going on; I got this far by Googling and combining the answers I found. This entire character encoding thing seems like a big mess to me.
The way it is supposed to be is
311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
EDIT: Andrey's method worked when printing it out, but trying to use urlopen with the string I get
UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128)
After using unquote(str,":/") it gives
UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128)
Assuming Python 2:
This is a byte string with Unicode escapes. The Unicode escapes were incorrectly generated for some UTF-8-encoded data:
>>> s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
>>> s.decode('unicode-escape')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it is a Unicode string, but it appears mis-decoded since the code points resemble UTF-8 bytes. It turns out the latin1 (also iso-8859-1) codec maps the first 256 code points directly to bytes 0-255, so use this trick to convert back to a byte string:
>>> s.decode('unicode-escape').encode('latin1')
'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it can be decoded correctly as UTF-8:
>>> s.decode('unicode-escape').encode('latin1').decode('utf8')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'
It is a Unicode string, so Python displays its repr() value, which shows code points above U+007F as escape codes. print it to see the actual value, assuming your terminal is correctly configured with an encoding that supports the characters printed:
>>> print(s.decode('unicode-escape').encode('latin1').decode('utf8'))
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
Ideally, fix the problem that generated this string incorrectly in the first place instead of working around the mess.
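The session above is Python 2. In Python 3, str has no .decode method, so the same chain has to start from bytes. A minimal sketch, assuming the scraped data contains literal backslash-u escapes (the variable names here are illustrative, not from the original post):

```python
# Assumption: the raw data holds the literal characters \u00c3\u00a0,
# so the backslashes are doubled in this bytes literal.
raw = b"vinili-disponibili/311-canzoniere-del-lazio-lassa-st\\u00c3\\u00a0-la-me-creatura.html"

# Step 1: interpret the \uXXXX escapes; yields code points U+00C3 and U+00A0.
mis_decoded = raw.decode("unicode-escape")

# Step 2: latin1 maps code points 0-255 one-to-one onto bytes 0-255,
# which recovers the original UTF-8 bytes 0xC3 0xA0.
utf8_bytes = mis_decoded.encode("latin1")

# Step 3: decode those bytes as the UTF-8 they really are.
fixed = utf8_bytes.decode("utf8")

print(fixed)
```

If step 2 raises UnicodeEncodeError, the text contains code points above U+00FF and was probably not mangled in this particular way.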
Transform the string back into bytes using .encode('latin-1'), then decode the \\u unicode escapes, transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally decode "properly" as 'utf-8':
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')
gives:
'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
It works for the same reason as explained in this answer.
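On the urlopen error from the question's EDIT: once the string is repaired it contains a real à, and the 'ascii' codec error suggests the URL is being encoded as ASCII before the request is sent. One possible way around that, assuming Python 3 and assuming the server expects UTF-8 percent-encoding, is to quote the path first. The base URL below is a placeholder, not the site from the question:

```python
from urllib.parse import quote

# Repaired path, as produced by the decoding steps in the answers.
path = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00e0-la-me-creatura.html"

# quote() encodes non-ASCII characters as UTF-8 by default and
# percent-escapes the resulting bytes; safe="/" keeps path separators.
quoted = quote(path, safe="/")
print(quoted)

# Hypothetical base URL, for illustration only.
base = "http://example.com/"
url = base + quoted  # pure ASCII, so it can be passed to urlopen
```

Only the path is quoted here; running quote over the whole URL would also escape the colon in the scheme.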