
Python - Unicode to ASCII conversion

I am unable to convert the following Unicode to ASCII without losing data:

u'ABRA\xc3O JOS\xc9'

I tried encode and decode, but neither does it.

Does anyone have a suggestion?

The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:

>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRA&#195;O JOS&#201;
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e

All of these are ASCII strings and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them is all that pretty for an end user, and none of them can be reversed just by decode('ascii').
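To make the "can be reversed" claim concrete, here is a minimal Python 3 sketch (the transcripts above are Python 2). The pairing of each encoding with a particular inverse step, and the name round_trips, are assumptions made for illustration rather than something the answer spells out.

```python
import html

s = 'ABRA\xc3O JOS\xc9'

# Each entry pairs an ASCII-only encoding of s with a function that undoes it.
round_trips = {
    'backslashreplace': (
        s.encode('ascii', errors='backslashreplace'),
        lambda b: b.decode('unicode_escape'),
    ),
    'xmlcharrefreplace': (
        s.encode('ascii', errors='xmlcharrefreplace'),
        lambda b: html.unescape(b.decode('ascii')),
    ),
    'unicode-escape': (
        s.encode('unicode_escape'),
        lambda b: b.decode('unicode_escape'),
    ),
    'punycode': (
        s.encode('punycode'),
        lambda b: b.decode('punycode'),
    ),
}

for name, (encoded, restore) in round_trips.items():
    is_ascii = all(byte < 128 for byte in encoded)
    print(name, encoded, is_ascii, restore(encoded) == s)
```

One caveat on this sketch: the backslashreplace and xmlcharrefreplace round trips are only safe when the original text contains no literal backslash escapes or character references of its own, because the inverse step would decode those too.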

See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.


As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:

>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'

The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes them, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function, or a web browser that you're serving a page to, or something else, things are more complicated, and there's no easy answer without a lot more information.
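Why the choice of character set matters can be seen in a short Python 3 sketch. It assumes the producer and consumer disagree about the codec; the variable names are illustrative only.

```python
s = 'ABRA\xc3O JOS\xc9'

utf8_bytes = s.encode('utf-8')
cp1252_bytes = s.encode('cp1252')

# Decoding with the codec that produced the bytes restores the text.
same_utf8 = utf8_bytes.decode('utf-8') == s
same_cp1252 = cp1252_bytes.decode('cp1252') == s

# UTF-8 bytes read as cp1252: no error, but the text is silently garbled.
mojibake = utf8_bytes.decode('cp1252')

# cp1252 bytes read as UTF-8: the lone 0xC3 byte is not valid UTF-8 here.
try:
    cp1252_bytes.decode('utf-8')
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True

print(same_utf8, same_cp1252, repr(mojibake), utf8_rejected)
```

A mismatch in one direction fails loudly, while the other direction quietly produces wrong text, which is why both sides need to agree on the encoding up front.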

I needed to calculate the MD5 hash of a unicode string received in an HTTP request. MD5 was raising UnicodeEncodeError, and Python's built-in encoding methods didn't work because they replace the characters in the string with the corresponding hex values, thus changing the MD5 hash. So I came up with the following code, which keeps the string intact while converting from unicode.

unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()

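This converts the unicode object to a plain byte string (in Python 2) and keeps all the data intact, provided every character has a code point below 256; chr raises ValueError for anything higher. The result is the same as unicode_string.encode('latin-1').strip().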
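For comparison, a Python 3 sketch of the same idea, where hashlib.md5 accepts only bytes and the encoding therefore has to be chosen explicitly. The sample string and variable names are assumptions for illustration; which encoding is correct depends on what the other party hashed.

```python
import hashlib

s = 'ABRA\xc3O JOS\xc9'

# Mapping each character to the byte with the same value is Latin-1 encoding.
latin1_bytes = s.encode('latin-1')
per_char_bytes = bytes(ord(ch) for ch in s)  # Python 3 analogue of chr(ord(x))

# The digest depends on the bytes, so different encodings give different hashes.
md5_latin1 = hashlib.md5(latin1_bytes).hexdigest()
md5_utf8 = hashlib.md5(s.encode('utf-8')).hexdigest()

print(per_char_bytes == latin1_bytes)
print(md5_latin1 == md5_utf8)
```

Both sides must hash the same byte sequence, so the encoding used here has to match the one used by whoever computed the reference hash.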

I found the Unidecode library (https://pypi.org/project/Unidecode/) very useful:

>>> from unidecode import unidecode
>>> unidecode('ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode('30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode('\u5317\u4EB0')
'Bei Jing '
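If installing a third-party package is not an option, a rougher standard-library approach is to decompose characters with unicodedata and drop the combining marks. This is a different technique from what Unidecode does, not a reimplementation of it, and strip_accents is a hypothetical helper name. Like Unidecode, it is lossy; unlike Unidecode, it simply discards characters that have no ASCII base letter.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: best-effort, lossy reduction of text to ASCII."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks, keeping the base letters.
    without_marks = ''.join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII (e.g. CJK characters) is discarded.
    return without_marks.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('ABRA\xc3O JOS\xc9'))
print(strip_accents('ko\u017eu\u0161\u010dek'))
```

Because the accents are thrown away, the result cannot be converted back to the original string.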
