简体   繁体   中英

Python - Unicode to ASCII conversion

I am unable to convert the following Unicode to ASCII without losing data:

u'ABRA\xc3O JOS\xc9'

I tried encode and decode and they won't do it.

Does anyone have a suggestion?

The Unicode characters u'\\xce0' and u'\\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:

>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRAÃO JOSÉ
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e

All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii') ).

Seestr.encode , Python Specific Encodings , and Unicode HOWTO for more info.


As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:

>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'

The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.

I needed to calculate the MD5 hash of a unicode string received in HTTP request . MD5 was giving UnicodeEncodeError and python built-in encoding methods didn't work because it replaces the characters in the string with corresponding hex values for the characters thus changing the MD5 hash . So I came up with the following code, which keeps the string intact while converting from unicode .

unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()

This removes the unicode part from the string and keeps all the data intact.

I found https://pypi.org/project/Unidecode/ this library very useful

>>> from unidecode import unidecode
>>> unidecode('ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode('30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode('\u5317\u4EB0')
'Bei Jing '

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM