Python Polish character encoding issues

I'm having some issues with character encoding, specifically with Polish characters.

I need to replace all non-windows-1252 characters with a windows-1252 equivalent. I had this working until I needed to handle Polish characters. How can I replace these characters?

The é, for example, is a windows-1252 character and must stay as it is. But the ł is not a windows-1252 character and must be replaced with its equivalent (or stripped if it has no equivalent).

I tried this:

import unicodedata

text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))

This prints:

Racawicka Roge

But now the ó and é are also converted to o and e, even though both are valid windows-1252 characters.

How can I get this right?

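If the target is windows-1252, that is the codec you should pass to encode and decode. The 'ignore' error handler then drops only the characters that windows-1252 cannot represent: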

>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'
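To see exactly which characters 'ignore' silently drops, a minimal sketch like the following may help. The helper name not_in_cp1252 is made up for illustration; it only uses str.encode from the standard library.

```python
def not_in_cp1252(text):
    """Return the characters of text that cp1252 cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            # cp1252 has no byte for this character.
            missing.append(ch)
    return missing


text = "Racławicka Rógé"
print(not_in_cp1252(text))  # ['ł']

# 'replace' marks each lost character with '?' instead of dropping it.
print(text.encode("cp1252", "replace").decode("cp1252"))  # Rac?awicka Rógé
```

Only the ł is reported, which is why ó and é survive the round trip while ł disappears.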

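If you are not handling large texts, as in your example, you can combine the third-party Unidecode library with the solution provided by jonrsharpe: keep every character that windows-1252 can represent, and transliterate only the ones it cannot.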

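from unidecode import unidecode  # third-party package: pip install Unidecode

text = 'Racławicka Rógé'
result = ''

for ch in text:
    try:
        # Keep the character as-is if windows-1252 can represent it.
        result += ch.encode('1252').decode('1252')
    except UnicodeEncodeError:
        # Otherwise transliterate it to the closest ASCII form.
        result += unidecode(ch)

print(result)  # Raclawicka Rógé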
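For larger texts, the per-character loop can be avoided with a custom error handler, which Python calls only for the characters the codec cannot encode. The following is a rough standard-library sketch, not part of the answers above: the handler name cp1252_fallback, the helper to_cp1252_text, and the small FALLBACKS table are all assumptions made for illustration, and the table would need extending for other letters that have no Unicode decomposition.

```python
import codecs
import unicodedata

# Hypothetical hand-picked fallbacks for letters with no Unicode decomposition.
FALLBACKS = {"ł": "l", "Ł": "L"}


def cp1252_fallback(error):
    """Error handler: receives only the slice that cp1252 cannot encode."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    for ch in error.object[error.start:error.end]:
        if ch in FALLBACKS:
            pieces.append(FALLBACKS[ch])
            continue
        # Decompose (e.g. 'ą' -> 'a' + combining ogonek) and drop the marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Strip whatever still has no cp1252 equivalent.
        pieces.append(base.encode("cp1252", "ignore").decode("cp1252"))
    return "".join(pieces), error.end


codecs.register_error("cp1252_fallback", cp1252_fallback)


def to_cp1252_text(text):
    """Return text with every non-cp1252 character replaced or stripped."""
    return text.encode("cp1252", "cp1252_fallback").decode("cp1252")


print(to_cp1252_text("Racławicka Rógé"))  # Raclawicka Rógé
```

Because the handler never sees characters that windows-1252 already supports, ó and é pass through untouched, and no third-party package is required.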
