Python Polish character encoding issues

Question

I'm having some issues with character encoding, in this particular case with Polish characters.

I need to replace all non-windows-1252 characters with a windows-1252 equivalent. I had this working until I needed to handle Polish characters. How can I replace these characters?

The é, for example, is a windows-1252 character and must stay as it is. But the ł is not a windows-1252 character and must be replaced with its equivalent (or stripped if it has no equivalent).

I tried this:

import unicodedata

text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))

This prints:

Racawicka Roge

But now the ó and é are also reduced to o and e, even though both are windows-1252 characters.

How can I get this right?

Answer 1

If you want to move to 1252, that's what you should tell encode and decode:

>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'
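
The 'ignore' handler drops ł outright. If letters such as ń or ą should fall back to their base letter instead of disappearing, one possible extension is a custom error handler registered through codecs.register_error. This is only a sketch and not part of the original answer: the handler name strip_marks is made up for illustration, and ł is still dropped because it has no Unicode decomposition.

```python
import codecs
import unicodedata


def strip_marks(error):
    """Error handler: replace unencodable characters with whatever part
    of their NFKD decomposition cp1252 can represent (possibly nothing)."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    # "ń" decomposes to "n" + a combining acute; "ł" does not decompose.
    decomposed = unicodedata.normalize("NFKD", bad)
    # Keep only the pieces cp1252 can hold, so the replacement is always encodable.
    replacement = decomposed.encode("cp1252", "ignore").decode("cp1252")
    return replacement, error.end


# "strip_marks" is a hypothetical name chosen for this example.
codecs.register_error("strip_marks", strip_marks)

print("Racławicka Rógé".encode("cp1252", "strip_marks").decode("cp1252"))  # Racawicka Rógé
print("Gdańsk".encode("cp1252", "strip_marks").decode("cp1252"))           # Gdansk
```

Characters that cp1252 already supports never reach the handler, so ó and é are left alone.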

Answer 2

If you are not dealing with large texts, as in your example, you can use the Unidecode library together with the solution provided by jonrsharpe.

from unidecode import unidecode

text = u'Racławicka Rógé'
result = ''

for i in text:
    try:
        result += i.encode('1252').decode('1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        result += unidecode(i)

print(result)  # prints 'Raclawicka Rógé'
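
If installing Unidecode is not an option, a rough standard-library approximation of the same per-character loop might look like the sketch below. The to_cp1252 function and the FALLBACKS table are assumptions made for this example, not something from the answer; the table would need an entry for every letter that Unicode cannot decompose, and only ł and Ł are covered here.

```python
import unicodedata

# Hypothetical hand-written replacements for letters without a Unicode
# decomposition. Extend as needed; this table is an assumption.
FALLBACKS = {"ł": "l", "Ł": "L"}


def to_cp1252(text):
    """Return text with every non-cp1252 character replaced or stripped."""
    out = []
    for ch in text:
        try:
            ch.encode("cp1252")  # already representable: keep as-is
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        if ch in FALLBACKS:
            out.append(FALLBACKS[ch])
            continue
        # Decompose (e.g. "ą" -> "a" + combining ogonek) and keep only
        # the parts cp1252 can hold; characters with no usable part vanish.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append(decomposed.encode("cp1252", "ignore").decode("cp1252"))
    return "".join(out)


print(to_cp1252("Racławicka Rógé"))  # Raclawicka Rógé
```

Unlike Unidecode, this does not transliterate other scripts; anything outside the table and without a cp1252-compatible decomposition is simply removed.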

Answer 1: score 3, accepted, 2014-12-04 15:30:08

Answer 2: score 0, 2014-12-04 16:18:47
