I'm having some issues with character encoding, and in this special case with Polish characters.
I need to replace all none windows-1252 characters with a windows-1252 equivalent. I had this working until I needed to work with Polish characters. How can I replace these characters?
The é
for example is a windows-1252 character and must stay this way. But the ł
is not a windows-1252 character and must be replaced with its equivalent (or stripped if it hasn't a equivalent).
I tried this:
import unicodedata
text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))
This prints:
Racawicka Roge
But now the ó
and é
are both encoded to o
and e
.
How can I get this right?
If you want to move to 1252
, that's what you should tell encode
and decode
:
>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'
If you are not handling with big texts, just like your example, you can make use of Unidecode library with the solution provided by jonrsharpe .
from unidecode import unidecode
text = u'Racławicka Rógé'
result = ''
for i in text:
try:
result += i.encode('1252').decode('1252')
except (UnicodeEncodeError, UnicodeDecodeError):
result += unidecode(i)
print result # which will be 'Raclawicka Rógé'
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.