简体   繁体   中英

Handle wrongly encoded character in Python unicode string

I am dealing with unicode strings returned by the python-lastfm library.

I assume somewhere on the way, the library gets the encoding wrong and returns a unicode string that may contain invalid characters.

For example, the original string i am expecting in the variable a is "Glück"

>>> a
u'Gl\xfcck'
>>> print a
Traceback (most recent call last):
  File "", line 1, in 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

\\xfc is the escaped value 252, which corresponds to the latin1 encoding of "ü". Somehow this gets embedded in the unicode string in a way python can't handle on its own.

How do i convert this back a normal or unicode string that contains the original "Glück"? I tried playing around with the decode/encode methods, but either got a UnicodeEncodeError, or a string containing the sequence \\xfc.

You have to convert your unicode string into a standard string using some encoding eg utf-8:

some_unicode_string.encode('utf-8')

Apart from that: this is a dupe of

BeautifulSoup findall with class attribute- unicode encode error

and at least ten other related questions on SO. Research first.

Your unicode string is fine:

>>> unicodedata.name(u"\xfc")
'LATIN SMALL LETTER U WITH DIAERESIS'

The problem you see at the interactive prompt is that the interpreter doesn't know what encoding to use to output the string to your terminal, so it falls back to the "ascii" codec -- but that codec only knows how to deal with ASCII characters. It works fine on my machine (because sys.stdout.encoding is "UTF-8" for me -- likely because something like my environment variable settings differ from yours)

>>> print u'Gl\xfcck'
Glück

At the beginning of your code, just after imports, add these 3 lines.

import sys  # import sys package, if not already imported
reload(sys)
sys.setdefaultencoding('utf-8')

It will override system default encoding (ascii) for the course of your program.

Edit: You shouldn't do this unless you are sure of the consequences, see comment below. This post is also helpful: Dangers of sys.setdefaultencoding('utf-8')

Do not str() cast to string what you've got from model fields, as long as it is an unicode string already. (oops I have totally missed that it is not django-related)

I stumble upon this bug myself while processing a file containing german words that I was unaware it has been encoded in UTF-8. The problem manifest itself when I start processing words and some of them would't show the decoding error.

# python
Python 2.7.12 (default, Aug 22 2019, 16:36:40) 
>>> utf8_word = u"Gl\xfcck"
>>> print("Word read was: {}".format(utf8_word))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

I solve the error calling the encode method on the string:

>>> print("Word read was: {}".format(utf8_word.encode('utf-8')))
Word read was: Glück

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM