How do I print a list of strings, when I can't know the char encoding in advance?

Question

I am retrieving a list of names from a webservice using a client I've written in Python. Upon retrieving the list, I encode each name to unicode and then print each of them to stdout. When I get to the name "Ólafur Jóhann Ólafsson", I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Since I cannot know what the encoding will be, how do I convert all of these strings to unicode? Or can you suggest a better way to handle this problem?

Answer 1

The UnicodeDammit class from BeautifulSoup can guess the encoding for you.

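from BeautifulSoup import UnicodeDammit  # BeautifulSoup 3, Python 2

u = UnicodeDammit("Ólafur Jóhann Ólafsson")  # pass in the raw byte string

print u.unicode           # the decoded Unicode string
print u.originalEncoding  # the encoding that was guessed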
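If BeautifulSoup is not an option, a much cruder stand-in is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. This is only a sketch and is not how UnicodeDammit works internally. The function name guess_decode and the candidate list are assumptions made for illustration, and the code is written for Python 3, where bytes and str are separate types.

```python
def guess_decode(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: the order matters, because latin-1 accepts any
    byte sequence and therefore only makes sense as the last resort.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# The same name as UTF-8 bytes and as single-byte Western European bytes.
print(guess_decode(b"\xc3\x93lafur"))
print(guess_decode(b"\xd3lafur"))
```

A clean decode does not prove the guess is right, so an encoding declared by the web service should always win over a guess like this.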

Answer 2

This page may help: http://wiki.python.org/moin/PrintFails

The problem, I guess, is that you are printing those names to the console. Do you really need to, or is this just a test environment? If you only use the console for testing, you could switch to other tools, such as unit tests, to check exactly what values you get.

Answer 3

First of all, decode data to Unicode (the absence of encoding) when reading from a file, pipe, socket, terminal, etc., and encode Unicode to an appropriate byte encoding when sending or persisting data. I suspect that mixing up these two steps is the root of your problem.

The web service should declare the encoding in the headers or in the data received. print normally encodes Unicode automatically to the terminal's encoding (discovered through sys.stdout.encoding), or to ascii if that is not available. If the characters in your data are not supported by the target encoding, you'll get a UnicodeEncodeError.
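As a minimal sketch of using the declared encoding, the snippet below reads the charset parameter from a Content-Type header value and decodes the response body with it. The helper names charset_from_content_type and decode_body, the sample header values, and the utf-8 fallback are all assumptions for illustration. It uses only the standard library and is written for Python 3.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode the raw response bytes using the declared charset."""
    return raw.decode(charset_from_content_type(content_type))


# Hypothetical response: a body in ISO-8859-1, declared in the header.
print(decode_body(b"\xd3lafur", "text/plain; charset=ISO-8859-1"))
```

If the service declares nothing, the fallback here is only a guess, and the data may still fail to decode.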

Since that is not the error you received, you should post some code so we can see what you are doing. Most likely, you are encoding a byte string instead of decoding it. Here's an example of this:

>>> data = '\xc2\xbd' # UTF-8 encoded 1/2 symbol.
>>> data.encode('cp437')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

What I did here is call encode on a byte string. Since encode requires a Unicode string, Python first used the default ascii encoding to decode the byte string to Unicode, before encoding to cp437. That implicit decode is what raised the UnicodeDecodeError.
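The hidden step can be spelled out explicitly. The sketch below, written for Python 3 (where bytes has no encode method, so the decode has to be written by hand), performs the same ascii decode that Python 2 does implicitly and captures the resulting error. The variable names are chosen for illustration only.

```python
data = b"\xc2\xbd"  # UTF-8 encoded 1/2 symbol, as in the example above

try:
    # Step 1 is what Python 2 does silently: decode with ascii.
    # Step 2 (the encode to cp437) is never reached.
    data.decode("ascii").encode("cp437")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
# 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
```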

Fix this by decoding the data instead of encoding it; print will then encode to stdout automatically. As long as your terminal supports the characters in the data, they will display properly:

>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print data.decode('utf8') # implicit encode to sys.stdout.encoding
½
>>> print data.decode('utf8').encode('cp437') # explicit encode.
½
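When the terminal may not support every character, one option is to encode with errors="replace" so that unsupported characters degrade to "?" instead of raising UnicodeEncodeError. The sketch below assumes Python 3, assumes the service sends UTF-8, and uses a hypothetical helper named safe_for_terminal. None of this comes from the original question.

```python
import sys


def safe_for_terminal(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# Assumed input: names received as UTF-8 bytes from the web service.
raw_names = [b"\xc3\x93lafur J\xc3\xb3hann \xc3\x93lafsson"]

for raw in raw_names:
    name = raw.decode("utf-8")      # decode once, on input
    print(safe_for_terminal(name))  # encode only at the output boundary
```

Replacing characters loses information, so this is only suitable for display, not for data that will be stored or sent on.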

3 answers

solution1
1 2010-09-06 16:08:36

solution2
1 2010-09-06 20:10:02

solution3
1 ACCEPTED 2010-09-07 04:19:04