Python 2.7 UnicodeDecodeError: 'ascii' codec can't decode byte

Question

I've been parsing some docx files (UTF-8 encoded XML) with special characters (Czech alphabet). When I print to stdout, everything goes smoothly, but I'm unable to write the data to a file:

Traceback (most recent call last):
  File "./test.py", line 360, in <module>
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)

I explicitly cast the word variable to the unicode type (type(word) returned unicode) and also tried to encode it with .encode('utf-8'), but I'm still stuck with this error.

Here is a sample of the code as it looks now:

for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

I also tried the following:

for word in word_list:
    word = word.encode('utf-8')
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

Even the combination of these two:

word = unicode(word)
word = word.encode('utf-8')

I was kind of desperate, so I even tried to encode the word variable inside the ofile.write() call:

ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')

I would appreciate any hints of what I'm doing wrong.

Answer 1

ofile is a byte stream, and you are writing a character (unicode) string to it. Python 2 tries to handle your mistake by implicitly encoding the string to a byte string with the default 'ascii' codec. This is only safe with ASCII characters. Since word contains non-ASCII characters, it fails:

>>> open('/dev/null', 'wb').write(u'ä')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0:
                    ordinal not in range(128)
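
The implicit step can also be reproduced by hand. This is only a minimal sketch: the sample word is a hypothetical stand-in for the asker's data, and it assumes the implicit conversion behaves like an explicit .encode('ascii'):

```python
# -*- coding: utf-8 -*-
# Sketch: perform explicitly the ASCII encoding that Python 2 attempts
# implicitly when a unicode string is written to a byte stream.
# The sample word is hypothetical, not taken from the asker's data.
word = u'slovn\xedk'  # contains u'\xed' (i with acute), which is not ASCII

bad_char = None
try:
    word.encode('ascii')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # exc.object is the string being encoded; start/end mark the offender
    bad_char = exc.object[exc.start:exc.end]

# An encoding that covers the character works without complaint
encoded = word.encode('utf-8')
```

Here failed ends up True and bad_char holds the character the 'ascii' codec rejected, while the UTF-8 encoding succeeds.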

Make ofile a text stream by opening the file with io.open, with a mode like 'wt' and an explicit encoding:

>>> import io
>>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
1L

Alternatively, you can use codecs.open, which has pretty much the same interface, or encode all strings manually with encode.
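
Applied to the line from the question, the text-stream approach might look roughly like this. It is a sketch, not the asker's actual script: the temporary file and the sample word are assumptions made so the snippet is self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

word = u'slovn\xedk'  # hypothetical sample word with a non-ASCII character
line = u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n'

# A temporary file stands in for the real output file
fd, path = tempfile.mkstemp(suffix='.xml')
os.close(fd)
try:
    # Text stream: accepts unicode strings and encodes them as UTF-8
    with io.open(path, 'wt', encoding='utf-8') as ofile:
        ofile.write(line)

    # Raw bytes as stored on disk
    with io.open(path, 'rb') as raw:
        data = raw.read()

    # Reading back through a text stream yields the original unicode string
    with io.open(path, 'rt', encoding='utf-8') as infile:
        round_trip = infile.read()
finally:
    os.remove(path)
```

With a text stream, only unicode strings should be passed to write(); the encoding happens in one place, when the file is opened.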

Answer 2

Phihag's answer is correct. I just want to suggest converting the unicode string to a byte string manually, with an explicit encoding:

ofile.write((u'\t\t\t\t\t<feat att="writtenForm" val="' +
             word + u'"/>\n').encode('utf-8'))

(You might like to know how it's done using basic mechanisms instead of advanced wizardry and black magic like io.open.)
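
If many lines are written this way, the encode call can live in one small helper. A sketch only: write_feat is a hypothetical name, the sample words are made up, and io.BytesIO stands in for a file opened in 'wb' mode.

```python
# -*- coding: utf-8 -*-
import io


def write_feat(ofile, word):
    """Write one <feat> line to a byte stream, encoding explicitly as UTF-8.

    `word` must be a unicode string; `ofile` must accept byte strings.
    """
    line = u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n'
    ofile.write(line.encode('utf-8'))


# BytesIO stands in for a file opened with mode 'wb'
buf = io.BytesIO()
for word in [u'slovn\xedk', u'\u010de\u0161tina']:
    write_feat(buf, word)

output = buf.getvalue()
```

The whole line is built as unicode first and encoded once at the end, so unicode and byte strings are never mixed in a single concatenation.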

Answer 3

I've had a similar error when writing to Word documents (.docx), specifically with the Euro symbol (€):

x = "€".encode()

Which gave the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

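I solved it by decoding the byte string with an explicit encoding instead (a bare .decode() falls back to the 'ascii' codec and raises the same error):

x = "€".decode('utf-8')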

I hope this helps!
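
To make the two directions concrete, here is a small sketch. It assumes the euro sign arrives as UTF-8 bytes, which is what "€" holds in a Python 2 source file saved as UTF-8:

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes of the euro sign (assumed input encoding)
raw = b'\xe2\x82\xac'

# Decoding with the 'ascii' codec fails on the first byte, 0xe2
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# bytes -> unicode: decode with the encoding the bytes are actually in
text = raw.decode('utf-8')

# unicode -> bytes: encode with the encoding the output should use
back = text.encode('utf-8')
```

Calling .encode() on a byte string in Python 2 first decodes it with the 'ascii' codec, which is why the original error is a UnicodeDecodeError even though encode was called.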

Answer 4

The best solution I found on Stack Overflow is in this post: How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte". Put the following at the beginning of the code and the default encoding will be UTF-8:

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

4 answers

solution1
11 ACCEPTED 2012-11-22 12:13:41

solution2
2 2012-11-22 12:32:59

solution3
2 2014-11-30 20:49:28

solution4
1 2016-11-14 12:59:54
