Reading Unicode Files - Python3.2

Question

I'm trying to read some files using Python 3.2. Some of the files may contain Unicode, while others do not.

When I try:

with open(item_path + item, encoding="utf-8") as file:
    for line in file:
        print(repr(line))

I get the error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)

I am following the documentation here: http://docs.python.org/release/3.0.1/howto/unicode.html

Why would Python be trying to encode to ascii at any point in this code?

Answer 1

The problem is that in Python 3, repr(line) also returns a Unicode string. It does not convert characters above code point 127 to ASCII escape sequences, so printing the result to an ASCII-only output can fail.

Use ascii(line) instead if you want to see the escape sequences.

Actually, repr(line) is expected to return a string that, if placed in source code, would produce an object with the same value. Seen this way, the Python 3 behaviour is fine: source files do not need ASCII escape sequences to express a string containing non-ASCII characters. It is quite natural to use UTF-8 or some other Unicode encoding these days. Python 2, by contrast, produced escape sequences for such characters.
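To make the difference concrete, here is a minimal sketch comparing repr() and ascii() on a sample string. The sample text is made up for illustration and is not taken from the asker's files.

```python
# Sample text with non-ASCII characters (an assumption for illustration).
line = "caf\u00e9 \u4e2d\u6587\n"

as_repr = repr(line)    # printable non-ASCII characters are kept as they are
as_ascii = ascii(line)  # every non-ASCII character becomes an escape sequence

# The ascii() result contains only ASCII characters, so it can be printed
# even when the output encoding is ASCII.
print(as_ascii)
```

Only the ascii() result is printed here, because printing as_repr is exactly the operation that fails on an ASCII-only output.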

Answer 2

What's your output encoding? If you remove the call to print(), does it start working?

I suspect you've got a non-UTF-8 locale, so Python is trying to encode repr(line) as ASCII as part of printing it.
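One way to check this suspicion is to ask Python which encodings it has picked up. This is only a diagnostic sketch; the values depend entirely on the environment it runs in.

```python
import locale
import sys

# Encoding that print() uses when writing to standard output.
# It can be None if stdout has been replaced by an object without an encoding.
stdout_encoding = sys.stdout.encoding

# Encoding suggested by the current locale settings.
preferred = locale.getpreferredencoding(False)

print("stdout encoding:", stdout_encoding)
print("locale preferred encoding:", preferred)
```

If the first value is an ASCII variant such as ANSI_X3.4-1968 or US-ASCII, print() will raise UnicodeEncodeError for any character above code point 127.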

To resolve the issue, you must either encode the string yourself and write the resulting bytes, or set your output encoding to something that can handle your strings (UTF-8 being the obvious choice), for example through a UTF-8 locale or the PYTHONIOENCODING environment variable.
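Both options can be sketched as follows. In-memory BytesIO objects stand in for the console so the result can be inspected; on a real terminal the binary stream would be sys.stdout.buffer. The sample string and variable names are assumptions made for illustration.

```python
import io

text = "caf\u00e9"  # sample string with one non-ASCII character

# Option 1: encode explicitly and write the bytes to a binary stream.
raw1 = io.BytesIO()
raw1.write((text + "\n").encode("utf-8"))
option1_bytes = raw1.getvalue()

# Option 2: give the text layer an encoding that can represent the string.
raw2 = io.BytesIO()
utf8_out = io.TextIOWrapper(raw2, encoding="utf-8", newline="\n")
print(text, file=utf8_out)
utf8_out.flush()
option2_bytes = raw2.getvalue()

# For comparison, an ASCII text layer reproduces the original error.
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline="\n")
try:
    print(text, file=ascii_out)
    ascii_out.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Outside the script, running it with PYTHONIOENCODING=utf-8 set in the environment has a similar effect to option 2 for the standard streams, without changing the code.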

2 answers

solution1
3 2012-04-25 11:49:15

solution2
2 ACCEPTED 2012-04-25 09:32:12
