Reading file, outputting UTF-8/Unicode

I'm trying to read a file and compare its characters. However, when I print the lines after reading them with:

    with open('Q1.txt') as f:
        content = f.read().splitlines()

I'm getting '\x80', '\xe2', '\x9d', etc.

What do these mean and how can I get rid of them?

Thanks.

The open() builtin does not handle text encoding in Python 2.x: it returns raw byte strings, so each byte of a multi-byte UTF-8 character shows up as a separate hex escape like '\xe2' when you look at its repr. You can use the io module to get a more capable open function that takes a parameter for the encoding:

import io
with io.open(fname, 'r', encoding='utf-8') as f:
    ...

Conveniently, this works in both Python 2.6+ and 3.x, so you won't run into mysterious encoding problems if the code is later ported to Python 3. Incidentally, the open builtin in 3.x is an alias for io.open, and the backported version in 2.6+ has the same functionality. The io module is intended to supersede the codecs module and has some internal improvements, so its open is preferable in new code.
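To see what those stray bytes actually are, you can decode them by hand. A minimal sketch, assuming the file contains curly quotes, which the bytes in the question suggest: '\xe2', '\x80', '\x9d' are the UTF-8 encoding of U+201D, a right double quotation mark:

# These three bytes reassemble into a single character once decoded.
raw = b'\xe2\x80\x9d'        # the bytes seen in the undecoded file
text = raw.decode('utf-8')   # one character: RIGHT DOUBLE QUOTATION MARK
print(repr(text))            # u'\u201d' on Python 2, '”' on Python 3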

From the Unicode HOWTO docs: https://docs.python.org/2/howto/unicode.html

import codecs
f = codecs.open('Q1.txt', encoding='utf-8')
for line in f:
    print(repr(line))
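Once the lines are decoded, individual characters compare the way you'd expect, which is what the question is ultimately after. A minimal sketch in the same style, assuming Q1.txt and that the curly quote above is the character of interest:

import codecs
f = codecs.open('Q1.txt', encoding='utf-8')
for line in f:
    for ch in line:
        if ch == u'\u201d':   # the character behind the '\xe2\x80\x9d' bytes
            print('found a closing curly quote')
f.close()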

In Python 3, just use the builtin open as a context manager:

with open('Q1.txt', encoding='utf-8') as f:
    for line in f:
        print(line)
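If you want the Python 3 equivalent of the repr() output shown above, ascii() renders non-ASCII characters as escapes. A minimal sketch, assuming the same Q1.txt:

# ascii() escapes non-ASCII characters much like repr() does on Python 2,
# which is handy for checking what was actually decoded.
with open('Q1.txt', encoding='utf-8') as f:
    for line in f:
        print(ascii(line))    # e.g. '\u201cquoted\u201d\n'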
