I'm trying to read a file and compare its characters. However, when I print the contents after reading the lines with:
with open('Q1.txt') as f:
    content = f.read().splitlines()
I'm getting '\x80', '\xe2', '\x9d', etc.
What do these mean and how can I get rid of them?
Thanks.
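The open() builtin does not handle any text encoding in Python 2.x. It returns byte strings, so a multi-byte UTF-8 character comes through as its individual bytes, and those bytes are displayed as hex escapes such as '\xe2' when the list is printed. You can use the io module to get a more capable open function that provides a parameter to define the encoding: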
import io
with io.open(fname, 'r', encoding='utf-8') as f:
    ...
Conveniently, io.open works in both Python 2.6+ and 3.x, so you won't have mysterious encoding problems if the code is ported to Python 3 later. BTW, in 3.x the open builtin and io.open are the same function. The backported version in 2.6+ has the same functionality. The io module is intended to supersede the codecs module and has some internal improvements, so it is preferable to use its open in new code.
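As for what the escaped values in the question mean, here is a minimal sketch. It assumes the three bytes appeared together in the file in the order \xe2 \x80 \x9d (the question lists them without saying so), in which case they are the UTF-8 encoding of a single curly closing quote:

```python
import unicodedata

# Assumption: the bytes from the question, in the order UTF-8 would use them.
raw = b'\xe2\x80\x9d'

# Decoding turns the three bytes into one Unicode character.
text = raw.decode('utf-8')

print(len(raw))                # number of bytes
print(len(text))               # number of characters
print(unicodedata.name(text))  # official Unicode name of the character
```

If the file uses a different encoding, the same bytes would decode to something else, so check how the file was saved before relying on 'utf-8'.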
From the Unicode How-to docs: https://docs.python.org/2/howto/unicode.html
import codecs
f = codecs.open('Q1.txt', encoding='utf-8')
for line in f:
    print(repr(line))
In Python 3, just use the builtin open with a context manager:
with open('Q1.txt', encoding='utf-8') as f:
    for line in f:
        print(line)
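To tie this back to comparing characters, here is a rough end-to-end sketch. The sample text and the temporary file are made up for illustration (they stand in for Q1.txt), and replacing curly quotes with plain ones is only one possible way to "get rid of" them:

```python
import io
import os
import tempfile

# Hypothetical sample data: one line wrapped in curly quotes.
sample = u'\u201chello\u201d\n'

# Write the sample as UTF-8 bytes to a temporary file.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf-8'))

try:
    # Reading with an explicit encoding yields decoded text, not raw bytes.
    with io.open(path, 'r', encoding='utf-8') as f:
        lines = f.read().splitlines()
finally:
    os.remove(path)

first = lines[0]

# Comparisons now operate on whole characters rather than single bytes.
starts_with_curly_quote = (first[0] == u'\u201c')

# One option: map the curly quotes to plain ASCII quotes.
plain = first.replace(u'\u201c', u'"').replace(u'\u201d', u'"')
print(plain)
```

Whether to replace such characters or keep them depends on what the comparison is for. Once the text is decoded, either choice is straightforward.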