
Why does Python3 get a UnicodeDecodeError reading a text file where Python2 does not?

I'm reading in a text file. I've been doing it just fine with python2, but I decided to run my code with python3 instead.

My code for reading the text file is:

neg_words = []
with open('negative-words.txt', 'r') as f:
    for word in f:
        neg_words.append(word)

When I run this code on python 3 I get the following error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-14-1e2ff142b4c1> in <module>()
      3 pos_words = []
      4 with open('negative-words.txt', 'r') as f:
----> 5     for word in f:
      6         neg_words.append(word)
      7 with open('positive-words.txt', 'r') as f:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py in 
decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 3988: invalid continuation byte

It seems to me that there is a certain form of text that python2 decodes without any issue, which python3 can't.

Could someone please explain the difference between python2 and python3 with respect to this error? Why does it occur in one version but not the other? How can I stop it?

Your file is not UTF-8 encoded. Figure out what encoding is used and specify that explicitly when opening the file:

with open('negative-words.txt', 'r', encoding="<correct codec>") as f:
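As a minimal sketch of what the encoding argument changes: the snippet below writes a small sample file containing the byte 0xEF (the one named in the traceback) and reads it back twice. The file name sample-words.txt, the sample words, and the choice of latin-1 are assumptions for illustration only; the real encoding of negative-words.txt still has to be determined separately.

```python
import os
import tempfile

# Hypothetical sample data: "naive" with i-diaeresis, encoded as Latin-1,
# so the accented letter is stored as the single byte 0xEF.
sample_bytes = "na\u00efve\nfaux\n".encode("latin-1")

path = os.path.join(tempfile.mkdtemp(), "sample-words.txt")
with open(path, "wb") as f:
    f.write(sample_bytes)

# Decoding as UTF-8 fails: 0xEF announces a multi-byte sequence,
# but the byte that follows is a plain ASCII letter.
try:
    with open(path, "r", encoding="utf-8") as f:
        words = [line.strip() for line in f]
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print("utf-8 failed:", exc.reason)

# Decoding with the codec the sample was written in succeeds.
with open(path, "r", encoding="latin-1") as f:
    words = [line.strip() for line in f]

print(words)
```

Note that latin-1 can decode any byte sequence without raising an error, so a successful read does not by itself prove that latin-1 is the right codec; the decoded text still needs to be checked by eye.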

In Python 2, str is a binary string, containing encoded data, not Unicode text, so reading the file never decodes anything and cannot fail this way. You'd get the same issues in Python 2 if you were to import io and use io.open(), or if you were to try to decode the data you read with word.decode('utf8').
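The same distinction can be reproduced in Python 3 by working with bytes and decoding explicitly. This is a small sketch using a made-up byte string rather than the actual file contents: the raw bytes are always usable, and the error only appears at the decoding step, which open(..., 'r') performs implicitly in Python 3.

```python
# Made-up sample: the byte 0xEF followed by an ASCII letter is not valid UTF-8.
raw = b"na\xefve\n"

# Bytes can be handled without decoding, much like Python 2's str.
print(len(raw))  # counts bytes, not characters

# Decoding is where the failure occurs.
error_reason = None
error_start = None
try:
    raw.decode("utf8")
except UnicodeDecodeError as exc:
    error_reason = exc.reason  # short description of the problem
    error_start = exc.start    # index of the offending byte

print(error_reason, error_start)
```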

You probably want to read up on Unicode and Python. I strongly recommend Ned Batchelder's Pragmatic Unicode.

Or we can simply read the file in binary mode, which skips decoding and yields bytes instead of str:

with open(filename, 'rb') as f:
    pass

'r' open for reading (default)

'b' binary mode
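Binary mode can also help track down where the problem bytes are, since each line arrives as bytes and can be test-decoded individually. The following is only a sketch: the helper name find_undecodable_lines and the sample file are hypothetical, not part of the original code.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes) for each line that fails to decode."""
    bad_lines = []
    with open(path, "rb") as f:
        # Iterating a binary file yields bytes objects split on b"\n".
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad_lines.append((line_number, raw))
    return bad_lines


# Hypothetical sample file; the second line holds a stray 0xEF byte.
path = os.path.join(tempfile.mkdtemp(), "sample-words.txt")
with open(path, "wb") as f:
    f.write(b"abnormal\nna\xefve\nzombie\n")

bad = find_undecodable_lines(path)
print(bad)
```

Looking at the raw bytes of the reported lines often makes it easier to guess the file's real encoding, which can then be passed to open() via the encoding argument.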
