How to solve “UnicodeDecodeError: 'ascii' codec can't decode byte”

I am writing a program that counts the approximate number of words in a file, and I am getting an error stating that the 'ascii' codec can't decode a byte.

How can I eliminate this error?

Below is the traceback for the error:

Traceback (most recent call last):
  File "/Users/NikolaMac/Desktop/alice.py", line 23, in <module>
    contents = f_obj.read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Here is my code:

filename='alice.txt'

try:
    with open(filename) as f_obj:
        contents = f_obj.read()

except FileNotFoundError:
    msg = "Sorry, the file " + filename + " does not exist."
    print(msg)

else:
    # Count the approximate number of words in the file.
    words = contents.split()
    num_words = len(words)
    print("The file " + filename + " has about " + str(num_words) + " words.")

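You need to pass an encoding when opening the file. You can use the io.open function for this; in Python 3 it is the same function as the built-in open, so either one works. Byte 0xef at position 0 is the first byte of the UTF-8 byte order mark, so the file is most likely UTF-8.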

Try this:

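import io

filename = 'alice.txt'

with io.open(filename, encoding='utf-8') as f_obj:
    contents = f_obj.read()

# split() with no argument splits on any run of whitespace,
# so newlines and repeated spaces do not inflate the count.
print('Words: %d' % len(contents.split()))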

The error message shows that Python is trying to decode the file as ASCII. You need to specify a different encoding.

The only part of your program where I can see an encoding coming into play is the open call. According to the docs, if you don't pass in an encoding explicitly,

The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)

Try passing in encoding='utf-8' to the open call.
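As a rough sketch of that suggestion: the function name count_words and the sample text below are made up for illustration. It uses the "utf-8-sig" codec rather than plain "utf-8", on the assumption that the 0xef byte at position 0 is the start of a UTF-8 byte order mark; "utf-8-sig" decodes UTF-8 and drops that mark if it is present.

```python
import locale
import os
import tempfile


def count_words(path, encoding="utf-8-sig"):
    """Return the approximate number of whitespace-separated words in a file."""
    # "utf-8-sig" reads UTF-8 and skips a leading byte order mark, if any.
    with open(path, encoding=encoding) as f_obj:
        return len(f_obj.read().split())


if __name__ == "__main__":
    # The platform-dependent default mentioned in the docs quoted above.
    print("Default encoding:", locale.getpreferredencoding(False))

    # Hypothetical sample file that starts with the UTF-8 byte order mark.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
        tmp.write(b"\xef\xbb\xbfAlice was beginning to get very tired")
    try:
        print("Words:", count_words(tmp.name))
    finally:
        os.remove(tmp.name)
```

If the file turns out not to be UTF-8, the read will raise UnicodeDecodeError again, and a different encoding name has to be passed instead.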

As far as I know, ASCII-compatible encodings such as UTF-8 and Latin-1 encode the space character as the single byte 0x20 (this is from experience, and it does not hold for every encoding; EBCDIC is one exception). If all you need to do is count words, you can skip decoding entirely by counting the 0x20 bytes in the file and adding 1. This simple method gives you an approximation.

With that method, you may need to subtract the spaces at the very beginning or end of the file, since there is no word on one side of such a space. UTF-16 encodes a space as two bytes (0x20 0x00 in little-endian order, 0x00 0x20 in big-endian), so there might be a null byte at the beginning or end of the file if the document starts or ends with a space. Also, some encodings put a byte order mark at the beginning of the file, in which case the text doesn't start at the first byte.

You can't run text-based regexes on the undecoded bytes with this method, so it will not work if you want to parse documents in non-Latin-based languages.
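The byte-counting idea above could look roughly like this. The function name approx_word_count_bytes is hypothetical, and the sketch assumes UTF-8 or another single-byte, ASCII-compatible encoding; it does not handle UTF-16, where a space takes two bytes.

```python
def approx_word_count_bytes(data):
    """Estimate the word count of raw bytes by counting 0x20 (space) bytes."""
    # Skip a UTF-8 byte order mark so it is not mistaken for text.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    # Leading and trailing spaces have no word on one side; ignore them.
    data = data.strip(b" ")
    if not data:
        return 0
    # N separating spaces imply roughly N + 1 words.
    return data.count(b" ") + 1


if __name__ == "__main__":
    sample = b"\xef\xbb\xbf Alice was beginning to get very tired "
    print("Approximate words:", approx_word_count_bytes(sample))
```

For a real file, the bytes would come from open(filename, 'rb').read(). The estimate drifts when words are separated by newlines or tabs (undercount) or by runs of several spaces (overcount), which is why it is only approximate.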
