
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)

I am using the well-known 20 Newsgroups data set for text categorisation in Jupyter. When I try to open and read a file on my Mac, it fails at the decoding step. Reading the file as bytes works, but I need to work with it as a string. When I try to decode the bytes, it fails with the error below.

Code

with open(file_path, 'rb') as f:
  file_read=f.read()
  file_read.decode("us-ascii")

Error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)

us-ascii is the file encoding reported when I run file -I file_name in the terminal. I tried other encodings, but none worked. After reading the file I want to remove punctuation and count the words in it. Is there a way to overcome this issue?

It is tricky to say without looking at the file. However, this works most of the time:

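from codecs import open
file_path = "file_name"
# Binary mode with no encoding argument: f.read() returns bytes,
# which still have to be decoded before they can be used as a string.
with open(file_path, 'rb') as f:
  file_read=f.read()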
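Once the raw bytes are in hand, the decoding step can be made tolerant of stray bytes such as 0xff. The following is a minimal sketch, not the answerer's exact code: the `raw` bytes are made-up sample data standing in for the file contents, and the three options shown are standard `bytes.decode` behaviours.

```python
# Hypothetical sample bytes standing in for the result of f.read().
raw = b"From: someone\n\xffSubject: hello world\n"

# Strict ASCII decoding fails on the 0xff byte, as in the question.
try:
    raw.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Latin-1 maps every byte value 0-255 to a code point, so it never raises.
text_latin1 = raw.decode("latin-1")

# errors="replace" puts a U+FFFD marker where a byte could not be decoded.
text_replaced = raw.decode("ascii", errors="replace")

# errors="ignore" silently drops the undecodable bytes.
text_ignored = raw.decode("ascii", errors="ignore")
```

Which option fits depends on whether the stray bytes matter: "ignore" loses them without a trace, "replace" keeps a visible marker, and Latin-1 keeps every byte but may show the wrong character if the file really uses another encoding.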

Setting errors to 'ignore' resolved the problem, thanks N M. The code looks like this:

# errors='ignore' drops any byte that is not valid ASCII instead of raising
with open(ref_file_path, 'r', encoding='ascii', errors='ignore') as ref_file:
  file_read=ref_file.read()

The rest of the code then treats the contents as one big string. Note that although the error was about decoding 0xff, the file was not UTF-16 encoded.
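For the follow-up step mentioned in the question (removing punctuation and counting words), one possible approach on the decoded string is sketched below. The helper name `count_words` and the sample text are hypothetical, and this version strips only ASCII punctuation and lowercases everything, which may or may not suit the actual task.

```python
import string
from collections import Counter

def count_words(text):
    """Return word frequencies after deleting ASCII punctuation."""
    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    # Lowercase so "Hello" and "hello" count as the same word,
    # then split on any run of whitespace.
    words = text.translate(table).lower().split()
    return Counter(words)

# Hypothetical sample standing in for file_read.
sample = "Hello, world! Hello again."
counts = count_words(sample)
```

Here `counts["hello"]` is 2, and `sum(counts.values())` gives the total number of words.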
