UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)

Question

I am using the well-known 20 Newsgroups data set for text categorisation in Jupyter. When I try to open and read a file on my Mac, it fails at the decoding step. Reading the file in binary mode works, but I need to work with the contents as a string. When I try to decode the bytes, it fails with the error below.

Code

with open(file_path, 'rb') as f:
    file_read = f.read()
    text = file_read.decode("us-ascii")  # raises UnicodeDecodeError

Error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)

us-ascii is the encoding reported when I run file -I file_name in the terminal. I tried other encodings, but none of them worked. Next I want to remove punctuation and count the words in the file. Is there a way to overcome this issue?

Answer 1

It is tricky without looking at the file. However, this works most of the time:

from codecs import open

file_path = "file_name"
# Binary mode with no encoding: the contents come back as bytes, undecoded
with open(file_path, 'rb') as f:
    file_read = f.read()
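
The bytes read this way still have to be decoded before they can be used as a string. A minimal sketch of one way to do that, assuming it is acceptable to fall back to Latin-1 when strict UTF-8 decoding fails (the helper name decode_bytes and the sample bytes are made up for illustration, and the real encoding of the file is not known here):

```python
def decode_bytes(raw: bytes) -> str:
    """Decode raw bytes, trying strict UTF-8 first and falling back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00-0xff to a code point, so it never fails,
        # but it may show the wrong characters if the file uses another encoding.
        return raw.decode("latin-1")


sample = b"abc\xff def"  # made-up bytes containing 0xff
text = decode_bytes(sample)
print(repr(text))
```

The fallback never raises, which is convenient for bulk processing, but it can silently produce wrong characters, so it is worth spot-checking a few files by eye.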

Answer 2

Setting errors to 'ignore' resolved the problem, thanks N M. The code looks like:

with open(ref_file_path, 'r', encoding='ascii', errors='ignore') as ref_file:
    file_read = ref_file.read()
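
To see what the errors argument actually does to an undecodable byte, here is a small sketch on made-up sample bytes (not taken from the data set). It compares 'ignore' with 'replace' using bytes.decode, which accepts the same errors values as open:

```python
raw = b"price \xff100"  # made-up sample with one non-ASCII byte

# 'ignore' silently drops every byte that ASCII cannot decode.
ignored = raw.decode("ascii", errors="ignore")

# 'replace' substitutes U+FFFD (the replacement character) for each such byte.
replaced = raw.decode("ascii", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

With 'ignore' the offending bytes disappear without a trace, which is fine for word counting but does lose data. 'replace' keeps a visible marker where something was dropped.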

The rest of the code then treats the contents as one big string. Note that although the error was about decoding byte 0xff, the file was not UTF-16 encoded.
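
For the follow-up step from the question (removing punctuation and counting words in that one big string), a rough sketch using only the standard library might look like this. The function name count_words and the sample sentence are made up for illustration, and only ASCII punctuation from string.punctuation is handled:

```python
import string
from collections import Counter


def count_words(text: str) -> Counter:
    """Count lowercase words after turning ASCII punctuation into spaces."""
    # Map each punctuation character to a space so "word,word" still splits.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return Counter(text.lower().translate(table).split())


counts = count_words("Hello, world! Hello again.")
print(counts["hello"], counts["world"])
```

Because apostrophes count as punctuation here, a word like "don't" is split into "don" and "t". Whether that is acceptable depends on the categorisation task.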

2 answers

solution1
0 2017-10-19 03:04:45

solution2
0 2017-10-20 04:29:57
