I am using the well-known 20 Newsgroups data set for text categorisation in Jupyter. When I try to open and read a file on my Mac, it fails at the decoding step. Reading the file in binary mode works, but I need to work with the contents as a string. When I try to decode the bytes, it fails with the error below.
Code
with open(file_path, 'rb') as f:
    file_read = f.read()
file_read.decode("us-ascii")
Error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)
us-ascii is the file encoding reported when I run this in the terminal: file -I file_name. I tried other encodings, but none of them worked. I also want to remove punctuation and count the words in the file. Is there a way to overcome this issue?
It is tricky to say without looking at the file. However, this works most of the time:
from codecs import open
file_path = "file_name"
with open(file_path, 'rb') as f:
    file_read = f.read()
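Once the file has been read as bytes, it can help to look at the bytes that the ascii codec rejects before choosing an encoding or an error handler. The following is only a minimal sketch: the helper name find_non_ascii and the sample bytes are made up for illustration and are not taken from the data set.

```python
def find_non_ascii(data: bytes, context: int = 10):
    """Return (position, byte value, surrounding bytes) for every byte above 0x7F.

    Hypothetical helper for illustration; `context` is how many bytes to show
    on each side of the offending byte.
    """
    hits = []
    for pos, byte in enumerate(data):
        if byte > 0x7F:  # anything above 127 is outside the ASCII range
            start = max(0, pos - context)
            hits.append((pos, byte, data[start:pos + context + 1]))
    return hits


# Made-up sample standing in for the bytes read from the file.
sample = b"From: someone\nSubject: test \xff here\n"
for pos, byte, around in find_non_ascii(sample):
    print(pos, hex(byte), around)
```

If only a handful of stray bytes show up, ignoring or replacing them is usually harmless for word counting. If many show up, the file is probably in a different encoding altogether.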
Setting errors to 'ignore' resolved the problem, thanks N M. The code looks like this:
with open(ref_file_path, 'r', encoding='ascii', errors='ignore') as ref_file:
    file_read = ref_file.read()
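The rest of the code then treats the contents as one big string. Note that although the error was about decoding byte 0xff, the file was not UTF-16 encoded. Keep in mind that errors='ignore' silently drops every byte that cannot be decoded.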
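For the remaining goal from the question, removing punctuation and counting words, a rough sketch using only the standard library could look like the one below. The function name count_words, the lowercasing step, and the sample bytes are assumptions made for illustration. Punctuation is deleted rather than replaced by a space, so a word like "don't" becomes "dont".

```python
import string
from collections import Counter


def count_words(raw: bytes) -> Counter:
    """Decode bytes as ASCII (dropping undecodable bytes) and count words.

    Hypothetical helper; lowercasing is an assumption, not something the
    original post asked for.
    """
    text = raw.decode("ascii", errors="ignore")
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).lower().split()
    return Counter(words)


# Made-up sample containing a stray 0xff byte, like the one in the error.
sample = b"Hello, world! Hello\xff again."
counts = count_words(sample)
print(counts.most_common(3))
```

With a real file, the bytes returned by f.read() in binary mode can be passed straight to count_words.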