
'utf-8' codec can't decode byte 0x89

I want to read a csv file and process some columns but I keep getting issues. Stuck with the following error:

Traceback (most recent call last):
  File "C:\Users\Sven\Desktop\Python\read csv.py", line 5, in <module>
    for row in reader:
  File "C:\Python34\lib\codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 446: invalid start byte
>>> 

My Code

import csv
with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv",newline='', encoding="utf8") as f:
    reader = csv.reader(f,delimiter=';',quotechar='|')
    #print(sum(1 for row in reader))
    for row in reader:
        print(row)
        if row:
            value = row[6]
            value = value.replace('(', '')
            value = value.replace(')', '')
            value = value.replace(' ', '')
            value = value.replace('.', '')
            value = value.replace('0032', '0')
            if len(value) > 0:
                print(value + ' Length: ' + str(len(value)))

I'm a beginner with Python, tried googling, but hard to find the right solution.

Can anyone help me out?

This is the most important clue:

invalid start byte

\x89 is not, as suggested in the comments, an invalid UTF-8 byte. It is a completely valid continuation byte, meaning that if it follows the correct lead byte, it forms valid UTF-8:

http://hexutf8.com/?q=0xc90x89
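The same thing can be checked directly in Python. This is only a minimal sketch: the two-byte sequence 0xC9 0x89 from the link above decodes cleanly, while a lone 0x89 produces the same "invalid start byte" reason as in the question.

```python
# 0xC9 is a lead byte for a two-byte sequence; 0x89 is its continuation byte.
valid = b"\xc9\x89".decode("utf-8")
print(repr(valid), hex(ord(valid)))

# A lone 0x89 has no lead byte in front of it, so decoding fails.
try:
    b"\x89".decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)
```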

So either (1) you do not have UTF-8 data as you expect, or (2) you have some malformed UTF-8 data. The Python codec is simply letting you know that it encountered \x89 at a position where a start byte was expected.

(More on continuation bytes here: http://en.wikipedia.org/wiki/UTF-8#Codepage_layout )
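To tell those two cases apart, it helps to look at the bytes around the failing position. Below is a rough sketch, not a definitive tool: `describe_decode_error` is a hypothetical helper name, and the `sample` bytes are made up for illustration. With the real file you would pass in the result of `open(path, "rb").read()`.

```python
def describe_decode_error(data, encoding="utf-8", context=10):
    """Return None if data decodes cleanly, else details about the first bad byte."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        return {
            "position": exc.start,            # offset of the offending byte
            "byte": hex(data[exc.start]),     # the byte value itself
            "reason": exc.reason,             # e.g. "invalid start byte"
            "context": data[start:exc.start + context],  # surrounding raw bytes
        }
    return None

# Made-up stand-in bytes; a real check would read the file in binary mode.
sample = "naam;telefoon\n".encode("utf-8") + b"caf\x89;0032 2 123\n"
info = describe_decode_error(sample)
print(info)
```

If the surrounding bytes look like ordinary text with one odd character, the file is probably in a different single-byte encoding. If they look like binary noise, the file is probably not text at all.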

The first byte of a .PNG file is 0x89. I'm not saying that is your problem, but the .PNG header is specifically designed so that it is NOT accidentally interpreted as text.
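A quick way to rule this out is to compare the start of the file with the 8-byte PNG signature. This is a small sketch: `looks_like_png` is a hypothetical helper, and the byte strings below are stand-ins rather than real files.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the fixed 8-byte PNG file signature

def looks_like_png(first_bytes):
    """Return True if the given leading bytes match the PNG signature."""
    return first_bytes[:8] == PNG_SIGNATURE

# Stand-in data for illustration; for a real file, pass open(path, "rb").read(8).
fake_png = PNG_SIGNATURE + b"\x00\x00\x00\rIHDR"
fake_csv = b"naam;telefoon\r\n"
print(looks_like_png(fake_png), looks_like_png(fake_csv))
```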

Why you would have a .csv file that is actually a .png I don't know. But it definitely could happen if someone accidentally renamed the file. On Windows 10, every once in a while I mass-rename files by accident because of their stupid checkbox feature. Why Microsoft decided that giving desktop machines the same UI controls as tablets was a good idea... I don't know.

I was getting a similar error when trying to read or upload the following kinds of files:

  1. CSV File
  2. JPEG File
  3. PNG File
  4. Zip File

The best way to avoid errors like:

  1. 'utf-8' codec can't decode byte 0x89
  2. 'utf-8' codec can't decode byte 0xff

is to read these files as bytes. When you treat them as bytes, you do not need to provide any encoding. So when you open them you should specify binary mode:

with open(file_path, 'rb') as file:
    data = file.read()

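Or in your case, the code should be something like the following. The mode has to come before any keyword arguments, and in Python 3 csv.reader expects text rather than bytes, so the byte stream is wrapped in a decoder here. latin-1 is only an assumption that accepts any byte; use the file's real encoding if you know it:

import csv
import io

with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv", 'rb') as f:
    # csv.reader needs str in Python 3, so decode the bytes while reading.
    text = io.TextIOWrapper(f, encoding='latin-1', newline='')
    reader = csv.reader(text, delimiter=';', quotechar='|')
    for row in reader:
        print(row)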
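A slightly fuller sketch of the read-as-bytes idea: read the raw bytes, try UTF-8 first, and fall back to another encoding before handing the text to csv.reader. Everything here is an assumption for illustration. `read_csv_rows` is a hypothetical helper, cp1252 is only a guess at what a Windows export might use (the real encoding of the file in the question is unknown), and the sample bytes are made up.

```python
import csv
import io

def read_csv_rows(raw, encodings=("utf-8", "cp1252"), delimiter=";", quotechar="|"):
    """Decode raw bytes with the first encoding that works, then parse as CSV."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        # latin-1 maps every byte to a character, so this cannot fail,
        # but the result may contain the wrong characters.
        text = raw.decode("latin-1")
    buffer = io.StringIO(text, newline="")
    return list(csv.reader(buffer, delimiter=delimiter, quotechar=quotechar))

# Made-up bytes: 0x89 is not valid UTF-8 here, but is the per mille sign in cp1252.
raw = b"naam;promille\r\nSven;5\x89\r\n"
rows = read_csv_rows(raw)
print(rows)
```

For a real file, the bytes would come from `open(path, "rb").read()`. A fallback like this can hide a wrong guess, so it is worth checking a few decoded rows by eye.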
