UnicodeDecodeError ('utf-8') for pandas read_csv from folder nested Zip File

I currently have a zip file which contains a list of N folders, each containing 1+ .csv files. I am looking to simply read in a selection of these .csv files from the zip and use pandas to create a list of DataFrames.

I've done this successfully the 'manual' way, where I unzip the files locally and just read in the individual .csv files.

However, when I use the zipfile approach, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position ****: invalid start byte
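For context, here is a minimal sketch of what this message means, using a made-up byte string rather than the real data. In UTF-8 the byte 0xab can only appear as a continuation byte, so it is rejected when it shows up where a new character should start. The same byte is a valid character in single-byte encodings such as cp1252; whether cp1252 is the file's true encoding is an assumption here.

```python
# Made-up sample: ASCII text followed by the byte 0xab from the error message.
raw = b"caf\xab"

try:
    raw.decode("utf-8")
    utf8_ok = True
    reason = None
except UnicodeDecodeError as err:
    # 0xab is only valid as a UTF-8 continuation byte, never as a start byte.
    utf8_ok = False
    reason = err.reason

# In cp1252 every byte maps to at most one character; 0xab is a guillemet.
decoded = raw.decode("cp1252")

print(utf8_ok, reason, decoded)
```

A successful decode with a single-byte codec only shows that the bytes are legal in that codec, not that the text is what the author intended, so it is worth checking the resulting characters by eye.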

I thought this would be a straightforward task, but I seem to be missing a step. My code is below. I suspect the issue is rooted in how zipfile unpacks the documents compared with macOS (technically The Unarchiver). I generated a test zip file and successfully got a pandas DataFrame out of it; I'm just getting mixed up on how to achieve the same result with the 'real' data.

Sadly I am not able to post the original data in question here.

import pandas as pd
from zipfile import ZipFile

# Sample loader for testing
sample_path = "Sample_ZipFile.zip"
with ZipFile(sample_path) as zipfiles:
    # Collect the names of all .csv members, including those inside folders
    sample_file_names = [file.filename for file in zipfiles.infolist() if file.filename[-4:]=='.csv']
    # Open the first CSV as a binary file-like object and parse it
    data = zipfiles.open(sample_file_names[0])
    testdat = pd.read_csv(data,dtype='str',index_col=False)

After some frustrated searching the next morning, I eventually stumbled across a similar problem on the pandas GitHub page, which you can look at here.

It seems to come down to a difference in how Google Colab and Jupyter handle pd.read_csv (and pd.to_csv).

For anyone stumbling across the same error, I managed to get through the problem using:

  1. Adding engine='python' to pd.read_csv()
  2. OR adding encoding='cp1252', which a colleague suggested.
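As a rough sketch of the second option, the snippet below builds a tiny in-memory zip (the folder and file names are made up for illustration), then reads every .csv member into a list of DataFrames with encoding='cp1252'. It assumes the real files are actually cp1252-encoded, which I could not confirm beyond the fact that they load.

```python
import io
from zipfile import ZipFile

import pandas as pd

# Build a small in-memory zip with one CSV inside a folder, encoded as cp1252.
# "folder_a/quotes.csv" is a hypothetical name used only for this example.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("folder_a/quotes.csv", "id,text\n1,«hello»\n".encode("cp1252"))
buffer.seek(0)

frames = []
with ZipFile(buffer) as zf:
    # namelist() returns full member paths, so nested folders are covered.
    csv_names = [name for name in zf.namelist() if name.lower().endswith(".csv")]
    for name in csv_names:
        with zf.open(name) as handle:
            # Passing the encoding explicitly avoids the default utf-8 decode.
            frames.append(
                pd.read_csv(handle, dtype=str, index_col=False, encoding="cp1252")
            )

print(frames[0])
```

With the real data, the in-memory buffer would be replaced by the path to the zip file; the loop over csv_names stays the same.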

I am assuming I was just lucky in my Jupyter Notebooks up until now in not seeing any encoding bugs. But I hope this answer helps anyone who might get as stuck as I did...

