Pandas read_csv filepath with special characters codec can't decode

I am using Python version 3.5.3 and Pandas version 0.20.1

I use read_csv to read in csv files. I use a file pointer according to this post (I prefer this over the solution using _enablelegacywindowsfsencoding() ). The following code works:

import pandas as pd

with open("C:/Desktop/folder/myfile.csv") as fp:
    df = pd.read_csv(fp, sep=";", encoding="latin")

However, it fails when the path contains a special character such as Ä, as in the following:

import pandas as pd

with open("C:/Desktop/folderÄ/myfile.csv") as fp:
    df = pd.read_csv(fp, sep=";", encoding="latin")

Python displays an error message: (unicode error) 'utf-8' codec can't decode byte 0xc4 in position 0: unexpected end of data.

I also tried adding an 'r' prefix to the filepath string, but I get the same error message, except that the reported position is now an integer that is exactly where the special character sits in the filepath.

So the cause is the special character in the filepath.

(This is not a decoding error that can be solved by passing encoding="utf-8" or another encoding such as ISO-8859-1. To be absolutely sure, I tried the following encodings and always got the same error message: utf-8, ISO-8859-1, cp1252.)

The error indicates that your source file (not the data file) is not encoded in UTF-8. In Python 3, a source file must either be saved in UTF-8 or declare its actual encoding with a special comment on the first or second line, e.g. #coding=Windows-1252. The byte \xc4 is how Windows-1252 encodes Ä, and Windows-1252 is the default encoding on Western European and US Windows, so it's a good guess. Ideally, re-save your source in UTF-8.

For example, if the source is Windows-1252-encoded and the data file is GB2312-encoded (Chinese):

#coding=Windows-1252                         # encoding of source file
import pandas as pd
with open('DÄTÄ.csv',encoding='gb2312') as f:  # encoding of data file
    data = pd.read_csv(f)
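
The byte named in the error message can be checked directly. The following is a minimal sketch, not part of the original code; the variable names are made up for illustration, and Ä is written as the escape \u00c4 so that the snippet does not depend on how it is saved:

```python
# "\u00c4" is Ä, written as an escape so the snippet is independent of
# the encoding this source file happens to be saved in.
raw = "\u00c4".encode("cp1252")   # cp1252 is Python's name for Windows-1252
print(raw)                        # the single byte 0xC4

# Decoding that lone byte as UTF-8 fails: 0xC4 announces a two-byte
# sequence, but no second byte follows.
try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(exc)

# Decoding with the encoding the byte was actually written in works.
print(raw.decode("cp1252"))
```

This mirrors what the Python parser runs into when it reads a Windows-1252-encoded source file as UTF-8.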

Note that source files default to UTF-8 encoding, but open defaults to the encoding returned by locale.getpreferredencoding(False). Since that varies with OS and configuration, it is best to always specify the encoding when opening files.
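
As a rough illustration of that point, the sketch below prints the default that open would use on the current machine and then reads a small file with an explicit encoding. The temporary file, its contents, and the latin-1 encoding are assumptions chosen for the example, and only the standard library is used; the opened file object could be handed to pd.read_csv(fp, sep=";") in the same way as above.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given; this differs
# between operating systems and configurations.
default_encoding = locale.getpreferredencoding(False)
print("open() default encoding:", default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical data file, written in latin-1 for the example.
    path = os.path.join(tmp, "myfile.csv")
    with open(path, "w", encoding="latin-1", newline="") as fp:
        fp.write("name;value\nK\u00f6ln;1\n")

    # Stating the encoding explicitly gives the same result on every
    # machine, whatever default_encoding happens to be.
    with open(path, encoding="latin-1", newline="") as fp:
        text = fp.read()

print(text.splitlines()[1])
```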

Try using a Unicode file path, u'path/to/files', for example:

import pandas as pd

with open(u'C:/Desktop/folderÄ/myfile.csv') as fp:
    df = pd.read_csv(fp, sep=";", encoding="latin")

