
Encoding error when reading csv file containing pandas dataframe

I used df.to_csv() to write a DataFrame to a CSV file. The pandas docs state that under Python 3 it defaults to UTF-8 encoding.

However, when I run pd.read_csv() on the same file, I get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 8: invalid start byte

But using pd.read_csv() with encoding="ISO-8859-1" works.

What is the issue here and how do I resolve it so I can write and load files with consistent encoding?

The original .csv you are trying to read is encoded in something other than UTF-8, e.g. ISO-8859-1. That is why you get a UnicodeDecodeError: Python / pandas tries to decode the file with the utf-8 codec, assuming by default that the source is UTF-8 encoded.

Once you indicate the non-default source encoding, pandas will use the proper codec to match the source and decode into the format used internally.
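The mismatch can be reproduced without pandas. This is only a minimal sketch: the sample row is made up, chosen so that it contains the 0xae byte from the question's traceback ("®" in ISO-8859-1).

```python
# Hypothetical sample row; "®" is the single byte 0xAE in ISO-8859-1.
raw = "id,name\n1,Acme®\n".encode("iso-8859-1")

# Decoding with the wrong codec fails at the first non-ASCII byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding with the codec the bytes were written in succeeds.
text = raw.decode("iso-8859-1")

print(utf8_error)
print(text)
```

pd.read_csv(path, encoding="ISO-8859-1") does the same thing at the file level: it selects the codec used to turn the file's bytes into text.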

See the Python docs for more.

Try reading the data with encoding='unicode_escape'.

Here is a concrete example of pandas using some unknown(?) encoding when the encoding parameter is not passed explicitly to DataFrame.to_csv.

0x92 is ’ (the right single quotation mark, U+2019, in Windows-1252; it looks like an apostrophe).

import pandas

ERRORFILE = r'written_without_encoding_parameter.csv'
NO_ERRORFILE = r'written_WITH_encoding_parameter.csv'

# The second string contains a typographic apostrophe (U+2019), which is not ASCII.
df_dummy = pandas.DataFrame(["Yo what's up", "I like your sister’s friend"])

df_dummy.to_csv(ERRORFILE)                       # no encoding argument
df_dummy.to_csv(NO_ERRORFILE, encoding="utf-8")  # explicit encoding

df_no_error_with_latin = pandas.read_csv(ERRORFILE, encoding="Latin-1")  # works
df_no_error = pandas.read_csv(NO_ERRORFILE)                              # works
df_error = pandas.read_csv(ERRORFILE)                                    # fails
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

So it looks like you have to pass encoding="utf-8" to to_csv explicitly, even though the pandas docs say that is the default. Otherwise, pass encoding="Latin-1" to read_csv.
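As a rough sketch of the "be explicit on both sides" approach: the column name, file name and temporary directory below are made up for illustration, and any codec would do as long as the writer and the reader agree on it.

```python
import os
import tempfile

import pandas as pd

ENCODING = "utf-8"  # the point is only that both calls use the same value

# Same two strings as above; "text" is a hypothetical column name.
df = pd.DataFrame({"text": ["Yo what's up", "I like your sister’s friend"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")  # hypothetical file name
    df.to_csv(path, index=False, encoding=ENCODING)   # write with a stated encoding
    restored = pd.read_csv(path, encoding=ENCODING)   # read with the same one

# Compare the round-tripped strings with the originals.
print(restored["text"].tolist() == df["text"].tolist())
```

Passing the encoding in both calls means the result does not depend on whatever default the installed pandas version or the platform happens to pick.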

Even more frustrating...

df_error_even_with_utf8 = pandas.read_csv(ERRORFILE, encoding="utf-8")  # also fails
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

I am using Windows 7, Python 3.5, pandas 0.19.2.
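The 0x92 byte in both tracebacks is consistent with the first file having been written in Windows-1252 (cp1252), the usual default code page on Western-locale Windows. That is an assumption about this particular setup rather than something confirmed here, but the byte values themselves can be checked with the standard library alone:

```python
quote = "\u2019"  # right single quotation mark, as in "sister’s"

cp1252_bytes = quote.encode("cp1252")  # a single byte
utf8_bytes = quote.encode("utf-8")     # three bytes

# The cp1252 byte is not a valid start byte in UTF-8, hence the error.
try:
    cp1252_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Latin-1 maps every byte to a code point, so it never raises,
# but 0x92 comes back as the control character U+0092, not as the quote.
latin1_char = cp1252_bytes.decode("latin-1")

print(cp1252_bytes, utf8_bytes, decodes_as_utf8, repr(latin1_char))
```

If that assumption holds, reading the file with encoding="cp1252" would recover the original apostrophe, whereas encoding="Latin-1" only avoids the exception.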
