Encoding error when reading a CSV file containing a pandas DataFrame

Question

I used df.to_csv() to write a DataFrame to a CSV file. Under Python 3, the pandas docs state that it defaults to utf-8 encoding.

However, when I run pd.read_csv() on the same file, I get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 8: invalid start byte

But using pd.read_csv() with encoding="ISO-8859-1" works.

What is the issue here, and how do I resolve it so I can write and load files with a consistent encoding?

Answer 1

The original .csv you are trying to read is encoded in something other than UTF-8, e.g. ISO-8859-1. That's why you get a UnicodeDecodeError: Python/pandas tries to decode the source with the utf-8 codec, assuming by default that the source is UTF-8.

Once you indicate the non-default source encoding, pandas will use the proper codec to match the source and decode it into the format used internally.
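To see why the codec choice matters, here is a small sketch using only the standard library. The sample bytes are made up for illustration; only the 0xae byte comes from the error message in the question.

```python
# Hypothetical sample bytes; 0xAE is the registered sign in ISO-8859-1.
raw = b"Acme\xae"

# ISO-8859-1 maps every byte to a code point, so this decode cannot fail.
latin1_text = raw.decode("iso-8859-1")
print(latin1_text)  # Acme®

# UTF-8 rejects 0xAE as the first byte of a character.
utf8_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc
    print(exc)

# The same character stored as UTF-8 takes two bytes.
print("®".encode("utf-8"))  # b'\xc2\xae'
```

A lone 0xae byte is valid in ISO-8859-1 but can never start a UTF-8 character, which is consistent with the "invalid start byte" message.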

See the Python docs and more here. This is also very good.

Answer 2

Try reading the data with encoding='unicode_escape'.
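As a rough illustration of what that codec does, the sketch below decodes the same bytes with latin-1 and with unicode_escape, using only the standard library. The sample bytes are hypothetical. When decoding, unicode_escape treats non-escape bytes as Latin-1 and also interprets backslash sequences, so data containing literal backslashes may come out changed.

```python
# Hypothetical field bytes: a Latin-1 'é' plus literal backslashes.
raw = b"caf\xe9,C:\\new\\table"

as_latin1 = raw.decode("latin-1")
as_escape = raw.decode("unicode_escape")

print(repr(as_latin1))  # backslashes are kept as written
print(repr(as_escape))  # \n and \t have become a newline and a tab
```

If the file is really Latin-1 or Windows-1252, naming that encoding directly avoids the backslash side effect.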

Answer 3

Here is a concrete example of pandas using some unknown(?) encoding when to_csv is called without an explicit encoding parameter.

0x92 is ’ (the right single quotation mark in Windows-1252; it looks like an apostrophe)
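A quick standard-library check of that byte, as a sketch: it compares how 0x92 decodes under Windows-1252 (cp1252) and under Latin-1. That the file in this answer was written as cp1252 is my assumption, based on the byte value.

```python
byte = bytes([0x92])

cp1252_char = byte.decode("cp1252")   # the curly apostrophe, U+2019
latin1_char = byte.decode("latin-1")  # a C1 control character, U+0092

print(repr(cp1252_char))         # '’'
print(repr(latin1_char))         # '\x92'
print("\u2019".encode("utf-8"))  # b'\xe2\x80\x99'
```

So reading such a file with Latin-1 avoids the exception but yields a control character where the apostrophe should be, whereas cp1252 recovers the original character.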

import pandas

ERRORFILE = r'written_without_encoding_parameter.csv'
NO_ERRORFILE = r'written_WITH_encoding_parameter.csv'

# The second string contains U+2019 (’), which is not ASCII.
df_dummy = pandas.DataFrame([u"Yo what's up", u"I like your sister’s friend"])

df_dummy.to_csv(ERRORFILE)                       # no encoding argument
df_dummy.to_csv(NO_ERRORFILE, encoding="utf-8")  # explicit UTF-8

df_no_error_with_latin = pandas.read_csv(ERRORFILE, encoding="Latin-1")
df_no_error = pandas.read_csv(NO_ERRORFILE)

df_error = pandas.read_csv(ERRORFILE)
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

So it looks like you have to explicitly pass encoding="utf-8" to to_csv, even though the pandas docs say it is the default. Or pass encoding="Latin-1" to read_csv.
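A minimal round-trip sketch of the first option, with the encoding stated on both the write and the read so that neither call relies on a default. The column name and file name are placeholders, and the file lives in a temporary directory so the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

# "text" is a placeholder column name; the second value contains U+2019.
df = pd.DataFrame({"text": ["Yo what's up", "I like your sister\u2019s friend"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")  # hypothetical file name
    # Same explicit encoding for writing and reading.
    df.to_csv(path, index=False, encoding="utf-8")
    df_back = pd.read_csv(path, encoding="utf-8")

print(df_back["text"].tolist())
```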

Even more frustrating...

df_error_even_with_utf8 = pandas.read_csv(ERRORFILE, encoding="utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

I am using Windows 7, Python 3.5, pandas 0.19.2.
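One possible explanation, which is an assumption on my part and not something confirmed above, is that the file was written in the platform's locale encoding (often cp1252 on Windows), which is what Python's open() uses when no encoding is given. The sketch below prints that locale encoding and uses a hypothetical helper, first_working_encoding, to try a few candidate codecs against raw bytes. The sample bytes stand in for the contents of the problem file.

```python
import locale


def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes raw without error (hypothetical helper)."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Stand-in for bytes read with open(path, "rb").read()
sample = "1,I like your sister\u2019s friend\n".encode("cp1252")

guess = first_working_encoding(sample)
print(guess)  # cp1252

# The encoding open() falls back to on this machine when none is given.
print(locale.getpreferredencoding(False))
```

A clean decode does not prove the guess is right: latin-1 accepts any byte sequence, which is why it is tried last.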

Answer 1: score 2, 2016-05-11 06:38:13

Answer 2: score 1, 2019-02-01 07:25:24

Answer 3: score 0, accepted, 2017-05-29 02:19:52
