CSV with one badly encoded column: pandas.read_csv fails

Question

I have a CSV file that contains several columns. One of those columns is corrupted: it is badly encoded. The column named title contains characters from many languages (French, Italian, etc.).

num | ratio |  title   | ...
 1     1.2     ðŸ¥¶2ï
 2     2.5     djije
 3     4.1     abc
...    ...     ...

When I try to read the file with pandas.read_csv('myFile.csv'), I receive the following error:

'utf-8' codec can't decode byte 0xcf in position 3: invalid continuation byte

How can I read the CSV file with pandas, leaving the title column blank or giving it a default value wherever it can't be read?

Answer 1

If your file contains mixed encodings, you can read it into memory as binary or, as a hack, open it as Latin-1 and then decode the title field individually on each line.
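The binary route can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the helper names decode_line and read_mixed_csv are hypothetical, the sample bytes are made up for illustration, and it assumes no quoted field contains an embedded newline, because the data is split into lines before it is decoded.

```python
import csv

def decode_line(line: bytes) -> str:
    """Decode one raw line as UTF-8, falling back to Latin-1 (hypothetical helper)."""
    try:
        return line.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 accepts every byte value, so this fallback cannot fail.
        return line.decode("latin-1")

def read_mixed_csv(raw: bytes) -> list:
    """Parse CSV bytes whose lines may use different encodings (hypothetical helper)."""
    # Assumption: no quoted field spans several lines.
    lines = (decode_line(line) for line in raw.splitlines(keepends=True))
    return list(csv.reader(lines))

# Made-up sample: row 1 stores "café" as UTF-8, row 2 stores it as Latin-1.
sample = b"num,title\n1,caf\xc3\xa9\n2,caf\xe9\n"
rows = read_mixed_csv(sample)
print(rows)
```

If a DataFrame is wanted afterwards, the parsed rows could be handed to pandas.DataFrame(rows[1:], columns=rows[0]).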

If the majority of the data is encoded as UTF-8, you can attempt to decode it with

title.encode('latin-1').decode('utf-8')

but fall back and keep it in Latin-1, or replace it with some sort of error message, if decoding fails.

I'm not a Pandas person, but quick googling gets me something like

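import pandas as pd

# Latin-1 maps every byte to a code point, so this read never raises a decode error.
df = pd.read_csv('myFile.csv', encoding='latin-1')

def attempt_decode(x):
    # Empty cells come back as NaN (a float); leave them untouched.
    if not isinstance(x, str):
        return x
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return x.encode('latin-1').decode('utf-8')
    except UnicodeDecodeError:
        return 'Unable to decode: %s' % x

df['title'] = df['title'].apply(attempt_decode)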

Latin-1 has the unique property that every input byte corresponds exactly to the Unicode code point with the same value, so you never get a decoding error (but, obviously, you get mojibake if the correct encoding is something else and you fail to correct it).
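That property is easy to check directly. The following is a small illustration using only built-in Python, with variable names chosen for this example: it decodes all 256 possible byte values and confirms that each one maps to the code point with the same number and survives a round trip.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as Latin-1 succeeds for any byte sequence.
text = all_bytes.decode("latin-1")

# Each byte maps to the code point with the same numeric value...
same_values = [ord(ch) for ch in text] == list(range(256))

# ...so encoding back to Latin-1 returns the original bytes unchanged.
round_trips = text.encode("latin-1") == all_bytes

print(same_values, round_trips)
```

This lossless round trip is what lets the encode('latin-1') step recover the raw bytes before a second decoding attempt.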

Answer 2

If you want to exclude only the title column, read all the columns and then drop it.

E.g.

df = pd.read_csv('filename')

df = df.drop('title', axis=1)

To give the column a default value, use:

df['title'] = 0  # (the value you want to provide by default)

or use:

import numpy as np
df['title'] = np.nan

to fill the column with null values.
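Assigning a default only helps once the file has been read, and the read is the step that raises the UnicodeDecodeError. One possible way to get a placeholder only where decoding fails is the "replace" error handler, sketched here with the standard library on made-up sample bytes; every undecodable byte becomes U+FFFD while valid UTF-8 is kept.

```python
import csv
import io

# Made-up sample: row 1 is valid UTF-8, row 2 holds a stray Latin-1 byte.
raw = b"num,title\n1,caf\xc3\xa9\n2,caf\xe9\n"

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8",
                        errors="replace", newline="")
rows = list(csv.reader(text))
print(rows)
```

pandas 1.3 and later expose the same error handlers through the encoding_errors parameter of read_csv, so pd.read_csv('myFile.csv', encoding_errors='replace') should behave similarly; on older versions that parameter is not available.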

Hope that answers your question!

2 answers

Answer 1
0 2021-08-23 07:03:25

Answer 2
-1 2021-08-23 07:06:32
