CSV with one badly encoded column: can't pandas.read_csv

I have a CSV file that contains several columns. One of those columns is corrupted (badly encoded). The column named title has characters from all kinds of languages: French, Italian, etc.

num | ratio |  title   | ...
 1     1.2     🥶2ï
 2     2.5     djije
 3     4.1     abc
...    ...     ...

When I try to read the file with pandas.read_csv('myFile.csv'), I receive the following error:

'utf-8' codec can't decode byte 0xcf in position 3: invalid continuation byte

How can I read the CSV file with pandas, leaving the title column blank or giving it some default value if it can't be read?

If your file contains mixed encodings, you can read it into memory as binary, or as a hack, open it as Latin-1 and then decode the title field individually on each line.

If the majority of the data is encoded as UTF-8, you can attempt to decode it with

title.encode('latin-1').decode('utf-8')

but fall back and keep it in Latin-1, or replace it with some sort of error message, if decoding fails.
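
The binary route mentioned above can be sketched with the standard library alone. This is only an illustration: it decodes whole lines rather than individual fields, and it assumes a comma-separated file. The names `decode_line`, `decode_mixed_csv` and the sample bytes are made up for the example.

```python
import csv
import io

def decode_line(raw: bytes) -> str:
    """Decode one line as UTF-8, falling back to Latin-1 if that fails."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 accepts every byte, so this branch cannot raise.
        return raw.decode('latin-1')

def decode_mixed_csv(data: bytes) -> str:
    """Decode CSV bytes line by line so one bad line cannot break the rest."""
    return ''.join(decode_line(line) for line in data.splitlines(keepends=True))

# Hypothetical sample: the first data row is valid UTF-8, the second is Latin-1.
sample = b'num,ratio,title\n1,1.2,caf\xc3\xa9\n2,2.5,na\xcfve\n'

text = decode_mixed_csv(sample)
rows = list(csv.reader(io.StringIO(text)))
```

The decoded `text` could then be handed to pandas through `io.StringIO(text)` instead of a file name, since `read_csv` also accepts file-like objects.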

I'm not a Pandas person, but quick googling gets me something like

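import pandas as pd

# Latin-1 maps every byte to a character, so this read never raises a decode error.
df = pd.read_csv('myFile.csv', encoding='latin-1')

def attempt_decode(x):
    # Leave missing values (NaN) untouched; they are not strings.
    if not isinstance(x, str):
        return x
    try:
        # Recover the original bytes, then try to interpret them as UTF-8.
        return x.encode('latin-1').decode('utf-8')
    except UnicodeDecodeError:
        return 'Unable to decode: %s' % x

# The question's column is named 'title' (lowercase).
df['title'] = df['title'].apply(attempt_decode)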

Latin-1 has the unique property that every input byte corresponds exactly to the Unicode code point with the same value, so you never get a decoding error (but, obviously, you get mojibake if the correct encoding is something else and you fail to correct it).
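
That property is easy to check in plain Python. The snippet below is a small demonstration and touches no CSV file; the variable names are arbitrary.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as Latin-1 never fails and maps byte N to code point N.
text = all_bytes.decode('latin-1')
code_points = [ord(ch) for ch in text]

# Encoding back to Latin-1 restores the original bytes exactly.
round_trip = text.encode('latin-1')

# Mojibake: UTF-8 bytes read as Latin-1 look wrong but lose no information.
mojibake = 'é'.encode('utf-8').decode('latin-1')
repaired = mojibake.encode('latin-1').decode('utf-8')
```

Because the round trip is lossless, reading as Latin-1 first and repairing afterwards is safe: the original bytes can always be recovered.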

If you want to exclude the title column alone, read all the columns and drop the title column.

E.g.:

import pandas as pd

# encoding='latin-1' avoids the UnicodeDecodeError while reading.
df = pd.read_csv('filename', encoding='latin-1')

df = df.drop('title', axis=1)

To give the column a default value, use:

df['title'] = 0  # the value you want to provide by default

or use:

import numpy as np

df['title'] = np.nan

to fill the column with null values.
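
If only the unreadable titles should be blanked rather than the whole column, one possible sketch is to read the file as Latin-1 and replace just the values that fail to decode as UTF-8. It assumes a comma-separated file whose title column is mostly UTF-8; `raw` and `decode_or_nan` are hypothetical names, and the in-memory bytes stand in for the real file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the file: row 1 is valid UTF-8, row 2 is not.
raw = b'num,ratio,title\n1,1.2,caf\xc3\xa9\n2,2.5,na\xcfve\n'

# Latin-1 accepts every byte, so reading cannot raise a decode error.
df = pd.read_csv(io.BytesIO(raw), encoding='latin-1')

def decode_or_nan(value):
    """Return the UTF-8 interpretation of value, or NaN if it is not valid UTF-8."""
    if not isinstance(value, str):
        return value  # already missing, nothing to decode
    try:
        return value.encode('latin-1').decode('utf-8')
    except UnicodeDecodeError:
        return np.nan

df['title'] = df['title'].apply(decode_or_nan)
```

Swapping `np.nan` for another value, such as an empty string, gives a different default without changing the rest of the code.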

Hope that answers your question!

