Pandas read_csv - Error tokenizing data after modifying a .csv file in Excel

Question

I have a CSV dataset for an ML classifier. It has 2 columns and looks like this:

But this dataset is very dirty, so I decided to open it in Excel, remove the "dirty" words, save it as a new CSV file, and train my ML classifier on it.

But after I saved it in Excel (using the , separator, and I also tried , with UTF-8), calling pd.read_csv on it gives me this error:

Error tokenizing data. C error: Expected 3 fields in line 4, saw 5

Then I tried sep=';' with read_csv, and it worked, but now all Russian characters are replaced with strange symbols:

Can somebody please explain how to recover the Russian characters from these "question mark" symbols? Passing encoding='UTF-8' gives this error:

'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte
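For context, the message above can be reproduced without pandas. The sketch below uses a made-up two-byte input (the byte 0xe6 from the error followed by an ASCII letter), not the actual file. UTF-8 treats 0xe6 as the start of a multi-byte sequence, while single-byte Cyrillic code pages map it to a letter. cp1251 is included only for comparison and is an assumption; the question does not say which code page Excel used.

```python
# Hypothetical input: the byte from the error message followed by an ASCII letter.
raw = b"\xe6a"

# In UTF-8, 0xe6 announces a multi-byte sequence, so the next byte must be a
# continuation byte; an ASCII letter is not one.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

print("UTF-8:", utf8_error)

# Single-byte code pages decode the same byte to exactly one character.
decoded = {name: b"\xe6".decode(name) for name in ("cp1251", "ptcp154", "kz1048")}
for name, char in decoded.items():
    print(name, repr(char))
```

A file that fails this way is most likely not UTF-8 at all, so the fix is to find the code page it was actually saved in rather than to force encoding='UTF-8'.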

This is what the first file looks like (the .csv file not modified in Excel):

When I open the second file (modified):

Answer 1

Try opening the file with either the ptcp154 or the kz1048 encoding. They seem to work.
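A minimal sketch of that suggestion, combined with the sep=';' that already worked in the question. The sample bytes, column names, and rows are made up for illustration; with the real file, the path would be passed to pd.read_csv instead of the in-memory buffer. Which of the two encodings fits the actual data is something to check by looking at the decoded text.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Excel-saved file: semicolon-separated and
# stored in a single-byte Cyrillic code page rather than UTF-8.
raw = "text;label\nхорошая погода;1\nплохой день;0\n".encode("ptcp154")

# Pass both the separator and the encoding explicitly.
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="ptcp154")

print(df)
```

If the Russian text still looks wrong with ptcp154, swapping in encoding="kz1048" is the other option named above.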

Answer 1: accepted, 2021-11-20 15:49:53
