Pandas read_csv - Error tokenizing data after modifying Excel .csv file
I have a CSV dataset for an ML classifier. It has 2 columns and looks like this:
But this dataset is very dirty, so I decided to open it in Excel, remove the "dirty" words, save it as a new CSV file, and train my ML classifier on that.
But after I saved it in Excel (using the , separator, and I also tried , UTF-8), pd.read_csv on the new file gives me this error:
Error tokenizing data. C error: Expected 3 fields in line 4, saw 5
Then I tried sep=';' with read_csv, and it worked, but now all Russian characters are replaced with strange symbols:
Can somebody please explain how to restore the Russian characters that now show up as "question" symbols? Passing encoding='UTF-8' gives this error:
'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte
This is what the first file looks like (the original .csv file, not modified in Excel):
When I open the second file (modified):
Try opening the file with either the ptcp154 or the kz1048 encoding. They seem to work.
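A minimal sketch of that suggestion, assuming the Excel-saved file uses ; as its separator (as the question found with sep=';'). The sample bytes and the column names text and label are hypothetical, built in memory so the snippet runs on its own. With the real file, pass its path to pd.read_csv instead of the BytesIO object.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Excel-saved file: two columns, ';' separator,
# Cyrillic text stored in a single-byte encoding rather than UTF-8.
sample_text = "text;label\nпривет мир;1\nхорошая погода;0\n"
raw = sample_text.encode("ptcp154")

# Pass both the separator and the encoding explicitly.
# "kz1048" is the other encoding suggested above and can be tried the same way.
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="ptcp154")

print(df)
```

If the Russian text still looks wrong, the encoding guess is probably not the one Excel actually used, so it is worth checking which encoding Excel applies when saving on that machine.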