
Pandas read_csv - Error tokenizing data after modifying Excel .csv file

I have a CSV dataset for an ML classifier. It has 2 columns and looks like this:

[image: enter image description here]

But this dataset is very dirty, so I decided to open it in Excel, remove the "dirty" words, save it as a new CSV file, and train my ML classifier on that.

After I saved it in Excel (with a comma separator; I also tried the UTF-8 option), calling pd.read_csv on the new file gives me this error:

Error tokenizing data. C error: Expected 3 fields in line 4, saw 5

Then I tried sep=';' with read_csv, and it worked, but now all Russian characters are replaced with strange symbols:

[image: enter image description here]

Can somebody please explain how to restore the Russian characters that now show up as question-mark symbols? Passing encoding='UTF-8' gives this error:

'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte

This is what the first file looks like (the original .csv file, not modified in Excel):

[image: enter image description here]

When I open the second file (modified):

[image: enter image description here]

Try opening the file with either the ptcp154 or the kz1048 encoding. Both seem to work.
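As a rough sketch of that suggestion, the snippet below passes both the separator and the encoding to read_csv. The sample data and the column names text and label are made up for illustration, and the bytes are built in memory so the example runs on its own; the real file may differ. It also shows that the same bytes fail to decode as UTF-8, which matches the error in the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Excel-exported file: semicolon-separated
# and stored in a single-byte Cyrillic encoding (column names are invented).
sample_text = "text;label\nпривет мир;1\nхорошо;0\n"
raw_bytes = sample_text.encode("ptcp154")

# The same bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Tell pandas explicitly which separator and encoding to use.
df = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="ptcp154")
print(df)
```

For the actual file, replace io.BytesIO(raw_bytes) with the file path. If neither encoding from the answer fits, cp1251 (the Windows Cyrillic code page) may be worth trying as well; that is an assumption on my part, not something the post confirms.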
