Pandas read_csv - Error tokenizing data after modifying Excel .csv file

Question

I have a CSV dataset for an ML classifier. It has 2 columns and looks like this:

The dataset is very dirty, so I decided to open it in Excel, remove the "dirty" words, save it as a new CSV file, and train my ML classifier on that.

But after I saved it in Excel (using the , separator, and I also tried the UTF-8 option), pd.read_csv on the new file gives me this error:

Error tokenizing data. C error: Expected 3 fields in line 4, saw 5

Then I tried sep=';' with read_csv, and it worked, but now all Russian characters are replaced with strange symbols:

Can somebody please explain how to restore the Russian characters that now show up as question-mark symbols? Passing encoding='UTF-8' gives this error:

'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte

This is what the first file looks like (the original .csv file, not modified in Excel):

This is what the second (modified) file looks like when I open it:

Answer 1

Try opening the file with either the ptcp154 or the kz1048 encoding. Both seem to work.
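As a rough way to check which encodings can decode the file at all, something like the sketch below might help. It uses only the standard library; the function names decodable_encodings and read_rows and the sample bytes are made up for illustration, and cp1251 is an extra candidate assumed here (it is not mentioned above). The sample also shows why a comma inside a text field can confuse the parser when the real separator is ;.

```python
import csv
import io

# Candidate encodings: the two named in the answer, utf-8 for comparison,
# and cp1251 as an assumed extra candidate (not from the answer).
CANDIDATES = ["utf-8", "ptcp154", "kz1048", "cp1251"]


def decodable_encodings(raw, candidates=CANDIDATES):
    """Return the candidate encodings that decode raw bytes without an error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


def read_rows(raw, encoding, sep=";"):
    """Decode raw bytes and split the text into rows on the given separator."""
    text = raw.decode(encoding)
    return list(csv.reader(io.StringIO(text), delimiter=sep))


# Hypothetical stand-in for the Excel-saved file: ';' as separator,
# a comma inside the text column, single-byte Cyrillic encoding.
sample = "текст;метка\r\nпривет, мир;1\r\n".encode("ptcp154")

print(decodable_encodings(sample))
print(read_rows(sample, "ptcp154"))
```

Decoding without an error does not prove an encoding is the right one, so the decoded text still needs a visual check. With pandas, the equivalent call would be pd.read_csv(path, sep=';', encoding='ptcp154'), where path stands for your own file.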

Answer 1: accepted, 2021-11-20 15:49:53
