
Error in reading csv file: 'utf-8' codec can't decode

While running code to merge (essentially an inner join) two CSV files, I get an error while reading one of the CSV files. My code:

import pandas as pd

s1 = pd.read_csv(".../noun.csv")
s2 = pd.read_csv(".../verb.csv")
merged = s1.merge(s2, on=("userID", "sentID"), how="inner")
merged.to_excel(".../merge1.xlsx", index=False)

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 5: invalid start byte

An example of my content:

verb file

userID  sentID  verb
['3477'  1     ['am', 'were', 'having', 'attended', 'stopped']
['3477'  2     ['felt', 'thrusting']

noun file

userID  sentID  Sentences
['3477'   1    Thursday,
['3477'   1    November

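The file is probably not encoded as UTF-8; byte 0x92 is, for example, the right single quotation mark (’) in Windows-1252. You can use a library that attempts to detect the encoding, for example cchardet: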

pip install cchardet

If you use Python 2.x, you also need a backport of the csv library. The backport supports Unicode natively, while Python 2's csv does not:

pip install backports.csv

Then in your code you can do something like this:

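import cchardet
import io
from backports import csv  # Python 2 only; on Python 3 use "import csv" instead

# detect encoding from the raw bytes (filename is the path to your CSV file)
with io.open(filename, mode="rb") as f:
    data = f.read()
detect = cchardet.detect(data)
encoding_ = detect['encoding']  # a best guess, not a guarantee
# retrieve data, decoding with the detected encoding
with io.open(filename, encoding=encoding_) as csvfile:
    reader = csv.reader(csvfile, ...)
...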

I don't know pandas, but you can do something like this:

# retrieve data
s1 = pd.read_csv(".../noun.csv", encoding=encoding_)
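
If installing a detection library is not an option, a simpler alternative (not the cchardet approach above) is to try a short list of likely encodings in order. The sketch below uses only the standard library. The helper name pick_encoding, the candidate list, and the sample bytes are all made up for illustration; the guess that the real file is Windows-1252 is only an assumption based on the 0x92 byte in the error message.

```python
def pick_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes raw without error.

    Hypothetical helper: the candidate list is an assumption, adjust it
    to the encodings your files are likely to use.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    # latin-1 maps every byte to a code point, so decoding never fails,
    # although the resulting characters may be wrong
    return "latin-1"


# made-up sample containing byte 0x92, which is invalid as a UTF-8 start byte
sample = b"userID,sentID,Sentences\n3477,1,It\x92s Thursday\n"

encoding_ = pick_encoding(sample)
print(encoding_)
print(sample.decode(encoding_))
```

For a real file, read its bytes with open(path, "rb") and pass them to the helper, then hand the result to pandas as in the line above: pd.read_csv(".../noun.csv", encoding=encoding_). A successful decode does not prove the encoding is correct, so it is worth checking a few rows that contain non-ASCII characters.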
