
Python - can't parse utf8 csv

I tried to use the csv module to parse a CSV file, but it does not handle UTF-8 encoded data.

So I tried these helper functions, which are suggested in the documentation:

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

But if I try to use it like this:

with open(u'spam1.csv', 'rb') as csvfile:
    spamreader = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        print row

I get this error:

yield line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 15: ordinal not in range(128)

However, LibreOffice opens the same CSV file as UTF-8 without any problems.

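The code is meant to be used on unicode values; that is, you need to decode your data to unicode before passing it to the replacement reader. Opening the file in 'rb' mode yields byte strings, and calling .encode('utf-8') on a byte string makes Python 2 first decode it implicitly as ASCII, which fails on the first non-ASCII byte (0xc2 here).

Use io.open() to read the data as Unicode: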
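To illustrate why the traceback mentions the 'ascii' codec, here is a minimal sketch of the implicit step. It is written so that it also runs on Python 3, where the decode has to be spelled out explicitly; the helper name encode_like_python2_str and the sample data are hypothetical, not part of the original code.

```python
# Bytes as they come out of a file opened in 'rb' mode.
# The pound sign is encoded as 0xc2 0xa3 in UTF-8.
raw_line = u"spam,\u00a33\n".encode("utf-8")


def encode_like_python2_str(byte_line):
    # Hypothetical helper: spells out the ASCII decode that Python 2
    # performs silently when .encode() is called on a byte string.
    return byte_line.decode("ascii").encode("utf-8")


try:
    encode_like_python2_str(raw_line)
    error_name, error_codec = None, None
except UnicodeDecodeError as exc:
    # The failure comes from the ASCII decode, not from the UTF-8 encode.
    error_name, error_codec = type(exc).__name__, exc.encoding

# Decoding with the file's real encoding first avoids the problem.
round_trip = raw_line.decode("utf-8").encode("utf-8")

print(error_name, error_codec)
print(round_trip == raw_line)
```

The point of the sketch is only that the data must be decoded with the correct codec before anything tries to encode it.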

import io

with io.open(u'spam1.csv', 'r', encoding='utf8') as csvfile:
    spamreader = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        print row

The unicode_csv_reader() wrapper then temporarily encodes each unicode line to UTF-8 for the csv module to handle, and decodes every cell back to unicode afterwards.

Because your data is already encoded as UTF-8, you could get away with:

with open(u'spam1.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        row = [unicode(cell, 'utf-8') for cell in row]

as well. This decodes the row cells directly from UTF-8, instead of first decoding to Unicode, then encoding back to UTF-8 bytes, and then decoding again.
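For comparison, here is a hedged sketch of the same task on Python 3, which the question does not use. There the csv module works on text directly, so no wrapper is needed as long as the file is opened with the right encoding. The function name read_utf8_csv and the temporary demo file are assumptions made for illustration only.

```python
import csv
import os
import tempfile


def read_utf8_csv(path, **kwargs):
    """Return all rows of a UTF-8 encoded CSV file as lists of text strings."""
    # newline='' lets the csv module handle line endings itself.
    with open(path, "r", encoding="utf-8", newline="") as csvfile:
        return list(csv.reader(csvfile, **kwargs))


# Build a small UTF-8 file to read back (stand-in for spam1.csv).
fd, demo_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as handle:
    handle.write(u"name,price\ncaf\u00e9,\u00a33\n".encode("utf-8"))

rows = read_utf8_csv(demo_path, delimiter=",", quotechar='"')
os.remove(demo_path)

for row in rows:
    print(row)
```

On Python 2 the two approaches shown in this answer remain the ones to use; this variant only applies if moving to Python 3 is an option.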

