
utf-16-le BOM csv files

I'm downloading some CSV files from the Play Store (stats, etc.) and want to process them with Python.

cromestant@jumphost-vpc:~/stat_dev/bime$ file -bi stats/installs/*
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le

As you can see, they are all UTF-16LE.

I have some Python 2.7 code that works on some files but not on others:

import codecs
# ...
fp = codecs.open(dir_n + '/' + file_n, 'r', 'utf-16')
for line in fp:
    pass  # write to mysql db

This works until:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 10: ordinal not in range(128)

What is the proper way to do this? I've seen suggestions to re-encode, use the csv module, etc., but the csv module does not handle encoding by itself, so it seems like overkill for just dumping to a database.

Have you tried codecs.EncodedFile?

import codecs
import csv

with open('x.csv', 'rb') as f:
    # Recode on the fly: the file is UTF-16LE, csv.reader receives UTF-8 bytes
    g = codecs.EncodedFile(f, 'utf8', 'utf-16le', 'ignore')
    c = csv.reader(g)
    for row in c:
        print row
        # and if you want to use unicode instead of str:
        row = [unicode(cell, 'utf8') for cell in row]

What is the proper way to do this?

The proper way is to use Python 3, in which Unicode support is vastly more rational.
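For comparison, here is a minimal sketch of what that could look like in Python 3. The helper name read_utf16_csv is made up for illustration, and the sketch assumes the files start with a BOM, so the 'utf-16' codec can pick the byte order from it and drop it from the data. If the files turn out to have no BOM, naming the byte order explicitly with 'utf-16-le' is the safer choice.

```python
import csv


def read_utf16_csv(path):
    """Return the rows of a UTF-16 CSV file as lists of str.

    Hypothetical helper, not part of any library.
    """
    # 'utf-16' reads the BOM to choose the byte order and strips it.
    # newline='' lets the csv module handle line endings itself.
    with open(path, encoding="utf-16", newline="") as fp:
        return list(csv.reader(fp))
```

Every cell comes back as an ordinary str, so there is no separate encode/decode step around the csv module; something like `for row in read_utf16_csv('utf16le.csv'): print(row)` would then feed the rows to the database code.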

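As a workaround, if you are allergic to Python 3 for some reason, the best compromise is to wrap csv.reader(): encode each decoded line to UTF-8 before the Python 2 csv module sees it, then decode each parsed column back to unicode, like so: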

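import codecs
import csv

def to_utf8(fp):
    # csv.reader in Python 2 wants byte strings, so feed it UTF-8 lines
    for line in fp:
        yield line.encode('utf-8')

def from_utf8(rows):
    # Decode every parsed column back to unicode
    for row in rows:
        yield [column.decode('utf-8') for column in row]

# Note: 'utf-16le' leaves a leading BOM (u'\ufeff') in the first cell;
# the 'utf-16' codec strips it.
with codecs.open('utf16le.csv', 'r', 'utf-16le') as fp:
    reader = from_utf8(csv.reader(to_utf8(fp)))
    for line in reader:
        # "line" is a list of unicode strings
        # write to mysql db
        print line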
