
utf-16-le BOM csv files

I'm downloading some CSV files from the Play Store (stats etc.) and want to process them with Python.

cromestant@jumphost-vpc:~/stat_dev/bime$ file -bi stats/installs/*
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le

As you can see, they are UTF-16LE.

I have some code on Python 2.7 that works on some files and not on others:

import codecs
.
.
fp = codecs.open(dir_n + '/' + file_n, 'r', "utf-16")
for line in fp:
    # write to mysql db

This works until:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 10: ordinal not in range(128)

What is the proper way to do this? I've seen suggestions to re-encode, use the csv module, etc., but the csv module does not handle encoding by itself, so it seems like overkill for just dumping to a database.

Have you tried codecs.EncodedFile?

import codecs
import csv

with open('x.csv', 'rb') as f:
    # Transcode on the fly: the file holds UTF-16LE, the csv module is handed UTF-8 bytes
    g = codecs.EncodedFile(f, 'utf8', 'utf-16le', 'ignore')
    c = csv.reader(g)
    for row in c:
        print row
        # and if you want to use unicode instead of str:
        row = [unicode(cell, 'utf8') for cell in row]

What is the proper way to do this?

The proper way is to use Python 3, in which Unicode support is vastly more rational.
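
For illustration, here is a minimal sketch of what that could look like in Python 3, where open() accepts an encoding and csv.reader() works on str directly. The helper name read_utf16_csv, the sample file name, and the sample contents are made up for this example; the database write is left as a comment.

```python
import csv
import os
import tempfile


def read_utf16_csv(path):
    """Return the rows of a UTF-16 CSV file as lists of str (hypothetical helper)."""
    # encoding="utf-16" uses the BOM to pick the byte order and then drops it;
    # newline="" is what the csv module documentation recommends for its readers.
    with open(path, "r", encoding="utf-16", newline="") as fp:
        return list(csv.reader(fp))


# Build a small sample file so the sketch runs on its own.
# The name and contents are invented, not taken from the real Play Store export.
sample_path = os.path.join(tempfile.mkdtemp(), "installs_sample.csv")
with open(sample_path, "wb") as out:
    out.write(b"\xff\xfe" + "Date,Label\r\n2014-01-01,Versi\xf3n\r\n".encode("utf-16-le"))

rows = read_utf16_csv(sample_path)
for row in rows:
    print(row)  # every cell is already a str; write to the database here
```

Because the cells are already text, no re-encoding wrapper around csv.reader() is needed.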

As a work-around, if you are allergic to Python 3 for some reason, the best compromise is to wrap csv.reader(), like so:

import codecs
import csv

def to_utf8(fp):
    for line in fp:
        yield line.encode("utf-8")

def from_utf8(fp):
    for line in fp:
        yield [column.decode('utf-8') for column in line]

with codecs.open('utf16le.csv','r', 'utf-16le') as fp:
    reader = from_utf8(csv.reader(to_utf8(fp)))
    for line in reader:
        #"line" is a list of unicode strings
        #write to mysql db
        print line
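
One detail that may be worth checking, since the title mentions a BOM: the explicit 'utf-16-le' codec does not strip a leading byte order mark, so it would show up as U+FEFF at the start of the first cell, whereas the generic 'utf-16' codec consumes it. The sketch below (written for Python 3, with made-up sample data) only demonstrates that difference; whether the downloaded files actually start with a BOM is an assumption.

```python
import codecs

# Made-up sample: a little-endian BOM followed by UTF-16LE text.
data = codecs.BOM_UTF16_LE + "Date,Installs\r\n".encode("utf-16-le")

# "utf-16-le" fixes the byte order and keeps the BOM in the text as U+FEFF.
as_le = data.decode("utf-16-le")

# "utf-16" reads the BOM to choose the byte order, then drops it.
as_auto = data.decode("utf-16")

print(repr(as_le))
print(repr(as_auto))

# If the explicit codec is kept, the leading BOM can be removed by hand.
cleaned = as_le.lstrip("\ufeff")
```

If the first column header looks wrong after parsing, switching the codec name to 'utf-16' or stripping U+FEFF as above is a reasonable first thing to try.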
