
Python importing .csv files in utf-8 or cp1252

I asked a question a while back about dealing with the import of .csv files with special characters. At the time I was interested in solving the 90% case, but now I'm back for the last 10%.

It's mostly the same setup as before:

  1. Many input files
  2. All .csv
  3. New: Now I want to preserve special characters in some inputs. However, I don't have control over the format of all of my input files, so I have a mix of files that I need to process. My attempted solution was to pass a keyword argument when I want a different encoding.

Here is the code:

import csv
import unicodecsv
# <Lots of other declarations and initialization>

def _csv_dict(self, file, index_field, ScrubMe, **kwargs):

    # some irrelevant initialization stuff here.

    if 'formatting' in kwargs:
        formatting = kwargs['formatting']
    else:
        formatting = None  # cp1252 is the OS default

    with open(file, encoding=formatting, errors='ignore') as f:  # newline='',
        if formatting is None:
            reader = csv.DictReader(f, dialect='excel')
        else:  # assume for now UTF-8 is the only other supported format
            reader = unicodecsv.DictReader(f, dialect=csv.excel)

        for line in reader:
            ...  # <do some stuff - it's mostly building dictionaries, but I
                 # generally edit the data to only keep the stuff I care about
                 # and do a little data transformation to standard formats>

The result of the above is that if I pass in an Excel file saved as a .csv in the native codec, the import works. However, the call on the UTF-8 file, which includes the formatting='utf-8' keyword argument, crashes.

The error message suggests that I'm passing the wrong type of object somewhere along the line. This happens the first time I attempt to read a line out of the UTF-8 file:

File "C:\Users\<me>\AppData\Local\Programs\Python\Python37\lib\site-packages\unicodecsv\py3.py", line 51, in <genexpr>
    f = (bs.decode(encoding, errors=errors) for bs in f)
AttributeError: 'str' object has no attribute 'decode'
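
For what the traceback is pointing at: unicodecsv does its own decoding, so its readers expect a file object that yields bytes, while the code above hands it a text-mode handle. A minimal sketch of the contrast, using in-memory buffers (the sample data here is illustrative, not from my real files):

import io
import csv
import unicodecsv

# Binary input works: unicodecsv decodes each line of bytes itself.
raw = io.BytesIO(b'a,b\r\n1,2\r\n')
reader = unicodecsv.DictReader(raw, dialect=csv.excel, encoding='utf-8')
print(next(reader))  # first row as a dict: {'a': '1', 'b': '2'}

# A text-mode handle yields str, so unicodecsv's internal bs.decode(...)
# fails with: AttributeError: 'str' object has no attribute 'decode'
text = io.StringIO('a,b\r\n1,2\r\n')
bad = unicodecsv.DictReader(text, dialect=csv.excel, encoding='utf-8')
# next(bad)  # raises the AttributeError shown above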

From what I have read, UTF-8 is actually tab-separated instead of comma-separated, but I "think" it's supposed to work the same way.

I feel like I've probably messed up something pretty simple, but I've killed enough time looking that it seems appropriate to ask for help. Thanks in advance for any suggestions.

I'm replacing my initial answer because I had multiple things going on, and it took me a while to untangle them.

1) @lenz is correct. In Python 3 it is unnecessary to use unicodecsv.DictReader. Part of what confused me is the difference in implementation.

a) For the older unicodecsv.DictReader from Python 2:

kw_args = {'errors': None}
with open(filename, 'rb', **kw_args) as file:  # binary mode: unicodecsv decodes
    reader = unicodecsv.DictReader(file, dialect=csv.excel, encoding='utf_8_sig')

b) For the Python 3 csv.DictReader:

kw_args = {'newline': '', 'errors': None, 'encoding': 'utf_8_sig'}
with open(filename, 'r', **kw_args) as file:  # text mode: open() does the decoding
    reader = csv.DictReader(file, dialect=csv.excel)

To summarize the differences:

  • The file is now opened in text mode instead of bytes
  • Because of that, the codec can/should be specified in the open() call rather than in the DictReader
  • The newline parameter is likewise only valid for a file opened as text

2) Because my UTF-8 file was produced by Excel, it has a BOM (byte-order mark) at the top of the file. The only codec that handles this is 'utf_8_sig'.
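
A quick way to see what 'utf_8_sig' is doing (the file name here is just a stand-in for one of the Excel exports): Excel's "CSV UTF-8" output starts with the three bytes EF BB BF, and a plain 'utf-8' decode leaves them on the first header name as '\ufeff', while 'utf_8_sig' strips them transparently.

import csv

with open('excel_export.csv', 'rb') as f:  # hypothetical Excel CSV UTF-8 export
    print(f.read(3))                       # b'\xef\xbb\xbf' - the BOM

with open('excel_export.csv', encoding='utf_8_sig', newline='') as f:
    reader = csv.DictReader(f, dialect=csv.excel)
    print(reader.fieldnames)               # clean header names, no '\ufeff'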

3) Because my output files are being read downstream by SQL Server, the output codec needs to be 'utf_16_le', or SQL Server doesn't recognize it.

4) Also, because the target is SQL Server, I have to manually insert the BOM at the top of the file:

csvfile.write('\uFEFF')  # BOM first; under utf_16_le this encodes to FF FE
writer.writeheader()
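
Pulled together, the output side looks roughly like this (out_path, fieldnames, and record_list are placeholder names, not from my actual code):

import csv

# SQL Server / SSIS side: UTF-16 LE with an explicit BOM at the start.
with open(out_path, 'w', encoding='utf_16_le', newline='') as csvfile:
    csvfile.write('\ufeff')  # manual BOM; lands on disk as the bytes FF FE
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, dialect=csv.excel)
    writer.writeheader()
    # ... data rows are written here (see item 5 below)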

If you open the above output file in Excel it will no longer be in columns, but SQL Server (actually SSIS) now knows how to read the file.

5) Just to mess with me a little more, someone had '\n' embedded in a few of the records. With Excel as both source and destination this was not an issue, but it was a problem for SSIS. My solution:

for r in record_list:
    temp = {}
    for k, v in r.items():
        # flatten embedded newlines so each record stays on one physical line
        if isinstance(v, str):
            temp[k] = v.replace('\n', ' ')
        else:
            temp[k] = v
    writer.writerow(temp)
