
How to address: Python import of file with .csv Dictreader fails on undefined character

First of all, I found the following question, which is basically the same as mine, but it is closed and I'm not sure I understand the reason for closing versus the content of the post. I also don't really see a working answer there.

I have 20+ input files from 4 apps. All files are exported as .csv files. The first 19 files worked (4 others exported from the same app also work), and then I ran into a file that gives me this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5762: character maps to <undefined>

If I looked that up right, it is a <ctrl> character. The lines below are the relevant code:

with open(file, newline='') as f:
    reader = csv.DictReader(f, dialect='excel')
    for line in reader:
        ...  # row processing continues here

I know I'm going to be getting a file, and I know it will be a .csv. There may be some variance in what I get due to the manual generation/export of the source files. There may also be some strange characters in some of the files (e.g. Japanese, Russian, etc.). I mention this because going back to the source to get a different file might just kick the can down the road until I have to pull updated data (or worse, someone else does).

So the question is probably multi-part:

1) Is there a way to tell csv.DictReader to ignore undefined characters? (Hint for the codec: if I can't see it, it is of no value to me.)

2) If I do have "crazy" characters, what should I do? I've considered opening each input as a binary file, filtering out the offending hex characters, writing the file back to disk, and then opening the new file, but that seems like a lot of overhead for the program and even more for me. It's also a few JCL statements away from being 1977 again.
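For reference, the binary filter-and-rewrite approach described above could be sketched roughly as follows (the sample data is hypothetical; it strips every byte outside printable ASCII plus common whitespace). As the question suspects, this is a lot of machinery compared to simply telling the decoder to drop bad bytes:

```python
# Sketch of the "open as binary, filter offending bytes" idea.
# Keeps printable ASCII plus tab/CR/LF and drops everything else,
# including the 0x8f byte from the traceback.
def strip_undecodable_bytes(raw: bytes) -> bytes:
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    return bytes(b for b in raw if b in allowed)

data = b"name,city\x8f\r\nAlice,Tokyo\r\n"   # hypothetical sample with the bad byte
clean = strip_undecodable_bytes(data)
print(clean)  # b'name,city\r\nAlice,Tokyo\r\n'
```

Note this also throws away any legitimate non-ASCII text (the Japanese and Russian characters mentioned above), which is why it is only worth it when those characters genuinely have no value.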

3) How do I figure out what I'm getting as input if it crashes while I'm reading it in?
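One low-tech way to answer question 3 is to read the file in binary (which can't crash on decoding) and look at the raw bytes around the offset reported in the UnicodeDecodeError (5762 in the traceback above). A minimal sketch, with a hypothetical file path:

```python
# Peek at the raw bytes around the offset from a UnicodeDecodeError,
# without trying to decode the whole file.
def show_context(path, offset, width=20):
    with open(path, 'rb') as f:
        raw = f.read()
    start = max(0, offset - width)
    return raw[start:offset + width]

# Usage (path and offset are from the failing file's traceback):
# print(show_context('export.csv', 5762))
```

Printing the returned bytes object shows escapes like \x8f for anything undecodable, which often makes it obvious whether you are looking at a stray control byte or a legitimate non-ASCII encoding.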

4) I chose dialect='excel' because many of the inputs are Excel files that can be downloaded from one of the source applications. From the docs on DictReader, my impression is that this just defines the delimiter, quote character, and EOL characters to expect/use. Therefore, I don't think this is my issue, but I'm also a Python noob, so I'm not 100% sure.
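That impression is correct: a dialect only describes structure (delimiter, quoting, line endings), not the byte encoding, so it cannot cause a UnicodeDecodeError. If the structure itself were in doubt, the standard library's csv.Sniffer can guess a dialect from a sample; a small sketch with made-up semicolon-delimited data:

```python
import csv
import io

# A dialect covers delimiter/quoting only; Sniffer can guess it
# from a sample of the file's text (sample data is hypothetical).
sample = "name;city\nAlice;Tokyo\n"
dialect = csv.Sniffer().sniff(sample)

reader = csv.DictReader(io.StringIO(sample), dialect=dialect)
rows = list(reader)
print(rows[0]["city"])
```

Sniffing happens after decoding, though, so the encoding problem above still has to be solved first.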

I posted the solution I went with in the comments above; it was to set the errors argument of open() to 'ignore':

with open(file, newline = '', errors='ignore') as f: 

This is exactly what I was looking for in my first question in the original post above (i.e. whether there is a way to tell csv.DictReader to ignore undefined characters).
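A self-contained sketch of the fix, using a throwaway file that contains the 0x8f byte from the traceback (the encoding is pinned to 'ascii' here for a deterministic demo; the original code relied on the platform default, which produced the 'charmap' error on Windows):

```python
import csv
import os
import tempfile

# Write a small CSV containing the undecodable 0x8f byte, then read
# it back with errors='ignore' so the bad byte is silently dropped.
tmp = tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False)
tmp.write(b"name,city\r\nAli\x8fce,Tokyo\r\n")
tmp.close()

with open(tmp.name, newline="", encoding="ascii", errors="ignore") as f:
    rows = list(csv.DictReader(f, dialect="excel"))

print(rows[0]["name"])  # 'Alice' -- the 0x8f byte was dropped
os.unlink(tmp.name)
```

The trade-off is that the dropped bytes are gone for good, which is fine exactly as long as they carry no value.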

Update: Later I did need to work with some of the Unicode characters and couldn't ignore them. The correct answer for that case, with an Excel-produced Unicode .csv file, was to use the 'utf_8_sig' codec. That strips the byte order mark (the UTF-8 BOM) that Excel writes at the top of the file to signal that it contains Unicode characters.
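A sketch of that second fix, again with a throwaway file: Excel's "CSV UTF-8" export begins with the UTF-8 BOM bytes EF BB BF, and opening with encoding='utf-8-sig' strips them, whereas plain 'utf-8' would leave '\ufeff' glued onto the first header name.

```python
import csv
import os
import tempfile

# Simulate an Excel "CSV UTF-8" export: BOM first, then UTF-8 data
# (the second row starts with the Japanese character U+6771).
tmp = tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False)
tmp.write(b"\xef\xbb\xbfname,city\r\n\xe6\x9d\xb1,Tokyo\r\n")
tmp.close()

with open(tmp.name, newline="", encoding="utf-8-sig") as f:
    rows = list(csv.DictReader(f, dialect="excel"))

print(rows[0])  # {'name': '東', 'city': 'Tokyo'}
os.unlink(tmp.name)
```

Unlike errors='ignore', this keeps the non-ASCII characters intact, which is what the update above needed.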
