
Python - Handling Data errors in CSV file

I have a CSV file that may have invalid UTF-8 encodings on some rows. The file is sometimes hundreds of thousands of rows long, so I want to just skip the rows with invalid characters (making a note of them) and keep the 99.9% of rows that are valid (for this application, it's not essential that every row in the input gets into the database).

My Python code looks like this:

import csv
import logging

log = logging.getLogger(__name__)

# Iterate through the CSV file (fileName is the path to the input file)
with open(fileName, "rt", encoding="utf8") as csvFile:
    try:
        reader = csv.DictReader(csvFile)
        for csvDataRow in reader:
            try:
                # reader.line_num is the number of source lines read so far
                log.debug('Row ' + str(reader.line_num))
                #
                # .. row handling code here ..
                #
            except Exception as e:
                log.error('Exception at the for loop level\n' + str(e))
    except Exception as e:
        log.error('Exception at the reader level\n' + str(e))

What I expected was that the invalid data would trigger the exception at the for loop level, so I could catch just UnicodeDecodeError there, skip the line, and then continue the loop.

The problem is that the exception doesn't trigger there - it hits the except clause at the reader level, i.e. outside the loop context. So I can no longer use continue to carry on with the for loop iterating over the rows.

The net result is that if I hit a single invalid row at line 674,398 of a CSV file with 2,966,480 rows in total, the exception causes every row after row 674,398 to be skipped. In this case, it turns out that this line of the input has an invalid continuation byte that breaks the UTF-8 decoder.

I spent a fair bit of time reading the Python csv documentation and searching around to find a solution to this. The key seems to be that the exception is coming from this line:

       for csvDataRow in reader:

i.e. it is being triggered in the call to the DictReader iterator to get the next row. Nowhere in the csv documentation does it mention how to handle errors like this.
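To make the behaviour concrete, here is a minimal reproduction sketch. The sample bytes, the `read_rows_strict` name, and the use of `io.TextIOWrapper` over `io.BytesIO` as a stand-in for `open(..., encoding="utf8")` are all assumptions for illustration, not part of the original code.

```python
import csv
import io

# Hypothetical sample data: in the second data row, 0xC3 is followed by 0x28,
# which is not a valid UTF-8 continuation byte.
RAW = b"id,name\n1,alice\n2,b\xc3\x28d\n3,carol\n"


def read_rows_strict(raw):
    """Read CSV rows with strict UTF-8 decoding; return (rows, error)."""
    rows = []
    error = None
    # TextIOWrapper over BytesIO stands in for open(..., encoding="utf8").
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf8", newline="")
    try:
        for row in csv.DictReader(text):  # decoding happens while fetching a row
            try:
                rows.append(row)  # stand-in for the row handling code
            except UnicodeDecodeError:
                pass  # not reached: the loop body does no decoding
        # falling out of the loop normally would mean every row was read
    except UnicodeDecodeError as exc:
        error = exc  # the loop has already been abandoned at this point
    return rows, error


rows, error = read_rows_strict(RAW)
print(type(error).__name__, len(rows))
```

As far as I can tell, the text layer decodes the file in chunks rather than row by row, so the error can surface before the reader has even reached the bad row. The number of rows collected before the failure is therefore only a rough guide to where the bad bytes are.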

The trick is that the encoding transformation isn't happening in the csv module - it's happening in the file object underneath it, and so the change that's needed is in the open call.

Adding errors="replace" to the open call causes the codec to substitute the Unicode replacement character (U+FFFD, '�') for any invalid bytes in the input instead of raising an exception.

      with open(fileName, "rt", encoding="utf8", errors="replace") as csvFile:
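Note that errors="replace" keeps the damaged rows rather than skipping them. If the goal is still to skip and note those rows, one option is to look for the replacement character in each row. The sketch below assumes that approach; `split_rows`, `REPLACEMENT`, and the sample bytes are hypothetical names and data, and an in-memory stream stands in for the real file.

```python
import csv
import io

REPLACEMENT = "\ufffd"  # what errors="replace" inserts when decoding fails

# Hypothetical sample data: the second data row contains an invalid byte pair.
RAW = b"id,name\n1,alice\n2,b\xc3\x28d\n3,carol\n"


def split_rows(text_stream):
    """Return (good_rows, skipped_line_numbers) for a CSV text stream."""
    good, skipped = [], []
    reader = csv.DictReader(text_stream)
    for row in reader:
        # Only plain string fields are checked; None fills for short rows and
        # overflow lists for long rows are ignored in this sketch.
        fields = [v for v in row.values() if isinstance(v, str)]
        if any(REPLACEMENT in v for v in fields):
            skipped.append(reader.line_num)  # note the line, then skip the row
            continue
        good.append(row)
    return good, skipped


# Stand-in for open(fileName, "rt", encoding="utf8", errors="replace").
text = io.TextIOWrapper(
    io.BytesIO(RAW), encoding="utf8", errors="replace", newline=""
)
good, skipped = split_rows(text)
print(len(good), skipped)
```

One caveat with this check: a row that legitimately contains U+FFFD in the source data would be skipped as well, so it is only suitable when that character is not expected in valid input.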
