Catch UnicodeDecodeError exception while reading file line by line in Python 3

Consider the following code:

with open('file.txt', 'r') as f:
    for line in f:
        print(line)

In Python 3, a file opened in text mode decodes the bytes it reads, which can raise exceptions such as UnicodeDecodeError. These can of course be caught with a try ... except block around the whole loop, but I would like to handle them on a per-line basis.

Question: Is there a way to directly catch and handle exceptions for each line that is read? Hopefully without changing the simple syntax of iterating over the file too much?

The Pythonic way is probably to register an error handler with codecs.register_error('special', handler) and pass its name to open through the errors parameter:

with open('file.txt', 'r', errors='special') as f:
    ...

That way, if a line contains an offending byte sequence, the handler will be called with the UnicodeDecodeError and can either return a replacement string together with the position at which decoding should resume, or re-raise the error.
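
As a rough illustration of the handler approach, here is a minimal sketch. The handler name special_handler, the '<0xNN>' replacement format, the helper read_lines, and the temporary demo file are all assumptions made for this example; only codecs.register_error and the errors parameter of open come from the answer above.

```python
import codecs
import os
import tempfile


def special_handler(exc):
    """Hypothetical handler: show each undecodable byte as '<0xNN>'."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only deals with decoding errors
    bad_bytes = exc.object[exc.start:exc.end]
    replacement = "".join(f"<0x{b:02x}>" for b in bad_bytes)
    # Return the replacement text and the index where decoding should resume.
    return replacement, exc.end


# Make the handler available under the name 'special'.
codecs.register_error("special", special_handler)


def read_lines(path):
    """Read a UTF-8 text file, routing decode errors through 'special'."""
    with open(path, "r", encoding="utf-8", errors="special") as f:
        return [line.rstrip("\n") for line in f]


# Demo: a temporary file whose second line contains an invalid UTF-8 byte.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"good line\nbad \xff byte\nlast line\n")
demo_path = tmp.name

demo_lines = read_lines(demo_path)
os.remove(demo_path)
print(demo_lines)
```

Note that the handler is invoked once per undecodable byte sequence rather than once per line, so the loop over the file keeps its simple form and never sees the exception.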

If you want more explicit processing, an alternative is to open the file in binary mode and explicitly decode each line:

with open('file.txt', 'rb') as f:
    for bline in f:
        try:
            line = bline.decode()
            print(line)
        except UnicodeDecodeError as e:
            # process error
            pass
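
The binary-mode idea could also be wrapped in a small generator so that the calling loop stays simple. This is only a sketch: the name decode_lines, the (line number, text, error) tuple shape, and the in-memory sample data are assumptions for illustration. It also assumes an encoding such as UTF-8 in which the byte b'\n' always marks a line break; that would not hold for UTF-16, for example.

```python
import io


def decode_lines(binary_file, encoding="utf-8"):
    """Yield (line_number, text, error) for every line of a binary file.

    text is None when the line could not be decoded; error is None otherwise.
    """
    for number, raw in enumerate(binary_file, start=1):
        try:
            yield number, raw.decode(encoding), None
        except UnicodeDecodeError as exc:
            yield number, None, exc


# Demo with in-memory bytes; the second line holds an invalid UTF-8 byte.
sample = io.BytesIO(b"first\nsecond \xff\nthird\n")
results = list(decode_lines(sample))

good = [text for _, text, err in results if err is None]
bad_numbers = [number for number, _, err in results if err is not None]
print(good, bad_numbers)
```

With a real file, the same generator would be called as decode_lines(open('file.txt', 'rb')) inside a with block.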

Instead of using a for loop, you could call next on the file iterator yourself and catch StopIteration manually:

with open('file.txt', 'r') as f:
    while True:
        try:
            line = next(f)
            # code
        except StopIteration:
            break
        except UnicodeDecodeError:
            # code
            continue

Based on @SergeBallesta's answer, here's the simplest thing that should work.

Instead of open(), use codecs.open(..., errors='your choice'). It can handle Unicode errors for you.

The list of error handler names includes

'replace': "Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and '?' on encoding"

which should handle the error and add a marker meaning "there was something invalid here" to the text.

import codecs

# ...

# instead of open('filename.txt'), do:
with codecs.open('filename.txt', 'rb', 'utf-8', errors='replace') as f:
    for line in f:
        # ....
        pass
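
The built-in open() accepts the same errors argument, so a variant without the codecs module might look like the sketch below. The temporary file and its contents are made up for the demonstration; the byte 0xE9 is "é" in Latin-1 but is not valid UTF-8 when followed by a space.

```python
import os
import tempfile

# Demo file: the first line contains a Latin-1 byte that is invalid in UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"caf\xe9 au lait\nplain line\n")
path = tmp.name

# errors='replace' substitutes U+FFFD for each undecodable byte sequence,
# so iterating over the file never raises UnicodeDecodeError.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    replaced = [line.rstrip("\n") for line in f]

os.remove(path)
print(replaced)
```

Lines that contain the U+FFFD marker can then be detected afterwards with a check such as '\ufffd' in line, if per-line reporting is still wanted.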

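Place your try-except inside the for loop, like so. Be aware that this only catches exceptions raised by the loop body; a UnicodeDecodeError raised while the for statement fetches the next line occurs outside the try block and is not caught: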

with open('file.txt', 'r') as f:
    for line in f:
        try:
            print(line)
        except Exception:
            print("uh oh")
            # continue
