Catching UnicodeDecodeError exceptions when reading a file line by line in Python 3

Question

Consider the following code:

with open('file.txt', 'r') as f:
    for line in f:
        print(line)

In Python 3, the interpreter tries to decode the strings it reads, which might lead to exceptions like UnicodeDecodeError. These can of course be caught with a try ... except block around the whole loop, but I would like to handle them on a per-line basis.

Question: Is there a way to directly catch and handle exceptions for each line that is read? Hopefully without changing the simple syntax of iterating over the file too much?

Answer 1

The Pythonic way is probably to register an error handler with codecs.register_error('special', handler) and name it in the errors argument of the open function:

with open('file.txt', 'r', errors='special') as f:
    ...

That way, if a line contains bytes that cannot be decoded, the handler will be called with the UnicodeDecodeError and will be able to return a replacement string (together with the position at which decoding should resume) or re-raise the error.
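As a rough illustration of the registration step, here is a minimal sketch. The handler name special_handler, the '?' replacement, and the temporary sample file are assumptions made up for this example; only codecs.register_error and the errors argument of open come from the answer itself.

```python
import codecs
import os
import tempfile


def special_handler(exc):
    """Hypothetical handler: replace each undecodable byte run with '?'."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the replacement text and the position at which decoding resumes.
        return ("?", exc.end)
    # Anything else (for example an encoding error) is re-raised unchanged.
    raise exc


# Register the handler under the name later passed to open(..., errors=...).
codecs.register_error("special", special_handler)

# Build a small sample file containing one byte (0xff) that is invalid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"good line\nbad \xff line\n")
    sample_path = tmp.name

with open(sample_path, "r", encoding="utf-8", errors="special") as f:
    lines = [line.rstrip("\n") for line in f]

os.remove(sample_path)
print(lines)
```

Note that the handler is invoked once per undecodable byte sequence rather than once per line, so it suits substitution better than line-level bookkeeping.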

If you want more explicit processing, an alternative is to open the file in binary mode and explicitly decode each line:

with open('file.txt', 'rb') as f:
    for bline in f:
        try:
            # bytes.decode() defaults to UTF-8
            line = bline.decode()
            print(line)
        except UnicodeDecodeError as e:
            ...  # process error
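If keeping the plain for-loop syntax matters, the binary-mode approach could be wrapped in a small generator. The sketch below is only one possible shape: the name decoded_lines and the (text, error) tuple convention are assumptions for illustration, and it presumes an ASCII-compatible encoding such as UTF-8, since binary files are split into lines on b'\n'.

```python
import io


def decoded_lines(binary_file, encoding="utf-8"):
    """Yield (text, error) per line; exactly one of the two is None."""
    for raw in binary_file:
        try:
            yield raw.decode(encoding), None
        except UnicodeDecodeError as exc:
            yield None, exc


# io.BytesIO stands in for open('file.txt', 'rb') so the sketch is self-contained.
sample = io.BytesIO(b"first\nbroken \xff\nlast\n")

results = []
for text, error in decoded_lines(sample):
    if error is None:
        results.append(text.rstrip("\n"))
    else:
        # Only the failing line is affected; iteration simply continues.
        results.append(f"<undecodable: {error.reason}>")

print(results)
```

Because each line is decoded independently, a failure on one line does not disturb the lines that follow it.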

Answer 2

Instead of employing a for loop, you could call next on the file iterator yourself and catch the StopIteration manually.

with open('file.txt', 'r') as f:
    while True:
        try:
            line = next(f)
            # code
        except StopIteration:
            # end of file reached
            break
        except UnicodeDecodeError:
            ...  # code

Answer 3

Based on @SergeBallesta's answer. Here's the simplest thing that should work.

Instead of open(), use codecs.open(..., errors='your choice'). It can handle Unicode errors for you.

The list of error handler names includes

'replace': "Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and '?' on encoding"

which should handle the error and add a marker meaning "there was something invalid here" to the text.

import codecs

# ...

# instead of open('filename.txt'), do:
with codecs.open('filename.txt', 'rb', 'utf-8', errors='replace') as f:
    for line in f:
        ...  # process line
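To see what a few of the built-in handler names do, the short sketch below decodes one sample byte string with each of them. The sample bytes are made up for illustration; the same errors argument is also accepted by the built-in open().

```python
# 0xe9 is 'é' in Latin-1 but is not a valid UTF-8 sequence in this position.
raw = b"caf\xe9 ok"

# 'replace' substitutes U+FFFD, the official replacement character.
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# 'backslashreplace' keeps the byte visible as an escape sequence.
escaped = raw.decode("utf-8", errors="backslashreplace")

print(repr(replaced))
print(repr(ignored))
print(repr(escaped))
```

Which handler fits depends on whether the invalid bytes should stay visible in the output ('replace', 'backslashreplace') or be discarded ('ignore').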

Answer 4

Place your try-except inside the for loop, like so:

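with open('file.txt', 'r') as f:
    # note: decoding happens when the for statement fetches the next line,
    # so a decode error is raised there, outside the try block below
    for line in f:
        try:
            print(line)
        except UnicodeDecodeError:
            print("uh oh")
            # continue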

4 answers

Answer 1: 10, accepted, 2017-11-23 10:27:07

Answer 2: 6, 2017-11-23 10:16:25

Answer 3: 2, 2020-07-30 18:57:55

Answer 4: -1, 2017-11-23 10:17:05
