Catch UnicodeDecodeError exception while reading file line by line in Python 3
Consider the following code:
with open('file.txt', 'r') as f:
    for line in f:
        print(line)
In Python 3, the interpreter tries to decode the strings it reads, which might lead to exceptions like UnicodeDecodeError. These can of course be caught with a try ... except block around the whole loop, but I would like to handle them on a per-line basis.
Question: Is there a way to directly catch and handle exceptions for each line that is read? Hopefully without changing the simple syntax of iterating over the file too much?
The Pythonic way is probably to register an error handler with codecs.register_error('special', handler) and name it in the errors argument of the open function:
with open('file.txt', 'r', errors='special') as f:
    ...
That way, if a line contains bytes that cannot be decoded, the handler will be called with the UnicodeDecodeError and can either return a replacement (as a tuple of the replacement string and the position at which decoding should resume) or re-raise the error.
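As a rough illustration of the handler approach, here is a minimal sketch. The handler name mark_bad_bytes, the '<0x..>' marker format, and the temporary demo file are assumptions made for this example and are not part of the original answer; only codecs.register_error and the errors argument of open come from the standard library.

```python
import codecs
import os
import tempfile


def mark_bad_bytes(error):
    """Hypothetical handler: replace undecodable bytes with a visible hex marker."""
    if not isinstance(error, UnicodeDecodeError):
        raise error  # this sketch only deals with decoding errors
    bad = error.object[error.start:error.end]
    replacement = ''.join(f'<0x{byte:02x}>' for byte in bad)
    # Return the replacement text and the position at which decoding resumes.
    return replacement, error.end


codecs.register_error('special', mark_bad_bytes)

# Build a small demo file: the middle line holds a byte that is invalid in UTF-8.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'first line\nbad \xff byte\nlast line\n')
    path = tmp.name

with open(path, 'r', encoding='utf-8', errors='special') as f:
    lines = [line.rstrip('\n') for line in f]

os.remove(path)
print(lines)
```

Note that the handler is invoked once per undecodable byte sequence rather than once per line, so it cannot by itself skip a whole line; it can only decide what text stands in for the bad bytes.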
If you want more explicit processing, an alternative is to open the file in binary mode and explicitly decode each line:
with open('file.txt', 'rb') as f:
    for bline in f:
        try:
            line = bline.decode()  # bytes.decode() defaults to UTF-8
            print(line)
        except UnicodeDecodeError as e:
            # process error
            print(f'could not decode line: {e}')
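To keep the calling loop close to a plain "for line in f", the binary-mode decoding could be wrapped in a generator. The sketch below is one possible shape, not something from the original answer: the helper name decoded_lines and its (line number, text, error) tuples are hypothetical. It also assumes an encoding such as UTF-8 in which a newline is the single byte 0x0A, since the file is split into lines before decoding.

```python
import os
import tempfile


def decoded_lines(path, encoding='utf-8'):
    """Hypothetical helper: yield (line_number, text, error) for each line.

    Exactly one of text and error is None.
    """
    with open(path, 'rb') as f:
        for number, raw in enumerate(f, start=1):
            try:
                yield number, raw.decode(encoding), None
            except UnicodeDecodeError as exc:
                yield number, None, exc


# Demo file: line 2 contains a byte that is invalid in UTF-8.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'good\nbad \xff\nalso good\n')
    path = tmp.name

good, bad = [], []
for number, text, error in decoded_lines(path):
    if error is None:
        good.append(text.rstrip('\n'))
    else:
        bad.append(number)  # remember which lines failed to decode

os.remove(path)
print(good, bad)
```

The caller decides per line what to do with a failure, which is the per-line handling the question asks for.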
Instead of employing a for loop, you could call next on the file iterator yourself and catch the StopIteration manually.
with open('file.txt', 'r') as f:
    while True:
        try:
            line = next(f)
            # code
        except StopIteration:
            break  # end of file reached
        except UnicodeDecodeError:
            # code
            continue
Building on @SergeBallesta's answer, here's the simplest thing that should work.
Instead of open(), use codecs.open(..., errors='your choice'). It can handle Unicode errors for you.
The list of error handler names includes 'replace': "Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and '?' on encoding", which should handle the error and add a marker "there was something invalid here" to the text.
import codecs

# ...
# instead of open('filename.txt'), do:
with codecs.open('filename.txt', 'rb', 'utf-8', errors='replace') as f:
    for line in f:
        # ....
        pass
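The built-in open() accepts the same errors argument, so the same effect can be had without codecs.open. The following is a small sketch using a temporary demo file (the file contents and the flagged list are assumptions for illustration); it also shows how lines that needed a replacement can be spotted afterwards by looking for U+FFFD.

```python
import os
import tempfile

# Demo file: line 1 is valid UTF-8, line 2 contains an invalid byte.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'caf\xc3\xa9\nbroken \xff here\n')
    path = tmp.name

with open(path, 'r', encoding='utf-8', errors='replace') as f:
    lines = [line.rstrip('\n') for line in f]

os.remove(path)

# Lines containing U+FFFD had at least one undecodable byte replaced.
flagged = [n for n, line in enumerate(lines, start=1) if '\ufffd' in line]
print(lines, flagged)
```

A file that legitimately contains U+FFFD would also be flagged by this check, so it is only a heuristic.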
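Place your try-except inside the for loop, like so. Note that this only catches exceptions raised by the loop body; a UnicodeDecodeError raised while the file iterator fetches the next line happens outside the try block and is not caught here: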
with open('file.txt', 'r') as f:
    for line in f:
        try:
            print(line)
        except Exception:
            print("uh oh")
            # continue