How do I handle invalid UTF-8 in fileinput?

Question

I have basically the following code:

import fileinput

def main():
    for filename in fileinput.input():
        filename = filename.strip()
        process_file(filename)

The script takes a newline-separated list of file names as its input. However, some of the file names contain invalid UTF-8, which causes fileinput.input() to implode. I've read about the surrogateescape error handler, which I think is what I want, but I don't know how to set the error handler for fileinput.

In short: how do I get fileinput to deal with invalid Unicode?

Answer 1

Filenames on POSIX may be arbitrary sequences of bytes (except b'\0' and b'/'), so in the general case no single character encoding is guaranteed to decode them. That is why os.fsdecode() exists: it uses the surrogateescape error handler.
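To make the surrogateescape behaviour concrete, here is a minimal sketch. The filename bytes are a made-up example, and the codec is passed explicitly as UTF-8 rather than taken from the filesystem encoding, so this illustrates the error handler itself and is not what os.fsdecode() does on every platform.

```python
# Hypothetical filename bytes; 0xff can never appear in valid UTF-8.
raw = b"report-\xff.txt"

# Strict decoding rejects the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# in the range U+DC80..U+DCFF instead of raising.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# so the resulting str can still be turned back into the real filename.
round_tripped = name.encode("utf-8", errors="surrogateescape")
```

The round trip is what makes such a str safe to hand to open() and similar functions, which encode it back the same way on POSIX.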

You could read the filenames in binary mode, then either skip undecodable filenames (if the input shouldn't contain them) or pass them as is (or via os.fsdecode()) to functions that expect filenames:

for filename in fileinput.input(mode='rb'):
    process_file(os.fsdecode(filename).strip())

Beware: there were several known Python bugs related to using binary mode with fileinput.
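If the list only ever arrives on standard input, one possible way to avoid fileinput's binary mode altogether is to iterate over a binary stream directly. The sketch below assumes a POSIX system; iter_filenames is a hypothetical helper name, and the sample bytes stand in for real input.

```python
import io
import os

def iter_filenames(stream):
    """Yield one filename (as str) per line from a binary stream."""
    for line in stream:
        # Strip only the line terminator so other bytes stay untouched.
        name = line.rstrip(b"\n")
        if name:  # ignore blank lines
            # os.fsdecode() uses the filesystem encoding; on POSIX it
            # applies surrogateescape, so undecodable bytes survive.
            yield os.fsdecode(name)

# Stand-in for sys.stdin.buffer, including one name with invalid UTF-8.
sample = io.BytesIO(b"good.txt\nbad-\xff.txt\n\n")
names = list(iter_filenames(sample))
```

In a real script the call would be iter_filenames(sys.stdin.buffer), with each yielded name passed to process_file().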

Answer 2

Following the documentation, use an opening hook:

def main():
    for filename in fileinput.input(openhook=fileinput.hook_encoded("utf-8")):
        filename = filename.strip()
        process_file(filename)
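Note that hook_encoded("utf-8") on its own still decodes strictly. Since Python 3.6, hook_encoded() also accepts an errors argument, which is one way to get the surrogateescape handler the question asks for. The following is a sketch under that version assumption; read_names and the temporary list file are illustrative only.

```python
import fileinput
import os
import tempfile

def read_names(paths):
    """Return the lines of the given list files, tolerating invalid UTF-8."""
    # Requires Python 3.6+ for the errors parameter of hook_encoded().
    hook = fileinput.hook_encoded("utf-8", errors="surrogateescape")
    with fileinput.input(files=paths, openhook=hook) as lines:
        return [line.rstrip("\n") for line in lines]

# Build a throwaway list file containing one name with an invalid byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"good.txt\nbad-\xff.txt\n")

names = read_names([tmp.name])
os.unlink(tmp.name)
```

On Python 3.10 and later, fileinput.input() also takes encoding and errors keyword arguments directly, which makes the hook unnecessary.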

Solution 1
1 Accepted 2016-02-25 14:38:25

Solution 2
0 2016-02-25 09:48:32
