How to deal with invalid utf8 in fileinput?

I have basically the following code:

import fileinput

def main():
    for filename in fileinput.input():
        filename = filename.strip()
        process_file(filename)

The script takes a newline-separated list of file names as its input. However, some of the file names contain invalid UTF-8, which causes fileinput.input() to implode. I've read about the surrogateescape error handler, which I think is what I want, but I don't know how to set the error handler for fileinput.

In short: how do I get fileinput to deal with invalid Unicode?

Filenames on POSIX may be arbitrary sequences of bytes (except b'\0' and b'/'), i.e., no character encoding can decode them in the general case. That is why os.fsdecode() exists; it uses the surrogateescape error handler.
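To make the surrogateescape behaviour concrete, here is a minimal sketch. It calls bytes.decode() and str.encode() directly with an explicit "utf-8" codec so the result does not depend on the machine's filesystem encoding. The name raw_name is made up for illustration.

```python
# A byte string that is not valid UTF-8 (0xff never occurs in UTF-8).
raw_name = b"report-\xff.txt"

# Strict decoding rejects it.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xff becomes U+DCFF), so no information is lost.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")
```

The resulting str can be handed to open() and similar functions on POSIX, but printing it or encoding it strictly will fail because of the lone surrogate.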

You could read the filenames in binary mode, then either skip undecodable filenames (if the input shouldn't contain them) or pass them as is, or via os.fsdecode(), to functions that expect filenames:

for filename in fileinput.input(mode='rb'):
    process_file(os.fsdecode(filename).strip())
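A self-contained variant of the same idea, as a sketch only: it writes a hypothetical list file containing one valid and one invalid name, reads it back in binary mode, and strips just the line terminator so that any other whitespace in a name survives. The temporary file and the sample names are assumptions for the demo. The final os.fsdecode() step assumes a POSIX system, where undecodable bytes are escaped rather than rejected.

```python
import fileinput
import os
import tempfile

# Hypothetical input: a list file with one valid and one invalid UTF-8 name.
with tempfile.NamedTemporaryFile(delete=False, suffix=".lst") as list_file:
    list_file.write(b"ok.txt\nbad-\xff.txt\n")
    list_path = list_file.name

raw_names = []
try:
    # mode="rb" makes fileinput yield bytes, so nothing is decoded here.
    with fileinput.input(files=[list_path], mode="rb") as lines:
        for line in lines:
            # Remove only the line terminator, not other whitespace.
            raw_names.append(line.rstrip(b"\r\n"))
finally:
    os.unlink(list_path)

# Convert to str for APIs that want text (assumes POSIX semantics).
decoded_names = [os.fsdecode(name) for name in raw_names]
```

Either raw_names or decoded_names can be passed to functions such as open(); os.fsencode() turns the decoded form back into the original bytes.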

Beware: there have been several known Python bugs related to using binary mode with fileinput.

Following the documentation, use an opening hook:

def main():
    for filename in fileinput.input(openhook=fileinput.hook_encoded("utf-8")):
        filename = filename.strip()
        process_file(filename)
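Note that hook_encoded("utf-8") on its own decodes strictly, so invalid bytes would still raise. Since Python 3.6 the hook also accepts an errors argument, which is where surrogateescape can be plugged in. A minimal sketch, using a hypothetical temporary list file as input:

```python
import fileinput
import os
import tempfile

# Hypothetical input: a list file with one valid and one invalid UTF-8 name.
with tempfile.NamedTemporaryFile(delete=False, suffix=".lst") as list_file:
    list_file.write(b"ok.txt\nbad-\xff.txt\n")
    list_path = list_file.name

# The errors argument of hook_encoded requires Python 3.6 or later.
hook = fileinput.hook_encoded("utf-8", errors="surrogateescape")

names = []
try:
    with fileinput.input(files=[list_path], openhook=hook) as lines:
        for line in lines:
            # Each line is a str; invalid bytes appear as lone surrogates.
            names.append(line.rstrip("\n"))
finally:
    os.unlink(list_path)
```

As far as I can tell, the hook only applies to files that fileinput opens itself. When the list arrives on standard input, sys.stdin is used directly, so the binary-mode approach may be the safer choice there.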
