
Reading a file while dropping or replacing specified characters?

I have a large file that contains some NULL characters. I'd like to read this file in Python as if these NULLs weren't there. I could read the entire file into an in-memory string and do a str.replace, but this is inefficient, especially given the file's total size (which can be multiple GBs).

Is there an efficient way to read a file in Python while dynamically dropping certain characters, or replacing them with others?

Open the file in binary mode and read it in chunks of a suitable size. Remove the undesired characters from each chunk and write the resulting bytes to another file opened for writing.
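The chunked approach described above might look roughly like the following. This is only a sketch: the function name `strip_bytes`, its parameters, and the 1 MiB default chunk size are assumptions for illustration, not part of the original answer.

```python
def strip_bytes(src_path, dst_path, unwanted=b"\x00", chunk_size=1024 * 1024):
    """Copy src_path to dst_path, dropping every byte that appears in `unwanted`.

    Hypothetical helper: the name, parameters and default chunk size are
    illustrative assumptions.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            # bytes.translate(None, delete) removes the listed bytes
            # without building an intermediate mapping table.
            dst.write(chunk.translate(None, unwanted))
```

Only one chunk is held in memory at a time, so the file size does not matter. To replace a byte rather than drop it, `chunk.replace(b"\x00", b" ")` could be used in place of `translate`.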

This works for single bytes such as \x00, but it can fail when the characters to drop or replace are multi-byte, as in UTF-8 encoded text, where a single character can take several bytes and a chunk boundary may fall in the middle of one.

This can be solved using codecs.open. The returned file-like object lets you read an approximate number of bytes in the given encoding.
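A minimal sketch of the codecs.open variant, assuming the file is UTF-8 text. The generator name `read_without` and its parameters are hypothetical. The reader decodes incrementally, so a multi-byte character split across two underlying reads is not broken.

```python
import codecs


def read_without(path, unwanted="\x00", encoding="utf-8", chunk_size=1024 * 1024):
    """Yield decoded text chunks from `path` with the `unwanted` characters removed.

    Hypothetical helper: the name, parameters and defaults are illustrative
    assumptions.
    """
    # str.translate drops every character whose code point maps to None.
    table = {ord(ch): None for ch in unwanted}
    with codecs.open(path, "r", encoding=encoding) as f:
        while True:
            chunk = f.read(chunk_size)  # size is approximate, as noted above
            if not chunk:  # an empty string means end of file
                break
            yield chunk.translate(table)
```

A caller can then process the cleaned text piece by piece, for example `for piece in read_without("big.txt"): handle(piece)`, where `handle` stands in for whatever processing is needed. The built-in `open(path, encoding="utf-8")` also decodes incrementally and could be substituted here.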

