
Incrementally Read Large Multipart Zipped Text File in Python

I have a very large zip file that has been split into multiple parts (a split archive), with a single file inside the archive. I do not have enough resources to combine these parts or to extract them (the raw text file is nearly 1 TB).

I would like to parse the text file line by line, ideally using something like this:

import zipfile

# filenames: the parts of the split archive, in order
for zipfilename in filenames:
    with zipfile.ZipFile(zipfilename) as z:
        with z.open(...) as f:
            for line in f:
                print(line)

Is this possible? If so, how can I read the text file:

  1. Without using too much memory (loading the whole file into memory is obviously out of the question)
  2. Without extracting any of the zip files
  3. (Ideally) Without combining the zip files

Thank you in advance for your help.

I'll take a stab.

If your zip files are so-called "split archives" as defined by the Zip file format, you won't be able to read them with either Python's zipfile library or the unzip terminal command.

If, on the other hand, you are dealing with a single zip archive that has been cut into pieces with the split command or a similar byte-splitting tool, you might be able to extract and read its contents on the fly in Python.

You will have to write a custom "file-like" class that implements the seek() and read() methods (and possibly others, such as tell()) and performs them across the split chunks.

seek() will need to work out which chunk contains the requested offset, open that chunk (if it is not the one already open) and perform a seek() on it, using the difference between the requested offset and the offset at which the chunk starts.
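To make the offset arithmetic concrete, here is a minimal sketch of that lookup. The function name locate and the idea of passing the chunk sizes as a list are my own choices for illustration; it assumes the chunks are given in their original order.

```python
import bisect
from itertools import accumulate


def locate(offset, sizes):
    """Map a global byte offset to (chunk_index, offset_within_chunk).

    `sizes` holds the byte length of each chunk, in order (hypothetical input).
    """
    ends = list(accumulate(sizes))  # global offset just past the end of each chunk
    if not ends or not 0 <= offset < ends[-1]:
        raise ValueError("offset is outside the concatenated data")
    index = bisect.bisect_right(ends, offset)  # first chunk that ends after offset
    start = ends[index - 1] if index else 0    # global offset where that chunk begins
    return index, offset - start
```

For three chunks of 10, 10 and 5 bytes, global offset 10 is the first byte of the second chunk and offset 24 is the last byte of the third.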

read() will read from the chunk that is currently open. When it reaches the end of that chunk before the request is satisfied, it will open the next chunk and complete the read from there.

After you write and test this class, it is just a matter of calling the ZipFile constructor and passing it an instance of your class as the "virtual zip" file object to open.
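Putting it together, a rough sketch of such a class might look like the following. The names SplitFile and iter_lines are made up for this example. It assumes the chunks came from a plain byte split (not a true Zip split archive), that they are listed in order, and that the text is UTF-8. I have not run it against anything close to 1 TB, so treat it as a starting point rather than a finished solution.

```python
import bisect
import io
import os
import zipfile


class SplitFile(io.RawIOBase):
    """Read-only, seekable view over byte-split chunks, as if concatenated."""

    def __init__(self, paths):
        super().__init__()
        self._paths = list(paths)
        self._starts = []          # global offset at which each chunk begins
        total = 0
        for path in self._paths:
            self._starts.append(total)
            total += os.path.getsize(path)
        self._size = total
        self._pos = 0              # current global position
        self._index = None         # index of the chunk currently open
        self._handle = None

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_CUR:
            offset += self._pos
        elif whence == io.SEEK_END:
            offset += self._size
        elif whence != io.SEEK_SET:
            raise ValueError("invalid whence")
        if offset < 0:
            raise ValueError("negative seek position")
        self._pos = offset         # the chunk itself is opened lazily on read
        return self._pos

    def _open_chunk(self, index):
        if index != self._index:
            if self._handle is not None:
                self._handle.close()
            self._handle = open(self._paths[index], "rb")
            self._index = index

    def readinto(self, buffer):
        view = memoryview(buffer).cast("B")
        filled = 0
        # Keep going across chunk boundaries until the buffer is full or EOF.
        while filled < len(view) and self._pos < self._size:
            index = bisect.bisect_right(self._starts, self._pos) - 1
            self._open_chunk(index)
            self._handle.seek(self._pos - self._starts[index])
            count = self._handle.readinto(view[filled:])
            if not count:          # chunk is shorter than its recorded size
                break
            filled += count
            self._pos += count
        return filled

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None
        super().close()


def iter_lines(paths, encoding="utf-8"):
    """Yield decoded lines from the first member of the byte-split archive."""
    source = io.BufferedReader(SplitFile(paths))
    with source, zipfile.ZipFile(source) as archive:
        member = archive.namelist()[0]   # the question says there is only one file
        with archive.open(member) as raw:
            for line in io.TextIOWrapper(raw, encoding=encoding):
                yield line
```

With that in place, something like for line in iter_lines(filenames): print(line, end="") should stream the text without extracting or joining anything on disk. Only one chunk is open at a time, and zipfile decompresses the member incrementally, so memory use stays small. Wrapping the object in io.BufferedReader is optional but reduces the number of small reads that reach the chunks.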

