
Read a large zipped text file line by line in Python

I am trying to use the zipfile module to read a file in an archive. The uncompressed file is ~3GB and the compressed file is 200MB. I don't want either of them held in memory while I process the file line by line. So far I have noticed excessive memory use with the following code:

import zipfile
f = open(...)
z = zipfile.ZipFile(f)
# readlines() builds a list of every line, so the whole file ends up in memory
for line in z.open(...).readlines():
    print(line)

I did it in C# using SharpZipLib:

var fStream = File.OpenRead("...");
var unzipper = new ICSharpCode.SharpZipLib.Zip.ZipFile(fStream);
var dataStream =  unzipper.GetInputStream(0);

dataStream is uncompressed. I can't seem to find a way to do it in Python. Help would be appreciated.

Python file objects provide iterators, which read line by line. file.readlines() reads all the lines and returns a list, which means it needs to read everything into memory. The better approach (which should always be preferred over readlines()) is to just loop over the object itself, e.g.:

import zipfile
with zipfile.ZipFile(...) as z:
    with z.open(...) as f:
        for line in f:
            print(line)

Note my use of the with statement: file objects are context managers, and the with statement lets us easily write readable code that ensures files are closed when the block is exited (even upon exceptions). This, again, should always be used when dealing with files.
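As a minimal sketch of the same idea on Python 3: there, the object returned by ZipFile.open() yields bytes, so wrapping it in io.TextIOWrapper gives decoded str lines while still reading the member incrementally. The helper name iter_zip_lines, its parameters, and the UTF-8 default are assumptions made for illustration, not part of the answer above.

```python
import io
import zipfile


def iter_zip_lines(archive_path, member_name, encoding="utf-8"):
    """Yield decoded lines from one archive member, one at a time.

    Hypothetical helper: the name, arguments and default encoding are
    illustrative assumptions.
    """
    with zipfile.ZipFile(archive_path) as archive:
        # ZipFile.open() returns a binary file-like object that
        # decompresses data as it is read.
        with archive.open(member_name) as raw:
            # TextIOWrapper decodes the bytes into str lines.
            with io.TextIOWrapper(raw, encoding=encoding) as text:
                for line in text:
                    yield line.rstrip("\n")
```

Because this is a generator, the caller can write for line in iter_zip_lines(path, name): and only the line currently being processed, plus some internal buffering, needs to be in memory.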

If the inner directory and subdirectory filenames in the zipped file don't matter, you can try this:

from zipfile import ZipFile
from io import TextIOWrapper

def zip_open(filename):
    """Return a text-mode stream for the first member listed in the zip file."""
    with ZipFile(filename) as zipfin:
        for filename in zipfin.namelist():
            return TextIOWrapper(zipfin.open(filename))

# Usage of the zip_open function
with zip_open('myzipball.zip') as fin:
    for line in fin:
        print(line)

zip_open works well when the zip file contains one or more files without subdirectories, although it returns a stream for only the first name in namelist(). I am not sure whether the simple for filename in zipfin.namelist() loop works if the zipped file has a complex subdirectory structure, though.
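One possible way to handle archives with nested directories is sketched below, assuming Python 3.6 or later (for ZipInfo.is_dir()). It walks every entry, skips directory entries, and streams the lines of each remaining member. The name iter_all_lines and the UTF-8 default are assumptions for illustration, and the sketch presumes every member is a text file.

```python
import io
import zipfile


def iter_all_lines(archive_path, encoding="utf-8"):
    """Yield (member_name, line) pairs for every file in the archive.

    Hypothetical helper: assumes all members are text in one encoding.
    """
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            # Directory entries carry no data, so skip them.
            if info.is_dir():
                continue
            # ZipFile.open() accepts a ZipInfo object as well as a name.
            with archive.open(info) as raw:
                with io.TextIOWrapper(raw, encoding=encoding) as text:
                    for line in text:
                        yield info.filename, line.rstrip("\n")
```

Member names keep their full path inside the archive (for example sub/deep/b.txt), so the caller can filter on them if only some files matter.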
