
How to decompress a multi-member .gz file chunk by chunk with Python

I'm trying to decompress a very large .gz file (a commoncrawl web extract) during download, but zlib stops after the first file (the file seems to be many concatenated gzip files).

import requests, json, zlib

fn = "crawl-data/CC-MAIN-2017-04/segments/1484560279933.49/warc/CC-MAIN-20170116095119-00381-ip-10-171-10-70.ec2.internal.warc.gz"
fn = "https://commoncrawl.s3.amazonaws.com/" + fn

r = requests.get(fn, stream=True)
d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 + MAX_WBITS: expect a gzip header

for chunk in r.iter_content(chunk_size=2048):
    if chunk:
        outstr = d.decompress(chunk)
        print(len(chunk), chunk[:10].hex(), len(outstr), len(d.unused_data))

After the first gzip member, all the chunks go to unused_data and are not decompressed; only the first one is.

It works great when piping to zcat:

curl https://commoncrawl.s3... | zcat | ....

You pretty much gave the answer to your own question. You are dealing with a concatenation of gzip streams (which is itself a valid gzip stream), so when you get eof from the decompression object, you need to fire up a new decompressobj for each member, using the unused_data you noted from the last one to start the next one.
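Here is a minimal sketch of that approach, reusing the request and zlib setup from the question. The inner while loop and the d.eof check are one way to restart the decompressor at each member boundary; the processing step is left as a placeholder.

import requests, zlib

fn = "crawl-data/CC-MAIN-2017-04/segments/1484560279933.49/warc/CC-MAIN-20170116095119-00381-ip-10-171-10-70.ec2.internal.warc.gz"
fn = "https://commoncrawl.s3.amazonaws.com/" + fn

r = requests.get(fn, stream=True)
d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 + MAX_WBITS: expect a gzip header

for chunk in r.iter_content(chunk_size=2048):
    if not chunk:
        continue
    data = chunk
    while data:
        outstr = d.decompress(data)
        # ... process outstr here ...
        if d.eof:
            # End of one gzip member: any leftover bytes belong to the
            # next member, so start a fresh decompressor on them.
            data = d.unused_data
            d = zlib.decompressobj(zlib.MAX_WBITS | 16)
        else:
            data = b""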
