
How to get uncompressed size of a > 4GB .gz file in python

So there is this super interesting thread already about getting the original size of a .gz file. Turns out the size one can get from the 4 ending bytes of the file is 'just' there to make sure extraction was successful. However: it's fine to rely on it IF the extracted data size is below 2**32 bytes, i.e. 4 GB.
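For illustration, a minimal sketch of reading that trailing ISIZE field (just the standard os and struct modules; remember the value only holds the size of the last member, modulo 2**32):

import os
import struct

def isize_from_trailer(gz_path):
    # ISIZE: last 4 bytes of a gzip file, little-endian, holding the
    # uncompressed size of the last member modulo 2**32 (RFC 1952)
    with open(gz_path, 'rb') as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack('<I', f.read(4))[0]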

Now IF there are more than 4 GB of uncompressed data, there must be multiple members in the .gz, with the last 4 bytes only indicating the uncompressed size of the last chunk!

So how do we get the ending bytes of the other chunks? Reading the gzip spec, I don't see a length field for the

+=======================+
|...compressed blocks...|
+=======================+

Ok. Must depend on the CM - compression method, which is probably deflate. Let's look at the RFC about it. There on page 11 it says there is a LEN attribute for "Non-compressed blocks", but it gets funky when they describe the compressed ones ...
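For reference, a quick sketch that reads the fixed 10-byte gzip header from RFC 1952 and checks the CM byte (8 means deflate):

import struct

def gzip_compression_method(gz_path):
    # Fixed gzip header (RFC 1952): ID1 ID2 CM FLG MTIME(4 bytes) XFL OS
    with open(gz_path, 'rb') as f:
        id1, id2, cm, flg, mtime, xfl, osb = struct.unpack('<BBBBIBB', f.read(10))
    if id1 != 0x1f or id2 != 0x8b:
        raise ValueError('not a gzip file')
    return cm  # 8 = deflate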

I can imagine something like

import gzip
import os

# get_header_length / get_block_length / get_orig_size are the hypothetical
# helpers I'm missing
full_size = os.path.getsize(gz_path)
gz = gzip.open(gz_path)
pos = 0
size = 0
while True:
    try:
        head_len = get_header_length(gz, pos)
        block_len = get_block_length(gz, pos + head_len)
        size += get_orig_size(gz, pos + head_len + block_len)
        pos += head_len + block_len + 8
    except Exception:
        break
print('uncompressed size of "%s" is: %i bytes' % (gz_path, size))

But how to get_block_length?!? :|

This was probably never intended because ... "stream data". But I don't want to give up now. One big bummer already: even 7-Zip reports, for such a big .gz, exactly the uncompressed size taken from just the very last 4 bytes.

Does someone have another idea?

First off, no, there do not need to be multiple members. There is no limit on the length of a gzip member. If the uncompressed data is more than 4 GB, then the last four bytes simply represent that length modulo 2**32 (e.g. for 5 GiB of uncompressed data, the stored value would be 1 GiB). A gzip file with more than 4 GB of uncompressed data is in fact very likely to be a single member.

Second, the fact that you can have multiple members is true even for small gzip files. The uncompressed data does not need to be more than 4 GB for the last four bytes of the file to be useless.

The only way to reliably determine the amount of uncompressed data in a gzip file is to decompress it. You don't have to write the data out, but you have to process the entire gzip file and count the number of uncompressed bytes.
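A minimal sketch of that approach in Python, streaming the file through the standard gzip module and counting the output bytes (gzip.GzipFile also reads multi-member files back to back):

import gzip

def uncompressed_size(gz_path, chunk_size=1 << 20):
    # Decompress the whole file in chunks and count the bytes;
    # nothing is written out, so memory use stays at one chunk.
    total = 0
    with gzip.open(gz_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total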

I'm coming here to leave an estimation approach for what you are looking for. The good answer is the one given by Mark Adler: the only reliable way to determine the uncompressed size of a gzip file is by actually decompressing it.

But I'm presenting an estimate that will usually give good results, though it can fail at the boundaries. The assumptions are:

  • there is only one stream in the file
  • the beginning of the stream has a compression ratio similar to that of the whole file

The idea is to get the compression ratio of the beginning of the file (get a 1 MB sample, decompress and measure), use it to extrapolate the uncompressed size from the compressed size, and finally, substitute the 32 least significant bits by the size modulo 2**32 obtained from the gzip stream. The caveat comes at the multiples of 4 GiB boundaries, as it could over/underestimate the size and give an estimate displaced by +/-4 GiB.

The code would be:

from io import SEEK_END
import os
import struct
import zlib

def estimate_uncompressed_gz_size(filename):
    # From the input file, get some data:
    # - the 32 LSB from the gzip stream
    # - 1MB sample of compressed data
    # - compressed file size
    with open(filename, "rb") as gz_in:
        sample = gz_in.read(1000000)
        gz_in.seek(-4, SEEK_END)
        lsb = struct.unpack('<I', gz_in.read(4))[0]  # ISIZE is little-endian
        file_size = os.fstat(gz_in.fileno()).st_size

    # Estimate the total size by decompressing the sample to get the
    # compression ratio so we can extrapolate the uncompressed size
    # using the compression ratio and the real file size
    dobj = zlib.decompressobj(31)
    d_sample = dobj.decompress(sample)

    compressed_len = len(sample) - len(dobj.unconsumed_tail)
    decompressed_len = len(d_sample)

    estimate = file_size * decompressed_len // compressed_len  # integer, so the bit mask below works

    # 32 LSB to zero
    mask = ~0xFFFFFFFF

    # Kill the 32 LSB to be substituted by the data read from the file
    adjusted_estimate = (estimate & mask) | lsb

    return adjusted_estimate

A workaround for the stated caveats could be to check the difference between the estimate and the adjusted estimate, and if it is bigger than 2 GiB, add/subtract 4 GiB accordingly (see the sketch below). But in the end, it will always be an estimate, not a reliable number.
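A sketch of that correction, reusing the estimate and adjusted_estimate values computed in the function above:

FOUR_GIB = 1 << 32

def correct_estimate(estimate, adjusted_estimate):
    # If substituting the 32 LSB moved the value more than 2 GiB away from
    # the raw extrapolation, we likely crossed a 4 GiB boundary; shift back.
    if adjusted_estimate - estimate > FOUR_GIB // 2:
        adjusted_estimate -= FOUR_GIB
    elif estimate - adjusted_estimate > FOUR_GIB // 2:
        adjusted_estimate += FOUR_GIB
    return adjusted_estimate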
