
Get uncompressed size of a .gz file in python

Using gzip, tell() returns the offset in the uncompressed file.
In order to show a progress bar, I want to know the original (uncompressed) size of the file.
Is there an easy way to find out?

The uncompressed size is stored in the last 4 bytes of the gzip file as a little-endian unsigned 32-bit integer. We can read those bytes and convert them to an int. (This only works for files under 4 GB.)

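import os
import struct

def getuncompressedsize(filename):
    with open(filename, 'rb') as f:
        # the size field occupies the last 4 bytes of the file
        f.seek(-4, os.SEEK_END)
        # '<I' = little-endian unsigned 32-bit, regardless of host byte order
        return struct.unpack('<I', f.read(4))[0]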

The gzip format specifies a field called ISIZE that:

This contains the size of the original (uncompressed) input data modulo 2^32.

In gzip.py, which I assume is what you're using for gzip support, there is a method called _read_eof defined as such:

def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check the that the computed CRC and size of the
    # uncompressed data matches the stored values.  Note that the size
    # stored is the true file size mod 2**32.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = U32(read32(self.fileobj))   # may exceed 2GB
    if U32(crc32) != U32(self.crc):
        raise IOError, "CRC check failed"
    elif isize != LOWU32(self.size):
        raise IOError, "Incorrect length of data produced"

There you can see that the ISIZE field is being read, but only to compare it to self.size for error detection. This should mean that GzipFile.size stores the actual uncompressed size. However, I don't think it is exposed publicly, so you might have to hack it in to expose it. Not so sure, sorry.

I just looked all of this up right now, and I haven't tried it so I could be wrong. I hope this is of some use to you. Sorry if I misunderstood your question.

Despite what the other answers say, the last four bytes are not a reliable way to get the uncompressed length of a gzip file. First, there may be multiple members in the gzip file, so that would only be the length of the last member. Second, the length may be more than 4 GB, in which case the last four bytes represent the length modulo 2^32, not the length itself.
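Both failure modes are easy to reproduce with the standard library. The following is only an illustrative sketch, not part of the original answer: the helper name trailing_isize is made up here, and the 5 GiB figure is a hypothetical size used for the arithmetic (no such file is created).

```python
import gzip
import struct

def trailing_isize(blob):
    """Return the last 4 bytes of a gzip stream as a little-endian unsigned int."""
    return struct.unpack("<I", blob[-4:])[0]

# Two gzip members concatenated form one valid gzip stream.
first = gzip.compress(b"a" * 1000)
second = gzip.compress(b"b" * 10)
multi = first + second

reported = trailing_isize(multi)       # length of the last member only
actual = len(gzip.decompress(multi))   # all members decompressed

print(reported, actual)  # 10 1010

# For inputs of 4 GiB or more, the stored value wraps around modulo 2**32.
five_gib = 5 * 2**30                   # hypothetical input size
wrapped = five_gib % 2**32
print(wrapped == 2**30)  # True: a 5 GiB input would be reported as 1 GiB
```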

However, for what you want, there is no need to get the uncompressed length. You can instead base your progress bar on the amount of input consumed, as compared to the length of the gzip file, which is readily obtained. For typical homogeneous data, that progress bar would show exactly the same thing as a progress bar based instead on the uncompressed data.
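A minimal sketch of that idea in Python 3 might look like the following. It is an assumption-laden illustration rather than the answerer's code: the function name read_with_progress and the chunk size are invented here. The raw file is opened separately so its position can be queried, and because the gzip reader buffers ahead, the fraction is approximate.

```python
import gzip
import os

def read_with_progress(path, chunk_size=64 * 1024):
    """Yield (chunk, fraction), where fraction is the share of the
    compressed file consumed so far (hypothetical helper)."""
    total = os.path.getsize(path)  # size of the .gz file on disk
    with open(path, "rb") as raw, gzip.GzipFile(fileobj=raw, mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            # raw.tell() is the offset in the compressed file, not in the
            # decompressed data, so it can never exceed the file size.
            fraction = raw.tell() / total if total else 1.0
            yield chunk, min(fraction, 1.0)
```

Each yielded fraction can be passed straight to whatever progress bar is in use, for example by multiplying it by 100 for a percentage.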

The Unix way: run "gunzip -l file.gz" via subprocess.call / os.popen, then capture and parse its output.
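A rough sketch of that approach follows. The function names are hypothetical, and the parser assumes the listing layout printed by GNU gzip (a header line, then a line whose second column is the uncompressed size); other implementations may format the output differently, and whether the reported figure is accurate for very large or multi-member files depends on the gzip version, so that is worth verifying.

```python
import subprocess

def parse_gunzip_list(output):
    """Extract the uncompressed size from `gunzip -l` output.

    Assumes the layout: header line, then
    'compressed uncompressed ratio uncompressed_name'.
    """
    lines = [line for line in output.splitlines() if line.strip()]
    fields = lines[1].split()   # skip the header line
    return int(fields[1])       # second column: uncompressed size

def uncompressed_size_via_gunzip(path):
    """Run `gunzip -l` on one file and parse what it prints."""
    result = subprocess.run(
        ["gunzip", "-l", path],
        capture_output=True,
        text=True,
        check=True,             # raise if gunzip reports an error
    )
    return parse_gunzip_list(result.stdout)
```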

The last 4 bytes of the .gz file hold the original size of the file.

I am not sure about performance, but this could be achieved without knowing gzip magic by using:

import gzip, io

with gzip.open(filepath, 'rb') as file_obj:
    file_size = file_obj.seek(0, io.SEEK_END)

This should also work for other (compressed) stream readers like bz2 or the plain open.

EDIT: as suggested in the comments, 2 in second line was replaced by io.SEEK_END, which is definitely more readable and probably more future-proof.

EDIT: Works only in Python 3.

    f = gzip.open(filename)
    # kludge - report the compressed file position so progress bars
    # don't go to 400%
    f.tell = f.fileobj.tell

Looking at the source for the gzip module, I see that the underlying file object for GzipFile seems to be fileobj. So:

mygzipfile = gzip.GzipFile()
...
mygzipfile.fileobj.tell()

?

Maybe it would be good to do some sanity checking before doing that, like checking that the attribute exists with hasattr.
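One way to sketch that sanity check is shown below. The helper name compressed_offset is made up for illustration, and it relies on fileobj, which is an attribute of GzipFile rather than a documented, stable interface, so it returns None instead of failing when the attribute is missing.

```python
import gzip

def compressed_offset(gz):
    """Return the position in the underlying compressed file,
    or None if it cannot be determined (hypothetical helper)."""
    raw = getattr(gz, "fileobj", None)
    # fileobj may be absent on other reader types, or None after close()
    if raw is None or not hasattr(raw, "tell"):
        return None
    return raw.tell()
```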

Not exactly a public API, but...

GzipFile.size stores the uncompressed size, but it only grows as you read the file, so you should prefer len(fd.read()) over the non-public GzipFile.size.
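Since len(fd.read()) holds the whole decompressed content in memory, a chunked variant may be preferable for large files. The following is a sketch under that assumption; the function name and the 1 MiB chunk size are arbitrary choices, not something from the original answer.

```python
import gzip

def count_uncompressed_bytes(path, chunk_size=1024 * 1024):
    """Decompress the file chunk by chunk and return the total byte count."""
    total = 0
    with gzip.open(path, "rb") as fd:
        while True:
            chunk = fd.read(chunk_size)
            if not chunk:        # empty bytes means end of stream
                break
            total += len(chunk)
    return total
```

This still decompresses everything, so it is slow for large files, but it uses only the public read() call and counts every member of a multi-member file.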

Here is a Python 2 version of @norok's solution:

import gzip, io

with gzip.open("yourfile.gz", "rb") as f:
    # step forward until the reported offset stops growing
    prev, cur = 0, f.seek(1000000, io.SEEK_CUR)
    while prev < cur:
        prev, cur = cur, f.seek(1000000, io.SEEK_CUR)

filesize = cur

Note that, just like f.seek(0, io.SEEK_END), this is slow for large files, but it overcomes the 4 GB size limitation of the faster solutions suggested here.

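import gzip

# open the raw file, not gzip.open(), so the trailer bytes can be read
File = open("input.gz", "rb")
File.seek(-4, 2)
# read32 is an undocumented helper in Python 2's gzip.py
Size = gzip.read32(File)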
