Reading a gzip file backwards

I want to understand the most efficient way (in both speed and memory use) to read a gzip file backwards without loading the contents of the entire file into memory.

Here is what I do currently, but it is not efficient for really large files:

import gzip

file = 'huge_file.log.gz'
if file.endswith('gz'):
    with gzip.open(file) as f:
        # decompress the whole file into memory, then reverse the lines
        reverse_file_list = reversed(f.read().decode('utf-8').split('\n'))

I see there are some solutions on Stack Overflow and ActiveState that use negative seeks, but those rely on cheap seeks relative to the end of the file, which a file opened with gzip.open cannot provide.

Links: Most efficient way to search the last x lines of a file in python

http://code.activestate.com/recipes/439045/

So those solutions fail for what I want to accomplish.

The only solution may be to unpack the file to disk and reverse the line order. It uses twice the disk space, but not memory.

You can accomplish both these steps at once with:

gzip -cd huge_file.log.gz | tac > huge_file.log.reversed

Then you can read and process normally.
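If the shell pipeline is not an option, the decompression step can also be done from Python without holding the file in memory. This is only a minimal sketch of that half of the pipeline (it does not reverse anything); the function name decompress_to_disk and the chunk_size parameter are made up for illustration:

```python
import gzip
import shutil


def decompress_to_disk(gz_path, out_path, chunk_size=1024 * 1024):
    """Stream-decompress gz_path into out_path, one chunk at a time."""
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        # copyfileobj reads and writes chunk_size bytes at a time,
        # so memory use stays bounded regardless of the file size
        shutil.copyfileobj(src, dst, chunk_size)
```

For example, decompress_to_disk('huge_file.log.gz', 'huge_file.log') writes the uncompressed log next to the original.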

There really isn't a good way. The gzip (deflate) compressed data format is inherently serial both in the use of Huffman codes and the use of matching strings in the previous 32K.

If you can't get it all into memory, you will either need to a) decompress it to disk, and reverse it using seeks on the uncompressed form, or b) do one decompression pass through the gzip file creating effectively random access entry points for chunks small enough to keep in memory and then do a second decompression pass backwards, reversing each chunk.

a) can be done with tac, as suggested in @Jud's answer, since tac will create a temporary file on disk to hold the uncompressed contents.
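For reference, reversing with seeks on the uncompressed form can be sketched in Python roughly like this. It assumes the file has already been decompressed to disk and uses '\n' line endings; the name reverse_lines and the block_size parameter are hypothetical:

```python
import os


def reverse_lines(path, block_size=64 * 1024):
    """Yield the lines of an uncompressed file, last line first (bytes, no newline)."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = None  # start of a line whose beginning lies in an earlier block
        while pos > 0:
            read_size = min(block_size, pos)
            pos -= read_size
            f.seek(pos)
            block = f.read(read_size)
            if tail is None:
                # Last block of the file: ignore one trailing newline so it
                # does not show up as an extra empty line
                if block.endswith(b'\n'):
                    block = block[:-1]
            else:
                block += tail
            lines = block.split(b'\n')
            # The first piece may be incomplete; keep it for the next block
            tail = lines.pop(0)
            yield from reversed(lines)
        if tail is not None:
            yield tail  # the first line of the file
```

Only one block plus one partial line is held in memory at a time, so memory use depends on block_size and the longest line, not on the file size.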

b) is complicated, and requires an intimate understanding of the deflate format. It also requires that you save 32K of history for each entry point, either in memory or on disk.
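In Python, a simplified variant of b) is possible without parsing deflate by hand, because zlib decompression objects have a copy() method that snapshots their state (including the history window). The sketch below is not the bit-level entry-point scheme described above, just the same two-pass idea built on copy(). It assumes a single-member gzip file, '\n' line endings, and that the output of one compressed chunk fits in memory; reverse_lines_gz and comp_chunk are made-up names:

```python
import zlib


def reverse_lines_gz(gz_path, comp_chunk=1024 * 1024):
    """Yield the lines of a single-member gzip file, last line first (bytes)."""
    checkpoints = []  # (compressed offset, decompressor snapshot) pairs
    with open(gz_path, 'rb') as f:
        # Pass 1: decompress forward, discarding the output, and snapshot
        # the decompressor state before each compressed chunk.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS: gzip framing
        while not d.eof:
            offset = f.tell()
            chunk = f.read(comp_chunk)
            if not chunk:
                break
            checkpoints.append((offset, d.copy()))
            d.decompress(chunk)

        # Pass 2: walk the checkpoints backwards, re-decompress one chunk
        # at a time, and emit its lines in reverse order.
        tail = None  # start of a line whose beginning lies in an earlier chunk
        for offset, snap in reversed(checkpoints):
            f.seek(offset)
            block = snap.decompress(f.read(comp_chunk))
            if not block:
                continue  # chunk held only header/trailer bytes
            if tail is None:
                # End of the data: ignore one trailing newline
                if block.endswith(b'\n'):
                    block = block[:-1]
            else:
                block += tail
            lines = block.split(b'\n')
            tail = lines.pop(0)  # possibly incomplete, keep for the next chunk
            yield from reversed(lines)
        if tail is not None:
            yield tail  # the first line of the file
```

Each snapshot carries its own copy of the history window, so the number of checkpoints (file size divided by comp_chunk) determines the memory overhead. The file is decompressed twice in total, which is the price of not writing the uncompressed data to disk.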

Unfortunately, you have to parse gz files from the beginning, and parsing one all the way to the end can be time-consuming. I use a list as a buffer: when reverse is True and BSIZE has been reached, it pops the oldest item, so it always holds the last BSIZE matches of the file, in a single pass:

   import gzip
   import re

   BSIZE = 100                    # number of matches to keep
   searchstr = "match in gzfile"
   reverse = True                 # True: keep the last BSIZE matches; False: stop after the first BSIZE
   n = 0
   buffer = []
   # gzf is a *.gz file in the directory files['path']
   with gzip.open(files['path'] + '/' + gzf, 'rt') as f:
       for line in f:
           if re.search(searchstr, line):
               n += 1
               buffer.append(line.strip())
               if n >= BSIZE and not reverse:
                   break
               elif n > BSIZE:
                   buffer.pop(0)  # drop the oldest match
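The same "keep only the last N matches" idea can be written with collections.deque, whose maxlen argument discards the oldest item automatically. A minimal sketch, where the function name last_matches and its parameters are made up for illustration:

```python
import gzip
import re
from collections import deque


def last_matches(gz_path, pattern, max_matches=100):
    """Return up to max_matches matching lines from the end of the file, newest first."""
    regex = re.compile(pattern)
    recent = deque(maxlen=max_matches)  # oldest entries fall off the left end
    with gzip.open(gz_path, 'rt') as f:
        for line in f:
            if regex.search(line):
                recent.append(line.strip())
    return list(reversed(recent))
```

Unlike list.pop(0), which shifts every remaining element, a bounded deque drops the oldest entry in constant time. The file is still read from the beginning, so the running time is dominated by decompression.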
