Python gzip.open .tell() has a linear increasing factor making it slow

Question

Using Python 3.3.5, I have code that looks like this:

with gzip.open(fname, mode='rb') as fh:
    fh.seek(savedPos)
    for line in fh:
        # some work is done
        savedPos = fh.tell()

The work being done on each row is already quite taxing on the system, so I wasn't hoping for great numbers. But I threw in a debug counter and got the following result:

48 rows/sec
28 rows/sec
19 rows/sec
15 rows/sec
13 rows/sec
13 rows/sec
9 rows/sec
10 rows/sec
9 rows/sec
9 rows/sec
8 rows/sec
8 rows/sec
8 rows/sec
8 rows/sec
7 rows/sec
7 rows/sec
7 rows/sec
7 rows/sec
5 rows/sec
...

That told me something was off, so I moved fh.tell() into the debug-counter/timer function, so that fh.tell() only executed once a second, and got a stable 65 rows/sec.

Am I completely off base, or shouldn't fh.tell() be extremely quick? Or is this a side effect of gzip alone?

I used to store the file position manually, but it bugged out occasionally due to different line endings, encoding issues etc., so I figured fh.tell() would be more accurate.

Are there alternatives, or can you speed up fh.tell() somehow?

Answer 1

My experience with zlib (albeit using it from C rather than Python, but I suspect the issue is the same) is that seeking is what is slow. zlib doesn't keep track of where in the file it is, so if you seek, it has to decompress from the beginning in order to count how many uncompressed bytes forward it should move.

In other words, reading or writing sequentially is fine. If you have to seek, you're in for a world of hurt.
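If the goal is only to remember how far you got, one alternative to calling fh.tell() on every row is to count bytes yourself. This is just a sketch, assuming the file is opened in binary mode ('rb', as in the question) so that each line is a bytes object and len(line) is its exact uncompressed length regardless of line endings or text encoding. The helper name iter_lines_with_offset is made up for illustration.

```python
import gzip

def iter_lines_with_offset(path, start=0):
    """Yield (line, offset_after_line) pairs from a gzip file.

    Hypothetical helper: offsets are positions in the *uncompressed*
    stream, the same unit that seek() expects on a gzip file object.
    """
    with gzip.open(path, mode='rb') as fh:
        # The initial seek still has to decompress everything before `start`.
        fh.seek(start)
        pos = start
        for line in fh:
            # Binary mode: len() counts raw bytes, including b'\r\n' or b'\n'.
            pos += len(line)
            yield line, pos
```

The offset yielded with each line can be stored as savedPos and passed back as start on the next run. The one seek at startup is still paid for, but the per-row cost becomes a simple addition.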

Answer 2

I rather doubt that you can expect fh.seek(...) to perform well.

gzip uses a compression algorithm where the way things are compressed depends on the entire history of the data that preceded it. So to have an efficient seek operation, you would also have to restore the internal state of the decoder.

In any case, here is the code for the seek method (lines 435-442):

    elif self.mode == READ:
        if offset < self.offset:
            # for negative seek, rewind and do positive seek
            self.rewind()
        count = offset - self.offset
        for i in xrange(count // 1024):
            self.read(1024)
        self.read(count % 1024)

So seeking is done by just performing read calls, i.e. reading and decompressing the data until it reaches the correct file position, and if you seek backwards it just rewinds and reads forward from the start of the file.
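The excerpt above is Python 2 (note the xrange). To make the cost visible, here is a rough Python 3 sketch of the same read-and-discard idea as a standalone function. It is not the library's code: skip_forward and its chunk_size parameter are names I made up, and the data is an in-memory example.

```python
import gzip
import io

def skip_forward(fh, count, chunk_size=1024):
    """Advance a readable binary stream by `count` bytes by reading and
    discarding data. Returns the number of bytes actually skipped."""
    remaining = count
    while remaining > 0:
        data = fh.read(min(chunk_size, remaining))
        if not data:
            # End of stream reached before `count` bytes were consumed.
            break
        remaining -= len(data)
    return count - remaining

# Usage on an in-memory gzip stream (example data only).
raw = b"0123456789" * 1000
with gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(raw)), mode="rb") as fh:
    skipped = skip_forward(fh, 2500)
    print(skipped, fh.tell())
```

Every skipped byte is decompressed, so the time for a forward seek done this way grows with the distance covered.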

solution1
3 2014-11-19 09:53:32

solution2
3 2014-11-19 10:29:03
