
Python gzip.open .tell() has a linearly increasing cost, making it slow

Using Python 3.3.5, I have code that looks like this:

with gzip.open(fname, mode='rb') as fh:
    fh.seek(savedPos)
    for line in fh:
        # some work is done
        savedPos = fh.tell()

The work being done on each row is already quite taxing on the system, so I wasn't hoping for great numbers. But I threw in a debug counter and got the following result:

48 rows/sec
28 rows/sec
19 rows/sec
15 rows/sec
13 rows/sec
13 rows/sec
9 rows/sec
10 rows/sec
9 rows/sec
9 rows/sec
8 rows/sec
8 rows/sec
8 rows/sec
8 rows/sec
7 rows/sec
7 rows/sec
7 rows/sec
7 rows/sec
5 rows/sec
...

That told me something was off, so I moved the fh.tell() call into the debug-counter/timer function so that it ran only once per second, and got a stable 65 rows/sec.
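The workaround described above can be sketched roughly as follows. This is only an illustration, not the original code: the function name process_rows, the handle callback and the save_interval parameter are made up for the example.

```python
import gzip
import time


def process_rows(path, start_pos=0, handle=lambda line: None, save_interval=1.0):
    """Process lines sequentially, calling fh.tell() at most once per interval.

    Hypothetical helper: `handle` stands in for the per-row work.
    Returns the last saved (uncompressed) position.
    """
    saved_pos = start_pos
    last_save = time.monotonic()
    with gzip.open(path, mode="rb") as fh:
        fh.seek(start_pos)
        for line in fh:
            handle(line)
            now = time.monotonic()
            if now - last_save >= save_interval:
                # Only pay for tell() occasionally instead of on every row.
                saved_pos = fh.tell()
                last_save = now
        # Record the final position once the loop is done.
        saved_pos = fh.tell()
    return saved_pos
```

The trade-off is that after a crash, up to save_interval seconds of rows may have to be reprocessed.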

Am I completely off base, or shouldn't fh.tell() be extremely quick? Or is this a side effect of gzip alone?

I used to store the file position manually, but it bugged out occasionally due to different file endings, encoding issues, etc., so I figured fh.tell() would be more accurate.

Are there alternatives, or can you speed up fh.tell() somehow?

My experience with zlib (albeit using it from C rather than Python, but I suspect the issue is the same) is that seeking is what is slow. zlib doesn't keep track of where in the file it is, so if you seek, it has to decompress from the beginning in order to count how many uncompressed bytes forward it should seek to.

In other words, reading or writing sequentially is fine. If you have to seek, you're in for a world of hurt.

I rather doubt that you can expect fh.seek(...) to perform well.

gzip uses a compression algorithm where the way things are compressed depends on the entire history of the data that preceded it. So to have an efficient seek operation, you would also have to restore the internal state of the decoder.
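To illustrate what "restoring the internal state of the decoder" could look like, here is a minimal sketch using the copy() method of zlib decompression objects, which snapshots the decoder midway through a stream. The sample data and the split point are arbitrary choices for the example, and this is not something gzip.open does for you.

```python
import gzip
import zlib

# Arbitrary sample data, compressed into a gzip stream.
data = b"".join(b"row %d\n" % i for i in range(10000))
compressed = gzip.compress(data)

# wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
decoder = zlib.decompressobj(31)

# Decode the first half of the compressed bytes, then snapshot the state.
split = len(compressed) // 2
head = decoder.decompress(compressed[:split])
snapshot = decoder.copy()

# Continue decoding normally.
tail_a = decoder.decompress(compressed[split:]) + decoder.flush()

# Resume from the snapshot without re-reading the first half.
tail_b = snapshot.decompress(compressed[split:]) + snapshot.flush()
```

Resuming this way requires keeping both the snapshot object and the matching compressed offset (split here), so it only helps within a single process unless the state is rebuilt some other way.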

In any case, here is the code for the seek method (lines 435-442):

    elif self.mode == READ:
        if offset < self.offset:
            # for negative seek, rewind and do positive seek
            self.rewind()
        count = offset - self.offset
        for i in xrange(count // 1024):
            self.read(1024)
        self.read(count % 1024)

So seeking is done simply by making read calls, i.e. reading and decompressing the data until it reaches the requested file position. If you seek backwards, it just rewinds and reads forward again from the start of the file.
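Since sequential reads are the cheap part, one possible alternative to calling fh.tell() on every line is to keep a running byte count yourself. The sketch below assumes the file is opened in binary mode ('rb'), where each line is a bytes object that still includes its line terminator, so len(line) is the exact number of uncompressed bytes consumed regardless of line endings or text encoding. The name iter_with_offsets is invented for the example.

```python
import gzip


def iter_with_offsets(path, start_pos=0):
    """Yield (line, offset_after_line) pairs from a gzip file.

    Hypothetical helper: offsets are positions in the uncompressed
    stream, so they can be passed back in as start_pos later.
    """
    with gzip.open(path, mode="rb") as fh:
        # The initial seek still decompresses everything before start_pos.
        fh.seek(start_pos)
        pos = start_pos
        for line in fh:
            # Bytes lines keep their terminator, so this count is exact.
            pos += len(line)
            yield line, pos
```

A caller would save the yielded offset after each processed row and pass it back as start_pos on the next run. The one-time seek on resume still costs a decompression pass up to that point.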
