How to get random access to a gzip-compressed file

Question

According to this FAQ on zlib.net it is possible to:

access data randomly in a compressed stream

I know about the module Bio.bgzf of Biopython 1.60, which:

supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. This uses Python's zlib library internally, and provides a simple interface like Python's gzip library.

But for my use case I don't want to use that format. Basically I want something that emulates the code below:

import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz','rt') as f:
    f.seek(large_integer_new_line_start)

but with the efficiency that native zlib offers for random access to the compressed stream. How do I leverage that random access capability in Python?

Answer 1

I gave up on doing random access on a gzipped file using Python. Instead I converted my gzipped file to a block gzipped file with a block compression/decompression utility on the command line:

zcat large_file.gz | bgzip > large_file.bgz

Then I used Biopython's tell() to get the virtual_offset of line number 1 million of the bgzipped file. After that I was able to seek rapidly to that virtual_offset:

from Bio import bgzf

path = 'large_file.bgz'

# First pass: skip one million lines, then record where the next line starts.
handle = bgzf.BgzfReader(path)
for i in range(10**6):
    handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()

# Later: jump straight to the recorded virtual offset.
handle = bgzf.BgzfReader(path)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

assert line1 == line2
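For intuition about what the virtual offset encodes: BGZF stores the data as a series of small, independent gzip blocks, and a virtual offset packs the compressed start of a block (upper 48 bits) together with the position inside that block's decompressed data (lower 16 bits). The sketch below imitates the idea with plain gzip members from the standard library. It is not the real BGZF container (no extra header field, no EOF marker), and the names write_blocks and read_line_at are hypothetical helpers invented for this illustration.

```python
import gzip
import io

# Uncompressed bytes per block. In-block offsets must fit in 16 bits.
BLOCK_SIZE = 1 << 16

def make_virtual_offset(block_start, within_block):
    """Pack a compressed block start and an in-block offset into one integer."""
    return (block_start << 16) | within_block

def split_virtual_offset(virtual_offset):
    """Inverse of make_virtual_offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF

def write_blocks(data, out):
    """Write data as independent gzip members; return each member's start offset."""
    starts = []
    for i in range(0, len(data), BLOCK_SIZE):
        starts.append(out.tell())
        out.write(gzip.compress(data[i:i + BLOCK_SIZE]))
    return starts

def read_line_at(raw, virtual_offset):
    """Read one line at virtual_offset, decompressing only from that block onward."""
    block_start, within = split_virtual_offset(virtual_offset)
    raw.seek(block_start)
    # GzipFile starts reading at the current position of the underlying file
    # and continues across member boundaries if the line spans two blocks.
    with gzip.GzipFile(fileobj=raw) as f:
        f.read(within)  # skip to the in-block position
        return f.readline()
```

Because every block is a complete gzip member, the concatenated output is still a valid gzip file that ordinary tools can decompress, while a reader that knows the block starts never has to inflate more than one block before reaching the requested position.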

I would also like to point to the SO answer by Mark Adler here on examples/zran.c in the zlib distribution.

Answer 2

You are looking for dictzip.py, part of the serpento package. However, you have to compress the files with dictzip, which is a randomly seekable, backward-compatible variant of gzip compression.
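To show the general idea behind a seekable deflate stream of this kind, here is a minimal sketch using only Python's zlib module. It compresses in fixed-size chunks and issues a full flush after each one, which byte-aligns the output and resets the compressor's history so that decompression can restart at the next chunk. It is an illustration only, not the dictzip file format: the chunk table lives in a Python list rather than in a gzip header field, the stream is raw deflate, and compress_chunked and read_range are hypothetical names.

```python
import zlib

CHUNK = 32 * 1024  # uncompressed bytes per independently decodable chunk

def compress_chunked(data):
    """Raw-deflate data with a full flush after every chunk.

    Returns (compressed_bytes, offsets), where offsets[i] is the compressed
    position at which chunk i starts.
    """
    comp = zlib.compressobj(level=6, wbits=-15)  # wbits=-15: raw deflate, no header
    out = bytearray()
    offsets = []
    for i in range(0, len(data), CHUNK):
        offsets.append(len(out))
        out += comp.compress(data[i:i + CHUNK])
        # Z_FULL_FLUSH lets a fresh decompressor start right after this point.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), offsets

def read_range(compressed, offsets, start, length):
    """Return data[start:start + length], inflating only the overlapping chunks.

    Assumes length >= 1 and that start lies inside the data.
    """
    first = start // CHUNK
    last = (start + length - 1) // CHUNK
    end = offsets[last + 1] if last + 1 < len(offsets) else len(compressed)
    inflater = zlib.decompressobj(wbits=-15)
    plain = inflater.decompress(compressed[offsets[first]:end])
    skip = start - first * CHUNK
    return plain[skip:skip + length]
```

The trade-off is a slightly worse compression ratio, since each chunk is compressed without reference to the data before it.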

Answer 3

The indexed_gzip program might be what you want. It also uses zran.c under the hood.
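The approach in zran.c is to read an ordinary gzip file once, record access points along the way, and later resume decompression from the nearest access point instead of from the start. The sketch below imitates that idea in pure Python by snapshotting the decompressor with Decompress.copy(). It is not how zran.c or indexed_gzip are implemented: the checkpoints here are live Python objects that exist only in memory and cannot be saved to disk, the file is assumed to contain a single gzip member, and build_index and read_at are hypothetical names.

```python
import bisect
import zlib

SPAN = 1 << 20       # take a checkpoint roughly every 1 MiB of uncompressed output
READ_SIZE = 1 << 16  # compressed bytes fed to the inflater per step

def build_index(path):
    """One full pass over the file.

    Returns a list of (uncompressed_pos, compressed_pos, inflater_snapshot).
    """
    inflater = zlib.decompressobj(wbits=31)  # 16 + 15: gzip header and trailer
    index = [(0, 0, inflater.copy())]
    out_pos = 0
    in_pos = 0
    with open(path, 'rb') as f:
        while not inflater.eof:
            chunk = f.read(READ_SIZE)
            if not chunk:
                break
            # Without max_length, decompress() consumes the whole chunk, so the
            # snapshot below corresponds exactly to compressed position in_pos.
            out_pos += len(inflater.decompress(chunk))
            in_pos += len(chunk)
            if not inflater.eof and out_pos - index[-1][0] >= SPAN:
                index.append((out_pos, in_pos, inflater.copy()))
    return index

def read_at(path, index, offset, length):
    """Return up to `length` uncompressed bytes starting at `offset`."""
    i = bisect.bisect_right([entry[0] for entry in index], offset) - 1
    out_pos, in_pos, snapshot = index[i]
    inflater = snapshot.copy()  # keep the stored checkpoint reusable
    result = bytearray()
    with open(path, 'rb') as f:
        f.seek(in_pos)
        while len(result) < length and not inflater.eof:
            chunk = f.read(READ_SIZE)
            if not chunk:
                break
            data = inflater.decompress(chunk)
            skip = max(0, offset - out_pos)  # drop bytes before the requested offset
            result += data[skip:]
            out_pos += len(data)
    return bytes(result[:length])
```

Building the index still costs one full decompression pass, but every later read only inflates the data between the nearest checkpoint and the requested offset.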

Answer 4

If you just want to access the file from a random point, can't you just do:

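from random import randint

filename = 'large_file.gz'  # example path, taken from the question

# Binary mode: these offsets are raw bytes of the compressed file,
# not positions in the uncompressed data.
with open(filename, 'rb') as f:
    f.seek(0, 2)              # jump to the end to measure the file
    size = f.tell()
    f.seek(randint(0, size))  # absolute seek to a random byte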

4 answers

Answer 1
5 2014-04-14 17:54:12

Answer 2
0 2016-06-29 07:02:07

Answer 3
0 2019-03-09 15:24:36

Answer 4
-4 2014-04-08 23:27:02
