How to obtain random access of a gzip compressed file
According to this FAQ on zlib.net it is possible to:
access data randomly in a compressed stream
I know about the module Bio.bgzf of Biopython 1.60, which:
supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix.
This uses Python's zlib library internally, and provides a simple interface like Python's gzip library.
But for my use case I don't want to use that format. Basically, I want something that emulates the code below:
import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz', 'rt') as f:
    f.seek(large_integer_new_line_start)
but with the efficiency that the native zlib library (zlib.net) offers for random access to the compressed stream. How do I leverage that random access capability in Python?
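For reference, the baseline behaviour the question wants to improve on can be sketched with the standard library alone. gzip file objects do accept seek(), but a forward seek is emulated by decompressing and discarding everything before the target, so the cost grows with the offset. The helper name read_at and the small demo file are hypothetical stand-ins for large_file.gz.

```python
import gzip
import os
import tempfile


def read_at(path, offset, size):
    """Return `size` uncompressed bytes starting at uncompressed `offset`."""
    with gzip.open(path, "rb") as f:
        # Works, but internally decompresses and throws away every byte
        # before `offset`; there is no index to jump with.
        f.seek(offset)
        return f.read(size)


# Hypothetical demo data: 10,000 fixed-width lines of 12 bytes each.
payload = b"".join(b"line %06d\n" % i for i in range(10000))
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(demo_path, "wb") as f:
    f.write(payload)

# Line 5000 starts at uncompressed offset 5000 * 12.
chunk = read_at(demo_path, 5000 * 12, 12)
print(chunk)
```

On a multi-gigabyte file every such call pays for a near-full decompression, which is the inefficiency an index of access points is meant to avoid.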
I gave up on doing random access on a gzipped file using Python. Instead, I converted my gzipped file to a block-gzipped file with a block compression/decompression utility on the command line:
zcat large_file.gz | bgzip > large_file.bgz
Then I used Biopython and its tell() method to get the virtual_offset of line number 1 million in the bgzipped file. Afterwards I was able to rapidly seek to that virtual_offset:
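from Bio import bgzf

file = 'large_file.bgz'

# First pass: read sequentially up to the line of interest and record its position.
handle = bgzf.BgzfReader(file)
for i in range(10**6):
    handle.readline()
virtual_offset = handle.tell()  # a BGZF virtual offset, not a plain byte offset
line1 = handle.readline()
handle.close()

# Later: jump straight to the recorded position without re-reading the file.
handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

assert line1 == line2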
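The value returned by tell() on a BGZF handle is not a byte count into the decompressed data. In the BGZF format a virtual offset packs two numbers into one 64-bit integer: the byte offset of the compressed block's start in the file (upper 48 bits) and the offset within that block's decompressed data (lower 16 bits). Bio.bgzf ships helpers with these names; the versions below are a stand-alone sketch of the arithmetic, not the library's code.

```python
def make_virtual_offset(block_start_offset, within_block_offset):
    """Pack a compressed block start and an in-block offset into one integer."""
    if not 0 <= within_block_offset < 2**16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start_offset < 2**48:
        raise ValueError("block start offset must fit in 48 bits")
    return (block_start_offset << 16) | within_block_offset


def split_virtual_offset(virtual_offset):
    """Inverse of make_virtual_offset: return (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


voffset = make_virtual_offset(100000, 10)
print(voffset)                        # 6553600010
print(split_virtual_offset(voffset))  # (100000, 10)
```

Because each block decompresses to at most 64 KiB, a seek only needs to jump to the block start and decompress that one block, which is why it is fast.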
I would also like to point to the SO answer by Mark Adler on examples/zran.c in the zlib distribution.
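The idea behind zran.c is to make one pass over the stream, remember access points at intervals, and later resume decompression from the nearest one. A rough in-memory approximation in pure Python is possible with the copy() method of zlib decompression objects, which snapshots the decompressor state. This is a sketch under assumptions, not a port of zran.c: the class name CheckpointedGzipReader and its parameters are hypothetical, the whole compressed file is held in memory, the snapshots cannot be saved to disk, and a single-member gzip stream is assumed.

```python
import bisect
import gzip
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # tells zlib to expect a gzip header and trailer


class CheckpointedGzipReader:
    """Snapshot the decompressor roughly every `spacing` uncompressed bytes."""

    def __init__(self, compressed, spacing=1 << 20, chunk_size=16384):
        self._data = compressed
        self._chunk_size = chunk_size
        # Parallel lists: uncompressed offset, compressed offset, snapshot.
        # A snapshot of None means "start from a fresh decompressor".
        self._out_offsets = [0]
        self._in_offsets = [0]
        self._snapshots = [None]
        d = zlib.decompressobj(GZIP_WBITS)
        out_pos = 0
        next_mark = spacing
        for in_pos in range(0, len(compressed), chunk_size):
            out_pos += len(d.decompress(compressed[in_pos:in_pos + chunk_size]))
            if d.eof:
                break
            if out_pos >= next_mark:
                self._out_offsets.append(out_pos)
                self._in_offsets.append(in_pos + chunk_size)
                self._snapshots.append(d.copy())
                next_mark = out_pos + spacing
        self.size = out_pos  # total uncompressed size

    def read_at(self, offset, size):
        """Return up to `size` uncompressed bytes starting at `offset`."""
        # Last checkpoint at or before the requested offset.
        idx = bisect.bisect_right(self._out_offsets, offset) - 1
        snap = self._snapshots[idx]
        d = snap.copy() if snap is not None else zlib.decompressobj(GZIP_WBITS)
        in_pos = self._in_offsets[idx]
        skip = offset - self._out_offsets[idx]
        buf = bytearray()
        while len(buf) < skip + size and in_pos < len(self._data) and not d.eof:
            buf += d.decompress(self._data[in_pos:in_pos + self._chunk_size])
            in_pos += self._chunk_size
        return bytes(buf[skip:skip + size])


# Hypothetical demo data: 200,000 fixed-width lines of 12 bytes each.
payload = b"".join(b"row %07d\n" % i for i in range(200000))
reader = CheckpointedGzipReader(gzip.compress(payload), spacing=256 * 1024,
                                chunk_size=4096)
print(reader.read_at(150000 * 12, 12))
```

Each snapshot carries the decompressor's sliding window, so memory use grows with the number of checkpoints; a larger spacing trades memory for slower seeks. For an index that survives between runs, a tool built on zran.c itself is the better fit.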
The indexed_gzip program might be what you wanted. It also uses zran.c under the hood.
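If you just want to access the file from a random point, can't you just do:
from random import randint
# Note: for a .gz file this is a position in the raw compressed bytes,
# not in the decompressed data.
with open(filename, 'rb') as f:
    f.seek(0, 2)              # jump to the end to measure the file size
    size = f.tell()
    f.seek(randint(0, size))  # whence defaults to 0: offset from the start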