How to obtain random access of a gzip compressed file

According to the FAQ on zlib.net, it is possible to:

access data randomly in a compressed stream

I know about the module Bio.bgzf of Biopython 1.60, which:

supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. This uses Python's zlib library internally, and provides a simple interface like Python's gzip library.

But for my use case I don't want to use that format. Basically, I want something that emulates the code below:

import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz','rt') as f:
    f.seek(large_integer_new_line_start)

but with the efficiency that native zlib offers for random access to the compressed stream. How do I leverage that random access capability in Python?

I gave up on doing random access on a gzipped file using Python. Instead, I converted my gzipped file to a block-gzipped file with a block compression/decompression utility on the command line:

zcat large_file.gz | bgzip > large_file.bgz

Then I used Biopython's tell to get the virtual_offset of line number 1 million of the bgzipped file. After that, I was able to seek rapidly to that virtual_offset:

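from Bio import bgzf

file = 'large_file.bgz'

# First pass: read sequentially up to the line of interest and record its virtual offset.
handle = bgzf.BgzfReader(file)
for i in range(10**6):
    handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()

# Later: reopen the file and jump straight to the recorded virtual offset.
handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

assert line1 == line2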
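
A note on what tell returns above: it is not a plain byte count. In BGZF, a virtual offset packs two numbers into one 64-bit integer: the upper 48 bits give the file offset where a compressed block starts, and the lower 16 bits give the position inside that block once it is decompressed. The standalone sketch below only illustrates that arithmetic; the function names are chosen to resemble Biopython's helpers, but this is not the library's code.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a compressed block start and an in-block offset into one integer."""
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Unpack a virtual offset into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# Illustrative values only: a block starting at byte 100000, 10 bytes into its data.
example = make_virtual_offset(100000, 10)
block_start, within_block = split_virtual_offset(example)
```

Seeking is fast because the reader only has to jump to block_start in the file, decompress that single block, and skip within_block bytes.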

I would also like to point to the SO answer by Mark Adler on examples/zran.c in the zlib distribution.
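
For a rough idea of how an index over a plain gzip stream can work, here is a minimal pure-Python sketch. It is not a port of zran.c: zran.c stores bit offsets and 32 KiB windows, while this sketch swaps in snapshots of a zlib decompressor (via its copy() method) taken at regular intervals. The names build_checkpoints and read_at, the span and chunk sizes, and the sample data are all made up for illustration. It keeps everything in memory and assumes a single-member gzip stream.

```python
import bisect
import gzip
import zlib


def build_checkpoints(raw, span=1 << 16, chunk=1 << 12):
    """Decompress once, snapshotting the decompressor about every `span` output bytes."""
    d = zlib.decompressobj(31)  # wbits=31: expect a gzip header and trailer
    # Each entry: (uncompressed offset, compressed offset, decompressor snapshot)
    checkpoints = [(0, 0, d.copy())]
    out_pos = 0
    for in_pos in range(0, len(raw), chunk):
        out_pos += len(d.decompress(raw[in_pos:in_pos + chunk]))
        if d.eof:
            break
        if out_pos - checkpoints[-1][0] >= span:
            checkpoints.append((out_pos, in_pos + chunk, d.copy()))
    return checkpoints


def read_at(raw, checkpoints, offset, size, chunk=1 << 12):
    """Return `size` uncompressed bytes starting at uncompressed `offset`."""
    # Find the last checkpoint at or before the requested offset.
    i = bisect.bisect_right([c[0] for c in checkpoints], offset) - 1
    out_pos, in_pos, state = checkpoints[i]
    d = state.copy()  # copy so the stored snapshot stays reusable
    skip = offset - out_pos
    buf = bytearray()
    while len(buf) < skip + size and in_pos < len(raw) and not d.eof:
        buf += d.decompress(raw[in_pos:in_pos + chunk])
        in_pos += chunk
    return bytes(buf[skip:skip + size])


# Hypothetical sample data standing in for a large gzip file.
data = b"".join(b"line %d\n" % i for i in range(200000))
raw = gzip.compress(data)
checkpoints = build_checkpoints(raw)
piece = read_at(raw, checkpoints, 1000000, 50)
```

The index still costs one full pass over the file to build, but each later read only decompresses from the nearest checkpoint. A real tool would read from the file with seek instead of holding raw in memory, and would persist the index.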

You are looking for dictzip.py, part of the serpento package. However, you have to compress the files with dictzip, which is a randomly seekable, backward-compatible variant of gzip compression.

The indexed_gzip program might be what you wanted. It also uses zran.c under the hood.

If you just want to access the file from a random point, can't you just do:

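from random import randint

# Caveat: this seeks within the raw compressed bytes, not the decompressed data.
with open(filename, 'rb') as f:
    f.seek(0, 2)              # whence=2: jump to the end to measure the file size
    size = f.tell()
    f.seek(randint(0, size))  # absolute offset from the start of the file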
