Is it possible to parallelize bz2's decompression?
I am using Python's bz2 module to generate (and compress) a large jsonl file (17 GB bzip2-compressed).
However, when I later try to decompress it using pbzip2, it only seems to use one CPU core for decompression, which is quite slow.
When I compress it with pbzip2, it can leverage multiple cores on decompression. Is there a way to compress within Python in the pbzip2-compatible format?
import bz2
import sys
import traceback
from queue import Empty

# ...
compressor = bz2.BZ2Compressor(9)
f = open(path, 'ab')  # compressed output is bytes, so use binary append mode
try:
    while True:
        m = queue.get(True, 1 * 60)
        f.write(compressor.compress(m.encode() + b"\n"))
except Empty:
    pass
except Exception:
    traceback.print_exc()
finally:
    sys.stderr.write("flushing\n")
    f.write(compressor.flush())
    f.close()
A pbzip2 stream is nothing more than the concatenation of multiple bzip2 streams.
An example using the shell:
bzip2 < /usr/share/dict/words > words_x_1.bz2
cat words_x_1.bz2{,,,,,,,,,} > words_x_10.bz2
time bzip2 -d < words_x_10.bz2 > /dev/null
time pbzip2 -d < words_x_10.bz2 > /dev/null
I've never used Python's bz2 module, but it should be easy to close/reopen a stream in append mode, every so-many bytes, to get the same result. Note that if BZ2File is constructed from an existing file-like object, closing the BZ2File will not close the underlying stream (which is what you want here).
I haven't measured how many bytes is optimal for chunking, but I would guess every 1-20 megabytes; it definitely needs to be larger than the bzip2 block size (900 kB), though.
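The approach above can be sketched as follows. This is a minimal illustration, not a tested implementation: the function name and the default chunk size are my own guesses, and it stands in for the asker's queue-driven loop with a plain iterable of lines.

```python
import bz2

# Sketch: end the current bzip2 stream and start a new one every
# chunk_bytes of input, producing a concatenation of bzip2 streams
# that pbzip2 can decompress in parallel.
def write_multistream(lines, path, chunk_bytes=10 * 1024 * 1024):
    with open(path, "wb") as f:
        compressor = bz2.BZ2Compressor(9)
        written = 0
        for line in lines:
            data = line.encode() + b"\n"
            f.write(compressor.compress(data))
            written += len(data)
            if written >= chunk_bytes:
                # flush() finishes the current bzip2 stream; the next
                # compress() call begins a fresh, independent stream.
                f.write(compressor.flush())
                compressor = bz2.BZ2Compressor(9)
                written = 0
        f.write(compressor.flush())
```

Note that Python's own bz2.open/BZ2File read such multi-stream files back transparently (since Python 3.3), so the file stays usable from Python as well.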
Note also that if you record the compressed and uncompressed offsets of each chunk, you can do fairly efficient random access. This is how the dictzip program works, though that is based on gzip.
If you absolutely must use pbzip2 on decompression this won't help you, but the alternative lbzip2 can perform multicore decompression of "normal" .bz2 files, such as those generated by Python's BZ2File or a traditional bzip2 command. This avoids the limitation of pbzip2 you're describing, where it can only achieve parallel decompression if the file was also compressed using pbzip2. See https://lbzip2.org/ .
As a bonus, benchmarks suggest lbzip2 is substantially faster than pbzip2, both on decompression (by 30%) and compression (by 40%), while achieving slightly superior compression ratios. Further, its peak RAM usage is less than 50% of the RAM used by pbzip2. See https://vbtechsupport.com/1614/ .