Is it possible to parallelize bz2's decompression?

I am using Python's bz2 module to generate (and compress) a large jsonl file (17 GB after bzip2 compression).

However, when I later try to decompress it using pbzip2, it seems to use only a single CPU core for decompression, which is quite slow.

When I compress it with pbzip2, it can leverage multiple cores on decompression. Is there a way to compress from within Python in the pbzip2-compatible format?

import bz2, sys, traceback
from Queue import Empty  # Python 2; on Python 3 this is: from queue import Empty
#...
compressor = bz2.BZ2Compressor(9)
f = open(path, 'ab')  # binary mode, since the compressor produces bytes

try:
    while 1:
        m = queue.get(True, 1*60)
        f.write(compressor.compress(m + "\n"))
except Empty:
    pass
except Exception:
    traceback.print_exc()
finally:
    sys.stderr.write("flushing")
    f.write(compressor.flush())
    f.close()

A pbzip2 stream is nothing more than the concatenation of multiple bzip2 streams.

An example using the shell:

bzip2 < /usr/share/dict/words > words_x_1.bz2
cat words_x_1.bz2{,,,,,,,,,} > words_x_10.bz2
time bzip2 -d < words_x_10.bz2 > /dev/null
time pbzip2 -d < words_x_10.bz2 > /dev/null

I've never used Python's bz2 module, but it should be easy to close and reopen a stream in append mode every so-many bytes to get the same result. Note that if BZ2File is constructed from an existing file-like object, closing the BZ2File will not close the underlying stream (which is what you want here).
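
As a rough sketch of that idea, using bz2.BZ2Compressor directly (as in the question's code) rather than reopening a BZ2File; the 5 MB chunk size, the lines iterable, and the function name are illustrative assumptions, not anything from the question:

import bz2

def write_concatenated_bz2(lines, path, chunk_bytes=5 * 1024 * 1024):
    """Write text lines as a series of independent bzip2 streams.

    Whenever roughly chunk_bytes of uncompressed data has been fed in, the
    current stream is flushed and a fresh compressor is started, so the file
    ends up as a plain concatenation of .bz2 streams.
    """
    with open(path, 'wb') as f:
        compressor = bz2.BZ2Compressor(9)
        fed = 0  # uncompressed bytes fed into the current stream
        for line in lines:
            data = (line + "\n").encode("utf-8")
            f.write(compressor.compress(data))
            fed += len(data)
            if fed >= chunk_bytes:
                f.write(compressor.flush())        # finish this stream
                compressor = bz2.BZ2Compressor(9)  # start the next one
                fed = 0
        f.write(compressor.flush())                # finish the last stream

Decompressing a file written this way with pbzip2 -d should then be able to use multiple cores, just like the concatenated file in the shell example above.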

I haven't measured what chunk size is optimal, but I would guess every 1-20 megabytes; it definitely needs to be larger than the bzip2 block size (900k), though.

Note also that if you record the compressed and uncompressed offsets of each chunk, you can do fairly efficient random access. This is how the dictzip program works, though that is based on gzip.
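
Building on the sketch above: if, each time a new compressor is created (including the first one), the writer also appends the pair (uncompressed_bytes_written_so_far, f.tell()) to an index list, that index is enough for cheap random access later. A hypothetical reader assuming that index layout:

import bz2

def read_chunk(path, index, i):
    """Decompress only the i-th bzip2 stream of a concatenated .bz2 file.

    index is a list of (uncompressed_offset, compressed_offset) pairs, one
    entry per stream, recorded while the file was written.
    """
    start = index[i][1]
    end = index[i + 1][1] if i + 1 < len(index) else None
    with open(path, 'rb') as f:
        f.seek(start)
        raw = f.read() if end is None else f.read(end - start)
    return bz2.decompress(raw)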

If you absolutely must use pbzip2 on decompression this won't help you, but the alternative lbzip2 can perform multicore decompression of "normal" .bz2 files, such as those generated by Python's BZ2File or a traditional bzip2 command. This avoids the limitation of pbzip2 you're describing, where it can only achieve parallel decompression if the file is also compressed using pbzip2. See https://lbzip2.org/.

As a bonus, benchmarks suggest lbzip2 is substantially faster than pbzip2, both on decompression (by 30%) and compression (by 40%), while achieving slightly better compression ratios. Further, its peak RAM usage is less than 50% of the RAM used by pbzip2. See https://vbtechsupport.com/1614/.
